Article

Automatic Speech Recognition Advancements for Indigenous Languages of the Americas

ETS of Computer Systems Engineering, Universidad Politécnica de Madrid, 28031 Madrid, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6497; https://doi.org/10.3390/app14156497
Submission received: 24 June 2024 / Revised: 18 July 2024 / Accepted: 24 July 2024 / Published: 25 July 2024
(This article belongs to the Special Issue Computational Linguistics: From Text to Speech Technologies)

Abstract

Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities in the Americas. The Second AmericasNLP Competition Track 1 of NeurIPS 2022 proposed the task of training automatic speech recognition (ASR) systems for five Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa’ikhana. In this paper, we describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods. We systematically investigate, using a Bayesian search, the impact of the different hyperparameters on the Wav2vec2.0 XLS-R variants of 300 M and 1 B parameters. Our findings indicate that data and detailed hyperparameter tuning significantly affect ASR accuracy, but language complexity determines the final result. The Quechua model achieved the lowest character error rate (CER) (12.14), while the Kotiria model, despite having the most extensive dataset during the fine-tuning phase, showed the highest CER (36.59). Conversely, with the smallest dataset, the Guarani model achieved a CER of 15.59, while Bribri and Wa’ikhana obtained CERs of 34.70 and 35.23, respectively. Additionally, Sobol’ sensitivity analysis highlighted the crucial roles of freeze fine-tune updates and dropout rates. We release our best models for each language, marking the first open ASR models for Wa’ikhana and Kotiria. This work opens avenues for future research to advance ASR techniques in preserving minority Indigenous languages.

1. Introduction

Indigenous languages are natural languages that have evolved linguistically in a particular region and are attributed to a specific community [1]. They are a paramount heritage of the evolution of language, representing the identity and culture of local communities, and their cultural contributions constitute an immensely valuable legacy for society [2]. The lexicon and grammar of Indigenous languages contain knowledge about local ecosystems, traditional techniques, spiritual beliefs, and political organization [3]. These languages convey a diverse, rich, and ancient narrative characterized by pluralism, heterogeneity, and depth [2]. They encapsulate the wisdom and accumulated knowledge of generations, often closely tied to the land, natural resources, and local ecological constraints [4]. However, the small number of speakers, the absence, in many cases, of writing traditions, the pressure of dominant languages, and even the dissolution of native communities have led to a continuous and progressive extinction of Indigenous languages during recent decades [5]. Currently, around 7000 languages are spoken worldwide, and 6000 of these are considered Indigenous languages. However, nearly half are endangered, and approximately 1500 are at extreme risk of extinction [6]. Globalization and technological advances in artificial intelligence (AI) have further accelerated the threat to minority languages, because solutions based on natural language processing (NLP) and AI are only available for a few dozen languages [6]. Preserving Indigenous languages safeguards local communities’ heritage and cultural identity and is crucial for retaining critical knowledge associated with environmental interaction and for understanding the social aspects of language evolution [7].
To address the gap between AI solutions for majority and Indigenous languages, the initiative Second AmericasNLP Competition Track 1 of NeurIPS 2022 [8] proposed that participants produce ASR systems for five different Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa’ikhana. Bribri is a language of the Chibchan family, spoken by approximately 7500 speakers in areas of Costa Rica and Panama [9,10,11]. Meanwhile, Guarani is spoken in Paraguay, Argentina, Bolivia, and Brazil and is one of the numerically most significant Amerindian languages, with 6–10 million speakers [11,12,13]. Kotiria (or Wanano) is one of the sixteen languages of the eastern branch of the Tukanoan language family [14]; it is spoken in the northwest of the Amazon region, in the territory between Brazil and Colombia, with approximately 2000 speakers distributed between the two countries [15,16]. Wa’ikhana (or Piratapuyo) is also an eastern Tukanoan language spoken in Brazil and Colombia, with fewer than 2000 speakers [15,17,18]. Finally, Quechua is a heterogeneous language spoken across the Andes, stretching from southern Colombia to northwest Argentina and encompassing a diverse family of languages and dialects [19], with more than 6.5 million speakers [20].
Indigenous languages and, more generally, languages with limited or relatively few available data are denoted in AI and NLP as low-resource languages with a critical level of under-documentation [6]. They pose a significant challenge for developing robust and accurate ASR systems, as current algorithms primarily rely on the availability of substantial amounts of labeled data. Despite the difficulty of the task, ASR for low-resource languages has attracted the attention of the scientific community, which has proposed new approaches to this issue, including speech augmentation techniques [21,22,23], semi-supervised models [24,25,26,27,28], fully unsupervised learning methods [29], transfer learning techniques [30,31,32], and specialized neural network architectures [33], among other solutions.
In this paper, we present our winning approach to the ASR subtask of the AmericasNLP Challenge at NeurIPS 2022 [34]. For this purpose, we trained and optimized an ASR system for each of the following languages: Bribri, Guarani, Kotiria, Wa’ikhana, and Quechua. This marks the first time an ASR model has been developed for the Wa’ikhana and Kotiria languages, and we report the first results for these languages in the literature. We addressed the challenge of limited training data by leveraging a semi-supervised model with subsequent fine-tuning using the Wav2vec2.0 framework and by applying speed augmentation techniques. The training phase involved meticulous model selection based on hyperparameter optimization of the performance metrics. Additionally, we constructed comprehensive n-gram language models from text corpora for decoding, but a greedy search algorithm supplemented with heuristic corrections proved more accurate. Our ASR systems showed an average character error rate (CER) of 26.85, thereby achieving the best solution in the competition.
The rest of the paper is organized as follows: Section 2 reviews previous research carried out in ASR for Indigenous languages. Section 3 details the experimental setup adopted for the architecture, the techniques applied during training, and the dataset creation. Section 4 shows the experimental results for our best models with the most suitable hyperparameters and configurations. Finally, in Section 5, we discuss the results and extract the highlights of this research.

2. Related Work

A limited number of studies focus on machine learning techniques applied to the Indigenous languages of Latin America, and most of these address corpus analysis, NLP applications [35], machine translation (MT) systems [11,36], sentiment analysis [37], or simply outline the challenges of the task [38]. Only recent initiatives have aimed to address the challenge of developing ASR systems [39] for the Indigenous languages of the Americas. This issue is shared with other minority languages worldwide, and recent initiatives have developed ASR models for Adi (India) [40] and Cook Islands Māori [41], among others. For the languages that concern us in this work, there are no previously reported ASR systems for two of them: Wa’ikhana and Kotiria. In the following, we summarize the most significant investigations reported in the literature on ASR for Quechua, Bribri, and Guarani.
The very first ASR system for the Quechua language was published in 2018; it was able to recognize spoken numbers from one to ten with an accuracy greater than 90% [42]. It was based on mel-frequency cepstral coefficient (MFCC) features, dynamic time warping (DTW), and k-nearest neighbor (KNN) classification. Subsequently, the first complete E2E ASR system for Quechua was created, based on a pre-trained model and fine-tuned on a very limited domain of available data; the results were difficult to extrapolate to other domains [43]. The most detailed ASR system for Quechua to date uses the hidden Markov model toolkit (HTK) and evaluates the improvement obtained when using a Gaussian HMM versus a monophone acoustic model [44]. Later research shows that ASR in Quechua can be improved by adding synthetic voices generated by text-to-speech (TTS) to the training set [45]. As with Quechua, there are few reported investigations on ASR for the Guarani language. The first reported ASR system was based on Gaussian mixture models (GMM), HMMs, and n-grams [46] and was trained and tested with a very limited amount of data. In 2020 and 2021, the OpenASR challenge for low-resource languages included Guarani, increasing interest in this language [47,48]. Previous investigations included architectures based on convolutional neural networks (CNN) and factored time delay neural networks (TDNN-F), which significantly improved the previously reported results [49]. A more recent work compared three different pre-trained models fine-tuned with 10 h of labeled Guarani speech data [50]. For Bribri, an automatic forced aligner between text and audio, with acoustic models trained on other languages, was published in 2017 [51]. Initial explorations of ASR for Bribri investigated how vowel–tone separation can improve the accuracy of ASR models when training data are scarce [52]. Finally, in 2023, the first E2E ASR model based on a self-supervised architecture for Bribri was reported during the ASRU 2023 ML-SUPERB Challenge [53].
Integrating ASR with other language processing techniques, such as word embeddings, and adopting a linguistic lens can significantly improve outcomes [54,55,56], but the field took a significant turn with the advent of self-supervised learning models such as Wav2vec2.0 [26,57]. These models capitalize on the power of contrastive learning to transform raw audio data into valuable speech representations, mitigating the need for extensive transcribed data, a common challenge in Indigenous language contexts. The path forward led to exploring Wav2vec2.0’s potential in low-resource language scenarios [27], where it demonstrated remarkable adaptability and significant potential for these languages [58,59] and for other low-resource domains [60,61]. More recently, weak supervision has emerged as a viable approach, leveraging large-scale, weakly supervised learning to harness abundant yet imperfectly transcribed public data [62].
It can be concluded that, although ASR for Indigenous languages has recently received some attention from the scientific community, there are no standardized benchmarks for the languages under study and no reported results at all for two of them. In this work, we improve this situation by reporting the best benchmarks for the languages studied in the NeurIPS competition and by sharing with the community both the details of the many training configurations we experimented with and the trained models.

3. Experimental Setup and Dataset Description

3.1. Main Architecture and Pre-Trained Models

The Wav2Vec 2.0 architecture, illustrated in Figure 1, constitutes a robust framework for ASR tasks. It comprises three core components: a CNN-based encoder network, a transformer-based context network, and a vector quantization module. These components work in tandem to transform raw audio samples, denoted as x_i ∈ X, into latent speech representations (z_1, z_2, …, z_T) [26]. First, the encoder network, denoted as f : X → Z, plays a pivotal role in this architecture. It consists of seven sequential blocks of temporal convolution layers, each equipped with 512 channels, strides of (5, 2, 2, 2, 2, 2, 2), and kernel sizes of (10, 3, 3, 3, 3, 2, 2). This configuration allows the encoder to compress approximately 25 milliseconds of 16 kHz audio data into a latent representation every 20 milliseconds. Following each convolution layer, layer normalization and Gaussian error linear unit (GELU) activation are applied to enhance feature extraction. The context network, represented as g : Z → C, takes these latent representations (z_1, …, z_T) and builds context representations (c_i) that encapsulate the contextual information across the entire sequence of latent speech representations. Here, we leverage two different model sizes: one with 300 million (300 M) parameters and the other with 1 billion (1 B) parameters. The context network comprises 24 blocks (48 for the bigger model), each with a model dimension of 1024, an inner dimension of 4096, and 16 attention heads. This configuration enables it to capture intricate temporal and contextual dependencies within the audio data [26].
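To make the temporal resolution concrete, the short sketch below (a simple calculation, not the authors' code) derives the roughly 20 ms frame hop and 25 ms receptive field directly from the encoder strides and kernel sizes listed above.

```python
# Verify the frame rate and receptive field implied by the encoder's
# strides and kernel sizes (a back-of-the-envelope check, not training code).
strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]

# Total hop: product of strides -> samples between consecutive latent frames.
hop = 1
for s in strides:
    hop *= s                      # 5 * 2^6 = 320 samples

# Receptive field: raw-audio samples seen by a single latent frame.
rf, jump = 1, 1
for k, s in zip(kernels, strides):
    rf += (k - 1) * jump
    jump *= s

sr = 16_000                       # 16 kHz input audio
print(f"hop = {hop} samples -> {1000 * hop / sr:.1f} ms per latent frame")   # ~20 ms
print(f"rf  = {rf} samples  -> {1000 * rf / sr:.1f} ms receptive field")     # ~25 ms
```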
In the fine-tuning phase of our study, we evaluated two versions of the XLS-R (cross-lingual speech recognition) model framework: XLS-R-300M and XLS-R-1B [63]. The numbers refer to the quantity of trainable parameters that each model possesses. These models were initially pre-trained on a vast dataset consisting of 436,000 h of publicly available multilingual speech data, in 128 languages, sourced from various repositories, including CommonVoice [64], Babel [65], Multilingual Librispeech (MLS) [66], VoxPopuli [67], and the VoxLingua107 [68] datasets. During the fine-tuning phase, we assessed the performance of the XLS-R-300M and XLS-R-1B models for each Indigenous language separately, utilizing the available training data for each language.
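As a concrete illustration, the sketch below shows one way the two checkpoints can be loaded for CTC fine-tuning. It assumes the publicly available Hugging Face ports of XLS-R rather than the authors' exact training recipe, and the vocabulary size is a hypothetical placeholder for a language-specific character inventory.

```python
# A minimal sketch (assuming the Hugging Face ports of XLS-R) of preparing the
# 300 M and 1 B checkpoints for CTC fine-tuning on one target language.
from transformers import Wav2Vec2ForCTC

for ckpt in ("facebook/wav2vec2-xls-r-300m", "facebook/wav2vec2-xls-r-1b"):
    model = Wav2Vec2ForCTC.from_pretrained(
        ckpt,
        ctc_loss_reduction="mean",
        pad_token_id=0,          # must match the language-specific tokenizer
        vocab_size=40,           # hypothetical character inventory of the target language
    )
    model.freeze_feature_encoder()   # keep the CNN feature extractor fixed during fine-tuning
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{ckpt}: {n_params / 1e6:.0f} M parameters")
```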

3.2. Data

For this work, we have used transcribed speech data collected from the following two sources: (i) the primary dataset provided by the organization of the AmericasNLP challenge and (ii) a corpus collected from other publicly available sources. The NeurIPS competition advocated using ancillary external resources to complement the primary dataset during the training, but imposed usage restrictions on some databases. A summarized description of the train-dev-test database is depicted in Table 1.
The database published by the competition included 0.72 h from The Pandialectal Corpus of the Bribri Language [69], 0.46 h from the Guarani Common Voice Mozilla database [64], 3.49 h from the Endangered Languages Archive (ELAR) (dk0137 and dk0491) for Kotiria [70,71], 1.8 h from ELAR for Wa’ikhana [71], and 3.95 h from the Siminchik dataset for Quechua [72]. All of these databases were subject to restrictions during the competition; data from these sources could only be used as set out in the rules provided by the organizers. In addition to the primary dataset, we enhanced the training process by collecting supplementary speech data. Specifically, we amassed 1.14 h of transcribed speech for the Bribri language [73], 21.8 h for Kotiria [74], and 7.04 h for Quechua [75]. No external data were obtained for Guarani or Wa’ikhana.
To augment the diversity and variability of the assembled speech audio, we applied offline speed augmentation techniques [21] to the primary database. Specifically, we augmented the audio at 0.9× and 1.1× speed rates, effectively widening the spectrum of speech patterns encountered during training. This augmentation technique strengthens ASR models and enhances their adaptability to varied speech styles and tempo changes [21]. In summary, the cumulative duration of recordings in the training dataset varied across languages as follows (see Table 1): 2.61 h for Bribri, less than one hour for Guarani, 29.87 h for Kotiria, 4.35 h for Wa’ikhana, and 12.05 h for Quechua.
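For illustration, the sketch below shows one way the 0.9× and 1.1× speed perturbation can be produced offline. It assumes torchaudio and a hypothetical 16 kHz input file; it is not necessarily the authors' exact pipeline.

```python
# A minimal sketch of offline speed perturbation at 0.9x and 1.1x, as in Ko et al. [21].
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("utterance.wav")   # hypothetical 16 kHz mono input

for factor in (0.9, 1.1):
    # Resampling the signal and reinterpreting it at the original rate changes
    # both tempo and pitch, i.e. classic speed perturbation (0.9 -> slower/longer).
    perturbed = F.resample(waveform, orig_freq=sr, new_freq=int(round(sr / factor)))
    torchaudio.save(f"utterance_sp{factor}.wav", perturbed, sr)
```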

3.3. Decoding Strategy and Language Models

Different strategies were considered during decoding: greedy decoding and beam search decoding with an n-gram language model. For the beam search, 3-gram and 4-gram KenLM [76] models were constructed for each language. For each configuration, two language models were trained depending on the data: one included only the training transcriptions, and the other combined the primary transcriptions with text data from other corpora. Moreover, we constructed extensive language models for each target language by crawling text corpora spanning diverse sources such as speech transcriptions, online texts, and books. The size of the text corpora collected for training the language models was relatively modest, with fewer than 100k words secured for each language. The beam search hyperparameters α and β were selected on the dev set over 50 trials of Bayesian optimization, with the beam width fixed at 128.
However, due to the lack of a standard normalization of the transcriptions and the low amount of data, this optimization did not lead to significant improvements and even degraded performance for some languages. Therefore, the final decoding strategy was based only on greedy search plus heuristic corrections for textual errors such as capitalization, punctuation, and repeated spaces or letters.
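For illustration, a minimal sketch of this kind of post-processing is shown below; the specific rules are an assumption for demonstration purposes, not the authors' exact implementation.

```python
# A sketch of heuristic post-processing applied after greedy CTC decoding:
# normalize case, drop punctuation, collapse repeated whitespace, and reduce
# long runs of a repeated letter.
import re

def clean_hypothesis(text: str) -> str:
    text = text.lower().strip()                    # normalize capitalization
    text = re.sub(r"[^\w\s']", "", text)           # drop punctuation (apostrophes kept)
    text = re.sub(r"\s{2,}", " ", text)            # reduce multiple spaces
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)    # cap letter runs at two repetitions
    return text

print(clean_hypothesis("  Yaaaachay   wasi,, "))   # -> "yaachay wasi"
```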

3.4. Hyperparameter Fine-Tuning

We delved deeper into the experimental findings to understand the contribution of the different hyperparameters to the performance of the models, focusing on the two variants of the Wav2vec2.0 XLS-R model: 300 M and 1 B parameters. We studied and analyzed the impact of the most important hyperparameters: learning rate, maximum number of updates, freeze fine-tune updates, activation dropout, mask probability, and mask channel probability. Hyperparameters serve as control parameters that govern various facets of the training process, thereby exerting a profound influence on model performance. However, while the best hyperparameters are usually published, it is uncommon to find an exhaustive study dissecting their individual contributions and their complex interactions.
Our methodology for hyperparameter optimization revolves around an exhaustive Bayesian search [77] with the help of the Optuna hyperparameter optimization framework [78]. We traverse a wide spectrum of hyperparameter configurations (see Table 2), assessing model performance at each juncture in order to determine the most effective parameter settings for our specific task and model variations. Finally, two model checkpoints are considered during the test phase: the one with the lowest loss and the one with the lowest word error rate (WER) during training.
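The sketch below illustrates such a search over the ranges in Table 2 with Optuna; the fine_tune_and_eval wrapper is a hypothetical stand-in for launching a full fine-tuning run and returning the dev-set WER, not part of any released codebase.

```python
# A minimal sketch of the Bayesian hyperparameter search with Optuna,
# over the ranges listed in Table 2.
import optuna

def objective(trial: optuna.Trial) -> float:
    cfg = {
        "learning_rate":           trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True),
        "max_updates":             trial.suggest_int("max_updates", 10_000, 100_000),
        "freeze_finetune_updates": trial.suggest_int("freeze_finetune_updates", 0, 50_000),
        "activation_dropout":      trial.suggest_float("activation_dropout", 0.01, 0.2),
        "mask_prob":               trial.suggest_float("mask_prob", 0.2, 0.7),
        "mask_channel_prob":       trial.suggest_float("mask_channel_prob", 0.1, 0.7),
    }
    return fine_tune_and_eval(cfg)   # hypothetical wrapper: runs fine-tuning, returns dev WER

study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```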

3.5. Sobol’ Sensitivity Analysis for Hyperparameter Explanation

Sobol’ sensitivity analysis is a relatively recent technique used to estimate the influence of individual variables on the output of complex mathematical models [79]. While it has been widely adopted across diverse scientific disciplines, ranging from epidemiology [79] to ecology [80,81], it has only recently been introduced to the study of complex AI models [82,83]. We have utilized Sobol’ sensitivity analysis to assess the impact of the various hyperparameters on the performance of our models for Quechua, Kotiria, Bribri, Guarani, and Wa’ikhana, producing valuable new insights in the realm of ASR.
Sobol’ sensitivity analysis measures how much of the variance in the model output can be attributed to specific inputs or sets of inputs. This approach considers only the inputs and outputs of the system under study, treating the function transformation as a black box. It provides two essential indices: the first-order (S1) index, which quantifies the individual contribution of each input parameter to the output variance, and the total-order (ST) index, which accounts not only for each parameter but also for its interactions with all other parameters in the model [79,84].
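As an illustration of how these indices can be estimated in practice, the sketch below uses SALib over the same six hyperparameters. The surrogate_wer function is a hypothetical predictor of WER (for example, a regressor fitted on the recorded fine-tuning trials); this is a sketch of the general procedure, not the authors' exact pipeline.

```python
# A minimal sketch of estimating first-order (S1) and total-order (ST)
# Sobol' indices with SALib, given a WER predictor over the hyperparameter space.
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 6,
    "names": ["learning_rate", "max_updates", "freeze_finetune_updates",
              "mask_prob", "mask_channel_prob", "activation_dropout"],
    "bounds": [[1e-6, 1e-3], [1e4, 1e5], [0, 5e4], [0.2, 0.7], [0.1, 0.7], [0.01, 0.2]],
}

X = saltelli.sample(problem, 1024)               # quasi-random Saltelli design
Y = np.array([surrogate_wer(x) for x in X])      # hypothetical WER predictor

Si = sobol.analyze(problem, Y)
for name, s1, st in zip(problem["names"], Si["S1"], Si["ST"]):
    print(f"{name:<25} S1={s1:.2f}  ST={st:.2f}")
```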

4. Results

In this section, we delve into the outcomes of our investigation, where we explore the intricate interplay between hyperparameter configurations and the performance of diverse language models. The ensuing analysis, encapsulated in Table 3, summarizes the best hyperparameter configuration obtained for each language.
The Quechua language model, benefiting from a substantial training dataset of 12.09 h, achieved the best performance, with a WER of 48.98% and a CER of 12.14%. This performance could be influenced not only by the amount of data but also by its quality and diversity. However, our investigation also unveiled a counter-intuitive observation: the Kotiria language model, with the most extensive dataset of 29.92 h (see Table 1), showed the highest WER (79.69%) and CER (36.59%). This discrepancy could arise from language complexity, data quality, or diversity. Remarkably, the Guarani language model, trained on the smallest dataset (0.97 h), achieved a WER of 62.91% and a CER of 15.59%. The notable effectiveness of meticulous hyperparameter tuning, even with limited data, highlights the importance of this process. The precise adaptation of hyperparameters can partly compensate for data scarcity, optimizing model performance. This trend extended to the Bribri and Wa’ikhana language models, which exhibited intermediate performance levels despite varying dataset sizes (2.38 and 6.11 h, respectively). These outcomes, as shown in Table 3, underscore the pivotal role of hyperparameter tuning in determining model performance. Additional configurations and results can be found in the Supplementary Information.
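For reference, the sketch below shows how WER and CER values of this kind can be computed. It assumes the jiwer package and hypothetical reference/hypothesis strings; the official challenge scoring script may differ in normalization details.

```python
# A minimal sketch of computing WER and CER from reference and hypothesis transcriptions.
import jiwer

refs = ["allinllachu kachkanki"]          # hypothetical Quechua reference
hyps = ["allinllachu kachkanky"]          # hypothetical model output

wer = jiwer.wer(refs, hyps) * 100         # word error rate, in percent
cer = jiwer.cer(refs, hyps) * 100         # character error rate, in percent
print(f"WER = {wer:.2f}%, CER = {cer:.2f}%")
```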
In addition, our research examined two variants of the Wav2vec2.0 XLS-R model: one with 300 million parameters and another with 1 billion parameters. Despite the difference in parameter count, both models yielded comparable results, except for Kotiria, where the bigger model showed significantly better performance. This suggests that the choice between these two models may be influenced by resource constraints, where the smaller model may be preferable for applications with limited resources, while the larger model could offer slightly improved performance when more data are available.
Analyzing the hyperparameters across the different languages, it is noticeable that the learning rate is consistently set at either 1 × 10⁻⁵ or 1 × 10⁻⁴, although the search range was wider (see Table 2). This low learning rate aligns with previously reported investigations [26,29,61], allowing precise weight adjustments during the fine-tuning phase. The learning rate is closely related to the maximum number of updates. In this case, the total number of updates for which the models train varies widely, from 40k to 130k. However, no apparent simple relationship exists between the number of updates, the optimization step size, and the available training data illustrated in Figure 2. This suggests that different amounts of training are required depending on the complexity and uniqueness of each language. The freeze fine-tune updates value, denoting the number of updates during which specific model layers are kept frozen and not updated, seems to increase with the maximum number of updates for the Quechua, Bribri, and Guarani languages. However, this pattern does not hold for the Wa’ikhana language, which combines a lower freezing point with more updates. The mask channel values hover between 0.25 and 0.5, implying the significance of masking specific inputs during training, potentially reducing overfitting. Finally, activation dropout rates are consistently low across the models, between 0.1 and 0.2, suggesting a regularization strategy that prevents overfitting while maintaining the capability to learn complex patterns.
Specific hyperparameters did not correlate directly with lower WER or loss values. This suggests that the optimal hyperparameters are highly dependent on language-specific characteristics. Nonetheless, it is worth noting that, despite similar learning rates, mask channel probabilities, and activation dropout rates, the Guarani language model achieved notably lower WER and loss values, hinting at the influence of factors like dataset quality, model architecture, or inherent language characteristics.
The Sobol’ sensitivity analysis reveals that the number of fine-tuning updates and the activation dropout rate are critical for model performance. This indicates that careful attention to these hyperparameters can significantly improve results. Interestingly, the low impact of the maximum number of updates suggests that other factors, such as layer freezing and regularization, are more influential in data-limited contexts.
Table 4 presents the outcomes of the Sobol’ sensitivity analysis, offering insights into the relative significance of each hyperparameter in influencing the WER of the model. The analysis covers six key hyperparameters: learning rate, maximum number of updates, freeze fine-tune updates, mask probability, mask channel probability, and activation dropout. The first-order sensitivity (S1) index measures the individual contribution of a parameter to the total output variance, while the total sensitivity (ST) index reflects the total contribution of a parameter, including its interactions with the other hyperparameters, to the final output. Interestingly, the hyperparameters with the most substantial individual influence on the WER (S1), e.g., learning rate (S1 = 0.57) and mask channel probability (S1 = 0.54), have a much smaller impact when their interactions with other hyperparameters are considered (ST = 0.26 and ST = 0.28, respectively). Notably, the hyperparameter with the highest ST is freeze fine-tune updates, suggesting that a poor choice could lead either to forgetting the information stored in the pre-trained layers or to failing to adapt the pre-trained network to the current domain. On the other hand, and unexpectedly, the maximum number of updates during training appears not to be very sensitive, although it is a parameter to which the research community usually pays more attention. Finally, activation dropout makes the second most important contribution to the output variance in terms of ST, which is somewhat expected, as it controls a sensitive trade-off: it prevents overfitting but can, in turn, make training difficult when not properly adjusted.

5. Conclusions and Future Work

In this work, we have fine-tuned and reported the very first ASR models for the Wa’ikhana and Kotiria languages and have additionally established new benchmarks on the dataset of the AmericasNLP Challenge 2022 for Guarani, Quechua, and Bribri, marking a significant step forward in this field. These results represent an important contribution to the study of ASR for Indigenous languages, a field in which only a few dozen studies have been reported for these languages. Additionally, we have published our best models in a repository and the full hyperparameter experiments in the Supplementary Information (SI).
We show that pre-trained models are very promising when fine-tuned to an unseen domain with a very small amount of data, but that bigger architectures do not always achieve better results. This may be because, when the amount of data is very small and also very far from the pre-training domain, it is more difficult to converge to an optimal solution if the number of trainable parameters is too large. This would imply that, for some corner cases and minority languages, the race towards architectures with an ever-increasing number of parameters may not be the most effective path. Surprisingly, we did not find a clear relationship between the number of hours available for each language and the accuracy of the fine-tuned models. This may be related to the phonemic distance between the pre-training languages and the fine-tuned domains, which may have made the quantized latent pre-trained representations not equally suitable for all the target languages.
The application of Sobol’ sensitivity analysis has allowed us to identify critical hyperparameters that impact model performance in ASR tasks. These insights inform future hyperparameter tuning efforts, enhancing the accuracy and efficiency of ASR systems for under-resourced languages. We found that, when internal correlations are considered, the number of updates during which the pre-trained layers of the model are kept frozen is significantly more important than the total number of training updates. This indicates that, when only a small amount of data is available for fine-tuning, the greatest effort should go into efficiently adjusting the weights of the last layers of the network rather than adjusting weights across the whole architecture. The second most important factor unveiled by the Sobol’ analysis is the dropout rate, which shows that, when the number of trainable parameters is so large and the amount of data so scarce, the risk of overfitting is very high. This comprehensive evaluation of hyperparameters provides a valuable tool for optimizing ASR models, which is particularly crucial for languages with limited resources and data availability.
While our proposed ASR systems demonstrate significant advancements for the Indigenous languages of the Americas, several limitations remain. First, the relatively small size of the training datasets limits the models’ ability to generalize across diverse speech patterns and dialects. Additionally, the quality and variety of the data used for training and evaluation can impact performance, as some languages had fewer resources available. Another limitation is the dependency on pre-trained models, which may not fully capture the unique phonetic and grammatical structures of these languages. Future work should focus on expanding the datasets with more diverse and representative samples, improving data collection methodologies, using transfer learning and knowledge distillation, and exploring different model architectures, such as conformer-based ones, that can better handle the linguistic complexities of Indigenous languages [85]. Bimodal or multimodal recognition could also be particularly beneficial for low-resource-language ASR systems [86]. Moreover, techniques successfully used in ASR for individuals with speech disorders could benefit the domain of Indigenous languages, wherein many of the phonemes are unseen during the pre-training phase and the data size is small [87]. Furthermore, incorporating community feedback and collaboration will be essential to ensure these ASR systems are accurate, culturally appropriate, and beneficial for the preservation and revitalization of these languages.
To conclude, we hope that this work paves the way for new research avenues in ASR for Indigenous languages and minority domains in general. Future research could explore advanced methods, such as automated machine learning (AutoML), for a more exhaustive exploration of the hyperparameter space, potentially resulting in more optimized models. Furthermore, advanced regularization techniques could mitigate model complexity and overfitting, especially for languages with limited data. Collecting more data for under-resourced languages is important, but the emphasis should be on capturing linguistic intricacies, dialectal variations, and cultural contexts to train more robust and culturally sensitive models. Exploring the potential benefits of transfer learning from high-resource languages to less-resourced ones is also an avenue worth pursuing: cross-linguistic knowledge transfer could help mitigate the challenges posed by data scarcity and enhance model performance for underrepresented languages. These and other tasks will help to democratize access to AI speech models for the thousands of languages spoken around the world.

6. Data and Models Accessibility

The data utilized for fine-tuning, provided by the AmericasNLP challenge, are accessible online [88]. The results obtained by performing experiments with different hyperparameters and configurations are available in the Supplementary Information (SI). The fine-tuned models for each language, including Quechua, Bribri, Kotiria, Guarani, and Wa’ikhana, are also accessible online [88].
Scripts prepared for downloading the fine-tuned models and performing inference are now available at https://github.com/monirome/asr-indigenous-languages (accessed on 12 September 2023) [88].
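Once a checkpoint has been downloaded, inference follows the standard Wav2Vec2 CTC pattern. The sketch below is an illustration only: the checkpoint path and audio file are placeholders, and it assumes a Hugging Face-format model rather than the repository's exact scripts.

```python
# A minimal sketch of greedy-decoding inference with a fine-tuned Wav2Vec2 CTC model.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

ckpt = "path/to/downloaded/quechua-xls-r-300m"      # hypothetical local checkpoint
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt).eval()

waveform, sr = torchaudio.load("example.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)   # model expects 16 kHz

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]   # greedy decoding
print(transcription)
```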

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14156497/s1. Table S1: Detailed fine-tuning results for the Quechua ASR model, illustrating WER variations based on hyperparameter configuration. Table S2: Detailed fine-tuning results for the Kotiria ASR model, illustrating WER variations based on hyperparameter configuration. Table S3: Detailed fine-tuning results for the Bribri ASR model, illustrating WER variations based on hyperparameter configuration. Table S4: Detailed fine-tuning results for the Guarani ASR model, illustrating WER variations based on hyperparameter configuration. Table S5: Detailed fine-tuning results for the Wa’ikhana ASR model, illustrating WER variations based on hyperparameter configuration.

Author Contributions

Conceptualization, I.G.T.; Methodology, I.G.T.; Software, M.R.; Investigation, M.R. and I.G.T.; Resources, M.R.; Data curation, M.R.; Writing—original draft, M.R. and I.G.T.; Writing—review & editing, M.R., S.G.-C. and I.G.T.; Supervision, S.G.-C. and I.G.T.; Funding acquisition, S.G.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Materials, further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors gratefully acknowledge the Universidad Politécnica de Madrid for providing computing resources on the Magerit Supercomputer.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Thiede, B.C.; Gray, C. Characterizing the indigenous forest peoples of Latin America: Results from census data. World Dev. 2020, 125, 104685. [Google Scholar] [CrossRef] [PubMed]
  2. UNESCO. How Can Latin American and Caribbean Indigenous Languages Be Preserved? 2021. Available online: https://unesdoc.unesco.org/ark:/48223/pf0000387186 (accessed on 2 July 2023).
  3. McQuown, N.A. The indigenous languages of Latin America. Am. Anthropol. 1955, 57, 501–570. [Google Scholar] [CrossRef]
  4. Chiblow, S.; Meighan, P.J. Language is land, land is language: The importance of Indigenous languages. Hum. Geogr. 2022, 15, 206–210. [Google Scholar] [CrossRef]
  5. UNESCO. Indigenous Languages: Gateways to the World. 2022. Available online: https://www.unesco.org/en/articles/cutting-edge-indigenous-languages-gateways-worlds-cultural-diversity (accessed on 2 July 2023).
  6. Bromham, L.; Dinnage, R.; Hua, X.; et al. Global predictors of language endangerment and the future of linguistic diversity. Nat. Ecol. Evol. 2022, 6, 163–173. [CrossRef]
  7. Ferguson, J.; Weaselboy, M. Indigenous sustainable relations: Considering land in language and language in land. Curr. Opin. Environ. Sustain. 2020, 43, 1–7. [Google Scholar] [CrossRef]
  8. Mager, M.; Kann, K.; Ebrahimi, A.; Oncevay, F.; Zevallos, R.; Wiemerslage, A.; Denisov, P.; Ortega, J.; Stenzel, K.; Alvarez, A.; et al. La Modelización de la Morfología Verbal Bribri. 2023. Available online: https://neurips.cc/virtual/2022/competition/50096 (accessed on 12 August 2023).
  9. Umaña, A.C. Chibchan languages. In The Indigenous Languages of South America; Campbell, L., Grondona, V., Eds.; De Gruyter Mouton: Berlin, Germany, 2012; pp. 391–440. [Google Scholar] [CrossRef]
  10. Feldman, I.; Coto-Solano, R. Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri. In Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain, 8–13 December 2020. [Google Scholar] [CrossRef]
  11. Kann, K.; Ebrahimi, A.; Mager, M.; Oncevay, A.; Ortega, J.E.; Rios, A.; Fan, A.; Gutierrez-Vasques, X.; Chiruzzo, L.; Giménez-Lugo, G.A.; et al. AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas. Front. Artif. Intell. 2022, 5, 266. [Google Scholar] [CrossRef] [PubMed]
  12. Adelaar, W. Guaraní. In Encyclopedia of Language & Linguistics, 2nd ed.; Brown, K., Ed.; Elsevier: Oxford, UK, 2006; pp. 165–166. [Google Scholar] [CrossRef]
  13. Costa, W. ‘Culture Is Language’: Why an Indigenous Tongue Is Thriving in Paraguay. 2020. Available online: https://www.theguardian.com/world/2020/sep/03/paraguay-guarani-indigenous-language (accessed on 10 July 2023).
  14. Stenzel, K. Kotiria ’differential object marking’ in cross-linguistic perspective. Amerindia 2008, 32, 153–181. Available online: https://amerindia.cnrs.fr/wp-content/uploads/2021/02/Stenzel-Kristine-Kotiria-differential-object-marking-in-cross-linguistic-perspective.pdf (accessed on 12 September 2023).
  15. Endangered Language Project. Endangered Language Project Catalogue. 2023. Available online: https://www.endangeredlanguages.com/ (accessed on 12 July 2023).
  16. Crevels, M. Language endangerment in South America: The clock is ticking. In The Indigenous Languages of South America; Campbell, L., Grondona, V., Eds.; De Gruyter Mouton: Berlin, Germany, 2012; pp. 167–234. [Google Scholar] [CrossRef]
  17. Ethnologue. Languages of the World. 2023. Available online: https://www.ethnologue.com/ (accessed on 12 July 2023).
  18. UNESCO. World Atlas of Languages. 2023. Available online: https://en.wal.unesco.org/world-atlas-languages (accessed on 12 July 2023).
  19. Pearce, A.J.; Heggarty, P. “Mining the Data” on the Huancayo-Huancavelica Quechua Frontier. In History and Language in the Andes; Heggarty, P., Pearce, A.J., Eds.; Palgrave Macmillan US: New York, NY, USA, 2011; pp. 87–109. [Google Scholar] [CrossRef]
  20. Lagos, C.; Espinoza, M.; Rojas, D. Mapudungun according to its speakers: Mapuche intellectuals and the influence of standard language ideology. Curr. Issues Lang. Plan. 2013, 14, 105–118. [Google Scholar] [CrossRef]
  21. Ko, T.; Peddinti, V.; Povey, D.; Khudanpur, S. Audio augmentation for speech recognition. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015. [Google Scholar]
  22. Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224. [Google Scholar]
  23. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar]
  24. Synnaeve, G.; Xu, Q.; Kahn, J.; Likhomanenko, T.; Grave, E.; Pratap, V.; Sriram, A.; Liptchinsky, V.; Collobert, R. End-to-end asr: From supervised to semi-supervised learning with modern architectures. arXiv 2019, arXiv:1911.08460. [Google Scholar]
  25. Xu, Q.; Likhomanenko, T.; Kahn, J.; Hannun, A.; Synnaeve, G.; Collobert, R. Iterative pseudo-labeling for speech recognition. arXiv 2020, arXiv:2005.09267. [Google Scholar]
  26. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv 2020, arXiv:2006.11477. [Google Scholar]
  27. Yi, C.; Wang, J.; Cheng, N.; Zhou, S.; Xu, B. Applying wav2vec2.0 to Speech Recognition in various low-resource languages. arXiv 2020, arXiv:2012.12121. [Google Scholar]
  28. Parikh, A.; ten Bosch, L.; van den Heuvel, H.; Tejedor-Garcia, C. Comparing Modular and End-To-End Approaches in ASR for Well-Resourced and Low-Resourced Languages. In Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), Virtual, 16–17 December 2023; pp. 266–273. [Google Scholar]
  29. Baevski, A.; Hsu, W.N.; Conneau, A.; Auli, M. Unsupervised speech recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 27826–27839. [Google Scholar]
  30. Wang, D.; Zheng, T.F. Transfer learning for speech and language processing. In Proceedings of the 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hong Kong, China, 6–19 December 2015; pp. 1225–1237. [Google Scholar]
  31. Kunze, J.; Kirsch, L.; Kurenkov, I.; Krug, A.; Johannsmeier, J.; Stober, S. Transfer learning for speech recognition on a budget. arXiv 2017, arXiv:1706.00290. [Google Scholar]
  32. Yi, J.; Tao, J.; Wen, Z.; Bai, Y. Language-adversarial transfer learning for low-resource speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 27, 621–630. [Google Scholar] [CrossRef]
  33. Yu, Z.; Zhang, Y.; Qian, K.; Wan, C.; Fu, Y.; Zhang, Y.; Lin, Y.C. Master-ASR: Achieving Multilingual Scalability and Low-Resource Adaptation in ASR with Modular Learning. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 40475–40487. [Google Scholar]
  34. Ebrahimi, A.; Mager, M.; Wiemerslage, A.; Denisov, P.; Oncevay, A.; Liu, D.; Koneru, S.; Ugan, E.Y.; Li, Z.; Niehues, J.; et al. Findings of the Second AmericasNLP Competition on Speech-to-Text Translation. In Proceedings of the NeurIPS 2022 Competition Track, PMLR, New Orleans, LA, USA, 28 November–9 December 2022; pp. 217–232. [Google Scholar]
  35. Mager, M.; Gutierrez-Vasques, X.; Sierra, G.; Meza, I. Challenges of language technologies for the indigenous languages of the Americas. arXiv 2018, arXiv:1806.04291. [Google Scholar]
  36. Mager, M.; Oncevay, A.; Ebrahimi, A.; Ortega, J.; Gonzales, A.R.; Fan, A.; Gutierrez-Vasques, X.; Chiruzzo, L.; Giménez-Lugo, G.; Ramos, R.; et al. Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 202–217. [Google Scholar]
  37. Agüero Torales, M.M. Machine Learning approaches for Topic and Sentiment Analysis in multilingual opinions and low-resource languages: From English to Guarani. Proces. Leng. Nat. 2023, 70, 235–238. [Google Scholar]
  38. Gasser, M. Machine translation and the future of indigenous languages. In Proceedings of the I Congreso Internacional de Lenguas y Literaturas Indoamericanas, Temuco, Chile, 12–13 October 2006. [Google Scholar]
  39. Jimerson, R.; Liu, Z.; Prud’hommeaux, E. An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 1008–1016. [Google Scholar]
  40. Sasmal, S.; Saring, Y. Robust automatic continuous speech recognition for ’Adi’, a zero-resource indigenous language of Arunachal Pradesh. Sādhanā 2022, 47, 271. [Google Scholar] [CrossRef]
  41. Coto-Solano, R.; Nicholas, S.A.; Datta, S.; Quint, V.; Wills, P.; Powell, E.N.; Koka‘ua, L.; Tanveer, S.; Feldman, I. Development of automatic speech recognition for the documentation of Cook Islands Māori. Proc. Lang. Resour. Eval. Conf. 2022, 13, 3872–3882. [Google Scholar]
  42. Chuctaya, H.F.C.; Mercado, R.N.M.; Gaona, J.J.G. Isolated automatic speech recognition of Quechua numbers using MFCC, DTW and KNN. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 24–29. [Google Scholar] [CrossRef]
  43. Adams, O.; Wiesner, M.; Watanabe, S.; Yarowsky, D. Massively multilingual adversarial speech recognition. arXiv 2019, arXiv:1904.02210. [Google Scholar]
  44. Zevallos, R.; Cordova, J.; Camacho, L. Automatic speech recognition of quechua language using hmm toolkit. In Proceedings of the Annual International Symposium on Information Management and Big Data, Lima, Peru, 21–23 August 2019; pp. 61–68. [Google Scholar]
  45. Zevallos, R.; Bel, N.; Cámbara, G.; Farrús, M.; Luque, J. Data Augmentation for Low-Resource Quechua ASR Improvement. arXiv 2022, arXiv:2207.06872. [Google Scholar]
  46. Maldonado, D.M.; Villalba Barrientos, R.; Pinto-Roa, D.P. Eñe’ e: Sistema de reconocimiento automático del habla en Guaraní. In Proceedings of the Simposio Argentino de Inteligencia Artificial (ASAI 2016)-JAIIO 45 (Tres de Febrero, 2016), Buenos Aires, Argentina, 5–9 September 2016. [Google Scholar]
  47. Peterson, K.; Tong, A.; Yu, Y. OpenASR20: An Open Challenge for Automatic Speech Recognition of Conversational Telephone Speech in Low-Resource Languages. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 4324–4328. [Google Scholar]
  48. Peterson, K.; Tong, A.; Yu, Y. OpenASR21: The Second Open Challenge for Automatic Speech Recognition of Low-Resource Languages. Proc. Interspeech 2022, 4895–4899. [Google Scholar]
  49. Koumparoulis, A.; Potamianos, G.; Thomas, S.; da Silva Morais, E. Resource-efficient TDNN Architectures for Audio-visual Speech Recognition. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 506–510. [Google Scholar] [CrossRef]
  50. Zhao, J.; Zhang, W.Q. Improving automatic speech recognition performance for low-resource languages with self-supervised models. IEEE J. Sel. Top. Signal Process. 2022, 16, 1227–1241. [Google Scholar] [CrossRef]
  51. Coto-Solano, R.; Sofía, F.S. Alineación forzada sin entrenamiento para la anotación automática de corpus orales de las lenguas indígenas de Costa Rica. Káñina 2017, 40, 175–199. [Google Scholar] [CrossRef]
  52. Coto-Solano, R. Explicit Tone Transcription Improves ASR Performance in Extremely Low-Resource Languages: A Case Study in Bribri. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 173–184. [Google Scholar] [CrossRef]
  53. Chen, C.C.; Chen, W.; Zevallos, R.; Ortega, J. Evaluating Self-Supervised Speech Representations for Indigenous American Languages. arXiv 2023, arXiv:2310.03639. [Google Scholar]
  54. Coto-Solano, R. Evaluating Word Embeddings in Extremely Under-Resourced Languages: A Case Study in Bribri. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 4455–4467. [Google Scholar]
  55. Prud’hommeaux, E.; Jimerson, R.; Hatcher, R.; Michelson, K. Automatic speech recognition for supporting endangered language documentation. Lang. Doc. Conserv. 2021, 15, 491–513. [Google Scholar]
  56. Krasnoukhova, O. Attributive modification in South American indigenous languages. Linguistics 2022, 60, 745–807. [Google Scholar] [CrossRef]
  57. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
  58. Yi, C.; Wang, J.; Cheng, N.; Zhou, S.; Xu, B. Transfer Ability of MonolingualWav2vec2.0 for Low-resource Speech Recognition. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  59. N, K.D.; Wang, P.; Bozza, B. Using Large Self-Supervised Models for Low-Resource Speech Recognition. In Proceedings of the Proc. Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 2436–2440. [Google Scholar] [CrossRef]
  60. Torre, I.G.; Romero, M.; Álvarez, A. Improving aphasic speech recognition by using novel semi-supervised learning methods on aphasiabank for english and spanish. Appl. Sci. 2021, 11, 8872. [Google Scholar] [CrossRef]
  61. Tang, J.; Chen, W.; Chang, X.; Watanabe, S.; MacWhinney, B. A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning. arXiv 2023, arXiv:2305.13331. [Google Scholar]
  62. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
  63. Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. arXiv 2021, arXiv:2111.09296. [Google Scholar]
  64. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common voice: A massively-multilingual speech corpus. arXiv 2019, arXiv:1912.06670. [Google Scholar]
  65. Gales, M.J.; Knill, K.M.; Ragni, A.; Rath, S.P. Speech recognition and keyword spotting for low-resource languages: BABEL project research at CUED. In Proceedings of the Fourth International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2014), St. Petersburg, Russia, 14–16 May 2014; pp. 16–23. [Google Scholar]
  66. Pratap, V.; Xu, Q.; Sriram, A.; Synnaeve, G.; Collobert, R. MLS: A Large-Scale Multilingual Dataset for Speech Research. arXiv 2020, arXiv:2012.03411. [Google Scholar]
  67. Wang, C.; Riviere, M.; Lee, A.; Wu, A.; Talnikar, C.; Haziza, D.; Williamson, M.; Pino, J.; Dupoux, E. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 993–1003. [Google Scholar] [CrossRef]
  68. Valk, J.; Alumäe, T. VoxLingua107: A Dataset for Spoken Language Recognition. arXiv 2020, arXiv:2011.12998. [Google Scholar]
  69. Corpus Oral Pandialectal de la Lengua Bribri. 2017. Available online: http://bribri.net (accessed on 12 September 2023).
  70. Grammar and Multilingual Practices through the Lens of Everyday Interaction in Two Endangered Languages in the East Tukano Family. 2017. Available online: http://hdl.handle.net/2196/00-0000-0000-0010-7D1A-A (accessed on 12 September 2023).
  71. Kotiria Linguistic and Cultural Archive. Endangered Languages Archive. 2017. Available online: http://hdl.handle.net/2196/00-0000-0000-0002-05B0-5 (accessed on 12 September 2023).
  72. Siminchikkunarayku. Available online: https://www.siminchikkunarayku.pe/ (accessed on 12 September 2023).
  73. Universidad de Costa Rica. Portal de la Lengua Bribri SE’IE. 2021. Available online: https://vinv.ucr.ac.cr/es/tags/lengua-bribri (accessed on 12 September 2023).
  74. live.bible.is. Available online: https://live.bible.is (accessed on 12 September 2023).
  75. Brown, M.; Tucker, K. Data from Quipu Project (12-2018). 2020. Available online: https://research-information.bris.ac.uk/en/datasets/data-from-quipu-project-12-2018 (accessed on 12 September 2023).
  76. Heafield, K. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 30–31 July 2011; pp. 187–197. [Google Scholar]
  77. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 2015, 104, 148–175. [Google Scholar] [CrossRef]
  78. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019. [Google Scholar]
  79. Sobol, I.M. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Math. Comput. Simul. 2001, 55, 271–280. [Google Scholar] [CrossRef]
  80. Langie, K.M.G.; Tak, K.; Kim, C.; Lee, H.W.; Park, K.; Kim, D.; Jung, W.; Lee, C.W.; Oh, H.S.; Lee, D.K.; et al. Toward economical application of carbon capture and utilization technology with near-zero carbon emission. Nat. Commun. 2022, 13, 7482. [Google Scholar] [CrossRef]
  81. Schneider, K.; Van der Werf, W.; Cendoya, M.; Mourits, M.; Navas-Cortés, J.A.; Vicent, A.; Oude Lansink, A. Impact of Xylella fastidiosa subspecies pauca in European olives. Proc. Natl. Acad. Sci. USA 2020, 117, 9250–9259. [Google Scholar] [CrossRef] [PubMed]
  82. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A review of machine learning interpretability methods. Entropy 2020, 23, 18. [Google Scholar] [CrossRef] [PubMed]
  83. Antoniadis, A.; Lambert-Lacroix, S.; Poggi, J.M. Random forests for global sensitivity analysis: A selective review. Reliab. Eng. Syst. Saf. 2021, 206, 107312. [Google Scholar] [CrossRef]
  84. Wang, Z.; Li, M.; Ren, F.; Ma, B.; Yang, H.; Zhu, Y. Sobol sensitivity analysis and multi-objective optimization of manifold microchannel heat sink considering entropy generation minimization. Int. J. Heat Mass Transf. 2023, 208, 124046. [Google Scholar] [CrossRef]
  85. Cai, D.; Li, M. Leveraging ASR Pretrained Conformers for Speaker Verification Through Transfer Learning and Knowledge Distillation. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 1–14. [Google Scholar] [CrossRef]
  86. Kakuba, S.; Poulose, A.; Han, D.S. Deep Learning Approaches for Bimodal Speech Emotion Recognition: Advancements, Challenges, and a Multi-Learning Model. IEEE Access 2023, 11, 113769–113789. [Google Scholar] [CrossRef]
  87. Shahamiri, S.R. Speech Vision: An End-to-End Deep Learning-Based Dysarthric Automatic Speech Recognition System. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 852–861. [Google Scholar] [CrossRef] [PubMed]
  88. Romero, M.; Gomez, S.; Torre, I.G. ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa’ikhana. arXiv 2024, arXiv:2404.08368. [Google Scholar]
Figure 1. Sketch of the dataset used for fine-tuning the ASR system, the CNN and transformer-based architecture wav2vec2.0, the fine-tuning process, the Bayesian hyperparameter search and the Sobol sensitivity analysis.
Figure 2. The outer bar chart panel displays the character error rates (CERs) for five Indigenous language models: Kotiria, Wa’ikhana, Bribri, Guarani, and Quechua. Lower bars indicate better model performance. The inner panel provides a Sobol’ sensitivity analysis of the various hyperparameters tuned during model training, assessing their impact on model performance variability. The orange bars represent the total sensitivity (ST) index, while the green bars indicate the first-order sensitivity (S1) index. A higher bar indicates that correctly choosing that hyperparameter during the fine-tuning phase is more important.
Table 1. Detailed description of the train-dev-test splits for each language and data source. Primary refers to the corpus provided by the AmericasNLP organization; speed augmentation is performed only on the primary database; and external refers to corpora collected from other sources.
| Language  | Train: Primary | Train: Speed Augm. | Train: External | Train: Total | Dev: Primary | Test: Primary |
|-----------|----------------|--------------------|-----------------|--------------|--------------|---------------|
| Bribri    | 0.49 h | 0.98 h | 1.14 h | 2.61 h  | 0.04 h | 0.19 h |
| Guarani   | 0.32 h | 0.64 h | -      | 0.97 h  | 0.02 h | 0.12 h |
| Kotiria   | 2.69 h | 5.43 h | 21.8 h | 29.92 h | 0.5 h  | 0.3 h  |
| Wa’ikhana | 1.49 h | 2.98 h | -      | 4.39 h  | 0.1 h  | 0.21 h |
| Quechua   | 1.67 h | 3.34 h | 7.04 h | 12.09 h | 0.2 h  | 2.08 h |
Table 2. Hyperparameter search range during the fine-tuning training phase.
| Hyperparameter            | Minimum Value | Maximum Value |
|---------------------------|---------------|---------------|
| Learning rate             | 10⁻⁶          | 10⁻³          |
| Max number of updates     | 10k           | 100k          |
| Freeze fine-tune updates  | 0             | 50k           |
| Activation dropout        | 0.01          | 0.2           |
| Mask probability          | 0.2           | 0.7           |
| Mask channel probability  | 0.1           | 0.7           |
Table 3. WERs and CERs for the best fine-tuning hyperparameter configurations for Quechua, Kotiria, Bribri, Guarani, and Wa’ikhana. Additional configurations and results can be found in the Supplementary Information.
| Language  | Learning Rate | Max Updates | Freeze Fine-Tune | Mask Channel | Activation Dropout | WER   | CER   |
|-----------|---------------|-------------|------------------|--------------|--------------------|-------|-------|
| Quechua   | 1 × 10⁻⁵      | 90k         | 5k               | 0.5          | 0.1                | 48.98 | 12.14 |
| Kotiria   | 1 × 10⁻⁵      | 40k         | 5k               | 0.5          | 0.1                | 79.69 | 36.59 |
| Bribri    | 1 × 10⁻⁴      | 90k         | 8k               | 0.3          | 0.2                | 69.03 | 34.70 |
| Guarani   | 1 × 10⁻⁵      | 90k         | 9k               | 0.3          | 0.1                | 62.91 | 15.59 |
| Wa’ikhana | 1 × 10⁻⁵      | 130k        | 1k               | 0.25         | 0.1                | 68.42 | 35.23 |
Table 4. Sensitivity analysis results using first-order (S1) Sobol’ and global (ST) indices for the different fine-tuning hyperparameters. Parameters have been ordered by their ST importance.
| Parameter                | S1   | ST   |
|--------------------------|------|------|
| Freeze fine-tune updates | 0.32 | 0.41 |
| Activation dropout       | 0.35 | 0.37 |
| Mask prob                | 0.54 | 0.29 |
| Mask channel prob        | 0.54 | 0.28 |
| Learning rate (lr)       | 0.57 | 0.26 |
| Max update               | 0    | 0.13 |


