Information 2019, 10(9), 268;

The Usefulness of Imperfect Speech Data for ASR Development in Low-Resource Languages
by 1,*,† and 1,2,†
Human Technologies Research Group, CSIR Next Generation Enterprises and Institutions Cluster, P.O. Box 395, Pretoria 0001, South Africa
Department of Electrical & Electronic Engineering, Stellenbosch University, Private Bag X1, Stellenbosch 7602, South Africa
Correspondence: [email protected]
These authors contributed equally to this work.
Received: 29 June 2019 / Accepted: 8 August 2019 / Published: 28 August 2019


When the National Centre for Human Language Technology (NCHLT) Speech corpus was released, it created various opportunities for speech technology development in the 11 official, but critically under-resourced, languages of South Africa. Since then, the substantial improvements in acoustic modeling that deep architectures achieved for well-resourced languages ushered in a new data requirement: their development requires hundreds of hours of speech. A suitable strategy for the enlargement of speech resources for the South African languages is therefore required. The first possibility was to look for data that has already been collected but has not been included in an existing corpus. Additional data was collected during the NCHLT project that was not included in the official corpus: it only contains a curated, but limited subset of the data. In this paper, we first analyze the additional resources that could be harvested from the auxiliary NCHLT data. We also measure the effect of this data on acoustic modeling. The analysis incorporates recent factorized time-delay neural networks (TDNN-F). These models significantly reduce phone error rates for all languages. In addition, data augmentation and cross-corpus validation experiments for a number of the datasets illustrate the utility of the auxiliary NCHLT data.
automatic speech recognition; low-resource languages; speech data; speech technology; Kaldi; time-delay neural networks

1. Introduction

The development of language and speech technology requires substantial amounts of appropriate data. While huge volumes of text and speech data are available in some languages, others have very little with which to work. Languages in the first category are commonly referred to as “highly resourced”, while those in the second category are known as “under-resourced” (low-resourced languages do have sufficient data for initial model development). The work we report on in this paper is part of an ongoing effort to enlarge the resources that are available for technology development in South Africa’s 11 official languages (three-letter ISO codes in brackets): Afrikaans (Afr), South African English (Eng), isiNdebele (Nbl), isiXhosa (Xho), isiZulu (Zul), Sepedi (Nso), Sesotho (Sot), Setswana (Tsn), Siswati (Ssw), Tshivenda (Ven), and Xitsonga (Tso).
Work in this area has been supported by the South African government for a number of years. Initial projects were funded by the Department of Arts, Culture, Science and Technology (DACST) and subsequently by the Departments of Arts and Culture (DAC) and Science and Technology (DST), respectively, after the two departments became separate entities. For instance, the African Speech Technology (AST) project [1] was supported by DACST, while DAC funded projects like Lwazi [2,3] and the National Centre for Human Language Technology (NCHLT) speech [4,5] and text [6] projects. The recently-established South African Centre for Digital Language Resources (SADiLaR) is funded by DST.
Various strategies have been proposed to collect speech and text resources for technology development, for example harvesting existing data like broadcast news and online publications, crowd-sourcing, web crawling, dedicated data collection campaigns, etcetera [7,8,9,10,11,12,13]. Both data types are required for language and speech technology development, and constructing comprehensive text corpora is just as important as creating speech resources. However, the work we report on here mainly concerns speech data.
One of the most efficient ways to collect vast volumes of speech data is by means of speech applications like voice search, where input speech is captured and used to improve system performance [14]. Other strategies that have proven successful include crowd-sourcing and transcribing or translating existing resources.
In the absence of these possibilities, dedicated data collection campaigns can be used to collect representative samples of languages in their spoken form. In South Africa, the AST, Lwazi, and the first NCHLT project relied on data collection to create speech resources for the indigenous languages. During the Lwazi project, telephone speech was collected (between four and ten hours per language [2]), while the aim of the first NCHLT project was to collect 50–60 h of orthographically-transcribed, broadband speech in each of the country’s 11 official languages [4].

2. Background

Honest researchers and field workers can affirm that, despite careful design, meticulous planning, and continuous monitoring of execution, data collection does not always happen the way it should. No matter how carefully one goes about it, there always seem to be errors of one kind or another in the collected data [15,16,17]. Unforeseen challenges or delays in data collection could be due to issues related to the means of collection (e.g., telephone lines are out of order on the day that collection was planned to start), logistics (e.g., the bus that was supposed to bring volunteers to a suitable location broke down on the way), the attitude or literacy levels of potential participants, and so forth. The NCHLT speech project was no exception in this regard, and despite the fact that the project was successfully executed, not everything went exactly as planned.
During the project, speech data was collected using a smartphone application [11]. The initial version of the app used a prompt counter to select a unique subset of prompts for each recording session. However, this value was stored in memory and was sometimes accidentally reset as fieldworkers cleared recording devices. This resulted in some subsets of the data being recorded multiple times while other subsets were never selected. The app was subsequently updated to support random selection of prompts from the larger vocabulary, and additional, more diverse data was collected in some languages. To meet the project specifications, the majority of the repeated prompts were excluded from the subset of the data that was released as the NCHLT Speech corpus.
It is often said that “there is no data like more data”, and given the modeling capabilities of some recent acoustic modeling techniques, the question arose whether the data that was excluded from the official NCHLT corpus could be used to improve modeling accuracy. In this paper, we therefore investigate the potential of the additional or auxiliary data to improve acoustic models of the languages involved, given current best practices.
While the results of many studies seem to confirm that “there really is no data like more data”, the “garbage in, garbage out” principle also holds: using poor quality data will result in poor models, no matter how much of it is available. Poor models will ultimately yield poor results. One of the aims of our investigation was thus to quantify, to some extent, the quality of the utterances in the auxiliary datasets and to exclude potential “garbage” from the pool of additional data.
Basic verification steps were included in the NCHLT data collection protocol to identify corrupt and/or empty files. In the current study, we also used forced alignment to identify recordings that did not match their prompts. A phone string corresponding to the expected pronunciation of each prompt was generated, and if a forced alignment between the phone string and the actual acoustics failed, the utterance was not included in the auxiliary data. For the remaining prompts, we used a phone-based dynamic programming (PDP) scoring technique [18,19] to quantify the degree of acoustic match between the expected and produced pronunciations of each prompt and to rank them accordingly. Consequently, transcription errors or bad acoustic recording conditions could be filtered out based on an utterance level measure.
Baseline automatic speech recognition (ASR) results for both the Hidden Markov Model Toolkit (HTK) [20] and Kaldi [21] toolkits were published when the NCHLT Speech corpus was released. The Kaldi implementation of Subspace Gaussian Mixture Models (SGMMs) yielded the best results [4]. Subsequent experiments using one of the languages (Xho) showed that substantial gains can be achieved over the initial baseline if the acoustic models are implemented using deep neural networks (DNNs) [22]. Similar observations were made for the Lwazi telephone corpus [23], using DNNs optimized by means of sequence-discriminative training with a state-level minimum Bayes risk criterion. However, according to recent studies, time delay neural networks (TDNN) [24,25] and long short-term memory (LSTM) acoustic models outperform DNN-based models [26].
A model architecture that combines TDNNs and bi-directional LSTMs (BLSTMs) yielded the best results in a preliminary study on the auxiliary NCHLT data [19]. BLSTMs process input data in both time directions using two separate hidden layers. In this manner, they preserve both past and future context information [27]. The interleaving of temporal convolution and BLSTM layers has been shown to model future temporal context effectively [28]. When BLSTMs are trained on limited datasets, configurations with more layers (as many as five) outperform similar systems with fewer layers (three or fewer). Larger training sets (approaching 100 h of data) obtain even better performance using six layers [29].
Ongoing research aims to incorporate deeper TDNNs since it is known that more layers have significantly improved the performance of image recognition tasks [30]. However, the gate mechanism in LSTMs still seems to have utility to selectively train TDNNs by emphasizing the more important input dimensions for a particular piece of audio [31]. In this paper, we report results obtained using TDNN-F acoustic models, which have recently been demonstrated to be effective in resource-constrained scenarios [32]. Apart from reducing the number of parameters (and connections) of a single layer, the singular-value decomposition operation also proves effective with deeper network architectures. In particular, it has been found that tuning the TDNN-F networks resulted in networks with as many as 11 layers [32]. The best Kaldi Librispeech chain model example recipe used in this study contained as many as 17 layers (Section 4.1).
The next section of the paper describes the NCHLT data, as well as the extent of repetition in the auxiliary datasets. Subsequent sections introduce the techniques that were used to quantify the quality of the auxiliary recordings and present TDNN-F results for all 11 languages. The paper also includes experiments that were conducted to determine whether the acoustic models benefited from the inclusion of the auxiliary data in the training set. The recognition performance of models trained on different training sets was measured on out-of-domain datasets.

3. Data

As was pointed out in Section 2, the recordings that were made during the initial phase of the NCHLT Speech project contained many repetitions of some prompts. Additional data was therefore collected to ensure that the corpus met the acoustic diversity stipulated in the project requirements. For a number of languages, this sequence of events resulted in two datasets being collected: one set with many examples of the same prompts and one set with fewer examples of many different prompts.
Participants in the NCHLT data collection campaign were asked to read text prompts displayed on a smartphone screen. The prompts were compiled using a text selection algorithm that determined the most frequently-observed n-grams for each language. The algorithm was used to derive prompts from the biggest text corpus that was available for each language (at the time) [6]. A mobile data collection tool was subsequently used to record the prompts while they were read out by participants [11].
Given that participants were asked to read text displayed on a mobile device, a reasonable match between the audio and text data can be expected. The recorded speech was therefore not transcribed manually. However, poor matches between prompts and their recordings did occur, usually as a result of reading errors, high levels of background noise, hesitations, etcetera. A confidence scoring technique was used to identify recordings that did not match their associated transcriptions. Recordings that had a high confidence score (well-matched with their associated transcriptions) and that contributed most to lexical diversity were selected to be included in the final version of the corpus. An additional specification stipulated that the corpus should contain an equal amount of data (±56 h of speech) for all 11 languages. Due to this restriction, data that could be of a sufficiently good acoustic quality was not included in the final corpus. To clarify exactly which part of the recorded data we refer to, we adhere to the dataset definitions that were published with the first version of the corpus:
  • NCHLT-raw
    The total set of usable data collected after all empty and otherwise unusable recordings were discarded. This includes multiple sessions of some speakers and multiple examples of some prompts.
  • NCHLT-baseline
    A subset of NCHLT-raw representing approximately 200 unique speakers per language and more than 200 utterances per speaker. Recordings from the more diverse second batch of data were given preference in cases where speakers participated in both data collection campaigns.
  • NCHLT-clean
    A subset of NCHLT-baseline constituting the final deliverable of ±56 h of speech data for all 11 official languages. For ASR evaluation purposes, this dataset was partitioned into a training and test set. The test partitions consisted of eight speakers (equal numbers of male and female speakers) that were manually selected. The development data was taken from the training sets defined in [4] and was selected to contain another eight speakers each. (The composition of the test set is included in the official corpus. We used the development set defined for the experiments in the 2014 corpus paper; the corresponding file lists are available for download.)
The Aux1 dataset comprises the data in NCHLT-baseline that was not included in NCHLT-clean (the same speakers therefore occur in Aux1 and the NCHLT-clean dataset). Aux2 refers to all the NCHLT-raw utterances that are not in NCHLT-baseline. Table 1 presents the initial number of recordings (init) in the Aux1 and Aux2 datasets for each language.
The values in the failed column correspond to the number of utterances in each dataset for which the alignment procedure described in Section 4.4 failed. The percentage values in the last row of Table 1 indicate that more than 90% of both the datasets could be aligned and could therefore be considered for harvesting. This corresponds to 780.57 and 640.70 h of audio in the Aux1 and Aux2 sets, respectively.
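The relationship between the five datasets can be expressed as simple set operations on utterance identifiers. The following is a minimal sketch of those definitions only; the function name and the use of plain string identifiers are assumptions made for illustration and do not reflect the tooling used in the project.

```python
def derive_auxiliary_sets(raw, baseline, clean):
    """Return (aux1, aux2) as sets of utterance IDs.

    Hypothetical helper: the inputs are any iterables of utterance IDs.
    """
    raw, baseline, clean = set(raw), set(baseline), set(clean)
    # The dataset definitions imply that clean is a subset of baseline,
    # which in turn is a subset of raw.
    if not clean <= baseline <= raw:
        raise ValueError("expected clean to be within baseline, and baseline within raw")
    aux1 = baseline - clean  # in NCHLT-baseline, but not in NCHLT-clean
    aux2 = raw - baseline    # in NCHLT-raw, but not in NCHLT-baseline
    return aux1, aux2
```

By construction, Aux1, Aux2, and NCHLT-clean are disjoint and together cover NCHLT-raw.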

3.1. Unique and Repeated Prompts

Shortly after the release of the NCHLT Speech corpus, an overview of the unique and repeated prompts was reported in [33]. Table 2 and Table 3 provide type and token counts for the prompts in the NCHLT-clean, Aux1, and Aux2 datasets.
The values in the NCHLT_TRN Type column correspond to the number of unique prompts in the NCHLT training set. The counts for prompt types that occur in the test set, but not in the training set are listed in the NCHLT_TST Type column. NCHLT_TRN_TST types correspond to unique prompts that occur in both the training and the test sets (Type and token counts for the NCHLT_DEV set are not included in the table. On average, the development sets contain around 3000 prompt tokens.). The Aux1 and Aux2 columns indicate how many of these types also occur in the auxiliary data. The type and token counts for the unique prompts that occur only in the auxiliary data are provided in the last four columns of Table 3. These values indicate that the auxiliary data mostly contains repetitions of prompts that are already in the NCHLT-clean corpus. Table 4 and Table 5 contain word level type and token counts for all the datasets.
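As an illustration of how type and token counts of this kind can be derived, the sketch below counts unique prompts (types) and total prompt occurrences (tokens), and identifies the prompt types that occur only in an auxiliary set. The function names are hypothetical and the prompts are assumed to be already normalized strings.

```python
from collections import Counter


def type_token_counts(prompts):
    """Return (number of unique prompts, total number of prompt occurrences)."""
    counts = Counter(prompts)
    return len(counts), sum(counts.values())


def auxiliary_only_types(aux_prompts, clean_prompts):
    """Prompt types that occur in the auxiliary data but not in NCHLT-clean."""
    return set(aux_prompts) - set(clean_prompts)
```

A large gap between the token count and the type count of an auxiliary set, combined with a small set of auxiliary-only types, corresponds to the repetition pattern described above.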

3.2. Speaker Mapping

During the second phase of data collection, some speakers who were already in the initial dataset participated in the data collection again. Apart from ensuring vocabulary diversity and well-matched transcriptions during the NCHLT-baseline subset selection, duplicate speaker sessions were avoided. A rather conservative approach was followed to identify possible duplicate speakers from available metadata. In particular, three data fields in the metadata were used to identify overlapping speaker sessions: names, national identity and telephone numbers. Speaker duplication was flagged if any of the fields were identical or differed by only one digit. The data corresponding to duplicate speakers was subsequently clustered into a single set with a unique speaker identity.
As was mentioned in Section 3, the Aux1 data has exactly the same speaker numbers as NCHLT-clean. However, speaker overlap with the NCHLT-baseline speakers can be expected for the Aux2 recordings (The test speaker overlap for Afr is an exception. According to the metadata, it seems that two test speakers occur in the NCHLT-clean training data as well.). To quantify the extent of the overlap, Table 6 shows the number of speaker clusters that were identified per language following a similar metadata-based detection process.
In the table, a match is reported if any of the metadata fields were identical, while a difference of one digit constituted close matches. The number of speakers in the Aux2 corpora is much higher than the detected overlapping speakers (all). Therefore, Aux2 also contains data from additional speakers who are not represented in the NCHLT-clean corpora. The table also indicates that, for six languages, speakers whose data is included in the predefined NCHLT development (dev) and test (tst) sets may have contributed to the Aux2 set as well.
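The metadata-based detection described in this section can be sketched as a pairwise comparison followed by clustering. This is an illustrative sketch under stated assumptions, not the procedure used in the project: the field names (`name`, `id_number`, `phone`) are hypothetical, names are compared exactly, the one-digit tolerance is applied only to the two numeric fields, and empty fields are ignored.

```python
def differs_by_at_most_one_digit(a, b):
    """True if two equal-length strings are identical or differ in one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def is_possible_duplicate(s1, s2):
    """Flag two sessions as the same speaker (assumed matching rules)."""
    # Assumption: names must match exactly; empty names never match.
    if s1["name"] and s1["name"] == s2["name"]:
        return True
    # Assumption: numeric fields may differ by a single digit.
    for field in ("id_number", "phone"):
        if s1[field] and s2[field] and differs_by_at_most_one_digit(s1[field], s2[field]):
            return True
    return False


def cluster_speakers(sessions):
    """Group session indices into speaker clusters using union-find."""
    parent = list(range(len(sessions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(sessions)):
        for j in range(i + 1, len(sessions)):
            if is_possible_duplicate(sessions[i], sessions[j]):
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(sessions)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Because matches are merged transitively, two sessions can end up in the same cluster without matching each other directly, which is in keeping with a conservative approach to avoiding speaker overlap.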

3.3. Phone Representations

Our data analysis required phone level transcriptions of all the auxiliary data. Text pre-processing was required to prepare the transcriptions for pronunciation extraction. All text was converted to lowercase, and unwanted symbols (not within the list of graphemes for a particular language) were removed. Since numerous additional words (see Table 4) occurred in the auxiliary data, the existing NCHLT pronunciation dictionaries had to be extended before the data could be processed.
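The two pre-processing steps mentioned above (lowercasing and removal of symbols outside the grapheme list) can be sketched as follows. The function name is hypothetical, and the grapheme inventory passed in is an assumption; the per-language grapheme lists used in the study are not reproduced here.

```python
def normalize_prompt(text, graphemes):
    """Lowercase a prompt and drop characters outside the language's grapheme list.

    `graphemes` is an iterable of allowed characters (an assumed input);
    spaces are always kept so that word boundaries survive.
    """
    allowed = set(graphemes) | {" "}
    lowered = text.lower()
    kept = "".join(ch for ch in lowered if ch in allowed)
    # Collapse any runs of whitespace left behind by removed symbols.
    return " ".join(kept.split())
```

Note that simply deleting a symbol joins the characters on either side of it, so a hyphenated word becomes a single token in this sketch.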
During the NCHLT project, a set of grapheme-to-phoneme (G2P) rules was derived from the so-called NCHLT-inlang dictionaries [4]. These rules were used to predict pronunciations for the new words. No explicit procedure was followed to identify out-of-language words, but for some languages, the in-language G2P rules did not contain rules for particular graphemes or the punctuation mark used to indicate an apostrophe in English (Eng). For these words, the Eng G2P rules were used to generate pronunciations, and the phones were mapped to similar sounds in the in-language phone set. (The mappings were derived manually, employing the closest Speech Assessment Methods Phonetic Alphabet (SAMPA) phone label from the same phone category. During the NCHLT project, the SAMPA computer-readable phonetic script was used to represent the phones of all 11 languages.)
Eng was the only language for which a different procedure was followed. The G2P rules that were used for Eng were derived from a version of the Oxford Advanced Learner’s dictionary, adapted to South African Eng using manually-developed phoneme-to-phoneme rules [34].

4. Experiments

This section presents ASR results obtained using the NCHLT-clean training data, as well as extended training sets that include auxiliary data. The development and test sets described in Section 3 were used throughout. Experiments were also conducted using cross-corpus validation data so that more general conclusions could be drawn from the results. The validation data was created during the Resources for Closely Related Languages (RCRL) project [35] and comprises 330 Afr news bulletins that were broadcast between 2001 and 2004 on the local Radio Sonder Grense (RSG) radio station. The bulletins were purchased from the South African Broadcasting Corporation (SABC) and transcribed to create a corpus of around 27 h of speech data. For the experiments in this study, we used a previously-selected 7.9 h evaluation set containing 28 speakers (To obtain the phone sequences from the RSG orthography, we implemented the same procedure as for the NCHLT Afr system. After text pre-processing, G2P rules were applied to generate pronunciations for new words.).
Two acoustic modeling recipes were followed to build all acoustic models. Section 4.1 describes the experimental setup. Since the focus of the current work was primarily on acoustic modeling, recognition performance was quantified in terms of phone recognition results (Section 4.2) in all experiments. In principle, improved phone recognition should translate to better word recognition results for a well-defined transcription task. Word recognition experiments were not included, because very little text data is available for most of the NCHLT languages. After establishing a new baseline (Section 4.3), further data augmentation work using both Aux1 and Aux2 data was carried out. The selection criteria for auxiliary datasets (Section 4.4 and Section 4.5) allowed us to test the utility of the additional data with current acoustic modeling techniques. This section ends with cross-corpus validation experiments for a specific set of models (Section 4.6).

4.1. Acoustic Modeling

The development of TDNN-BLSTM baseline acoustic models for all 11 languages was described in [19]. In this paper, the aim was to improve on the baseline by using TDNN-F acoustic models (Section 2). To create the new models, the same standard triphone recognition systems that were used in previous studies [19] were required to extract phone alignments for the training data.
A standard MFCC front-end with a 25-ms Hamming window and a 10-ms shift between frames (16-kHz sampling frequency) was employed to train all models for the triphone recognition systems. Mean and variance normalization operations, applied on a per speaker basis, followed the extraction of 13 cepstra, which included C0. Delta and double delta coefficients were added. These features were used to estimate three-state left-to-right HMM triphone models, incorporating linear discriminant analysis (LDA), maximum likelihood linear transform (MLLT) training, and speaker adaptive training (SAT).
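The front-end parameters above translate into concrete frame and feature dimensions. The sketch below works through the arithmetic; it assumes that frames are only extracted where a full window fits inside the signal (no padding at the edges), and the function names are illustrative.

```python
def num_frames(num_samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Number of analysis frames, assuming no padding at the signal edges."""
    window = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000    # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def feature_dim(num_cepstra=13, num_delta_orders=2):
    """Static cepstra plus delta and double-delta coefficients."""
    return num_cepstra * (1 + num_delta_orders)
```

Under these assumptions, one second of 16-kHz audio yields 98 frames, and each frame is represented by a 39-dimensional vector (13 cepstra with their delta and double-delta coefficients).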
Similar to the previous TDNN-BLSTM models, the TDNN-F recipes also require i-vectors [36] and 40-dimensional high-resolution MFCC features for training. I-vector extractors were trained based on the training parameters provided in the Kaldi Wall Street Journal (WSJ) example recipe without adjustment. The high-resolution MFCCs were derived from speed (using factors of 0.9, 1.0, and 1.1 [37]) and volume (choosing a random factor between 0.125 and 2) perturbed data. Speed perturbing was applied first, adding two speed-perturbed versions of the audio data used for training, after which volume perturbation was applied to the complete set (including the speed-perturbed versions).
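To make the order of the two perturbation steps concrete, the following sketch builds a perturbation plan rather than modifying any audio: every utterance appears once per speed factor, and a random volume factor is then drawn for each entry of the complete set. The identifier prefixes, the uniform sampling of the volume factor, and the function names are assumptions for illustration; the actual processing in the study was done with the Kaldi recipes.

```python
import random

SPEED_FACTORS = (0.9, 1.0, 1.1)


def plan_perturbations(utt_ids, seed=0, vol_low=0.125, vol_high=2.0):
    """Return a list of (new_id, speed_factor, volume_factor) tuples."""
    rng = random.Random(seed)
    plan = []
    # Speed perturbation first: the original plus two speed-perturbed copies.
    for speed in SPEED_FACTORS:
        for utt in utt_ids:
            # Hypothetical naming scheme for the perturbed copies.
            new_id = utt if speed == 1.0 else f"sp{speed}-{utt}"
            # Volume perturbation is applied to every entry, including the copies.
            volume = rng.uniform(vol_low, vol_high)
            plan.append((new_id, speed, volume))
    return plan


def augmented_hours(hours):
    """Total audio after 3-way speed perturbation (a factor s scales duration by 1/s)."""
    return sum(hours / s for s in SPEED_FACTORS)
```

Since the slowed-down copy is longer and the sped-up copy is shorter, the total amount of training audio grows by a factor of roughly three (about 3.02) rather than exactly three.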
We generated two different TDNN-F networks with the nnet3 Kaldi setup and refer to these TDNN-F recipes as the 1c [38] and 1d [39] recipes, respectively. Both recipes were taken from Kaldi Librispeech chain model examples. The nnet3 component graph of the TDNN-F 1c recipe contained 11 TDNN-F layers. For all layers, the cell-dimension was kept at 1280 and the bottleneck-dimension at 256. In contrast, the TDNN-F 1d recipe’s component graph implements 17 layers, each with a larger cell-dimension (1536) and a smaller bottleneck-dimension (160). It also implements a dropout schedule of “0,0@0.20,0.5@0.50,0”, defining a piecewise linear function f(x) that is linearly interpolated between the points f(0) = 0, f(0.2) = 0, f(0.5) = 0.5, and f(1) = 0. Dropout schedules of this form were recommended in [40] to guard against overfitting.
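The piecewise linear dropout function can be evaluated directly from the four points listed above. The sketch below is a plain re-implementation of that interpolation for illustration; it assumes that x denotes the fraction of training completed (between 0 and 1), and it is not the code used by the Kaldi training scripts.

```python
DROPOUT_POINTS = ((0.0, 0.0), (0.2, 0.0), (0.5, 0.5), (1.0, 0.0))


def dropout_proportion(x, points=DROPOUT_POINTS):
    """Linearly interpolate the dropout proportion at training fraction x."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("points do not cover the interval [0, 1]")
```

With these points, no dropout is applied during the first 20% of training, the proportion then rises linearly to 0.5 at the halfway mark, and it decays linearly back to zero by the end of training.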

4.2. Phone Recognition Measurement

A position independent phone configuration was used to convert the training transcriptions to a phone-level representation. During system evaluation, this arrangement seamlessly converts the standard Kaldi word error rate (WER) measurement to a phone error rate (PER). PERs were calculated using only speech phone labels. Silence labels were not taken into consideration. Recognition employed a flat ARPA language model consisting of equiprobable one-grams.
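A flat language model of this kind assigns the same probability to every phone label. As a hedged illustration, the sketch below writes such a model in the ARPA text format, with each of the N labels receiving a log10 probability of log10(1/N). The function name is hypothetical, and sentence-boundary tokens are deliberately left out; whether they are needed depends on the toolkit that consumes the file.

```python
import math


def flat_unigram_arpa(phones):
    """Return ARPA-format text for an equiprobable unigram model over `phones`."""
    vocab = sorted(set(phones))
    if not vocab:
        raise ValueError("at least one phone label is required")
    logprob = math.log10(1.0 / len(vocab))
    lines = ["\\data\\", f"ngram 1={len(vocab)}", "", "\\1-grams:"]
    lines += [f"{logprob:.6f}\t{phone}" for phone in vocab]
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"
```

Because all labels are equally likely, the language model contributes no preference for any particular phone sequence, so recognition results reflect the acoustic models almost exclusively.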
The best ratio between acoustic and language model contributions was determined by varying the language-scale parameter (integer values in the range of 1–20) during scoring. The acoustic-scale parameter was set to the default value of 0.1, and the best language-scale parameter was chosen using the NCHLT-clean development datasets. The selected language-scale parameters were subsequently used during data harvesting to gauge recognition performance.

4.3. Baseline Systems

Table 7 compares the development (dev) and test (tst) set PER results of the TDNN-BLSTM baseline [19] with the new TDNN-F acoustic models. Both the TDNN-F 1c and TDNN-F 1d recipes (see Section 4.1) were evaluated for all 11 languages. The number of phone labels (#Phns) provides an indication of the label complexity.
The results in Table 7 show that, except for Nbl, the PERs of all the languages improved substantially compared to the TDNN-BLSTM baseline. Furthermore, in all cases, the TDNN-F 1d recipe yielded better results than the TDNN-F 1c recipe.

4.4. Acoustic Ranking

Not all the Aux1 and Aux2 data could be used as training or test data. The auxiliary data was screened to detect acoustically-compromised recordings using the TDNN-BLSTM acoustic models [19] (the data harvesting procedure was not repeated with TDNN-F models). The screening procedure required each utterance to be decoded twice.
First, standard free phone decoding implementing an ergodic phone loop generated a sequence of phone labels, purely based on the acoustics. Next, Kaldi’s functionality to compute training alignments from lattices for nnet3 models was used. This algorithm generates a decoding graph for a single fixed sequence of phone labels, which directly corresponds to the reference transcription. In the event that the acoustics are not a good match for the forced sequence of phone labels, this constraint can result in the decode operation exiting without producing any output. Such unsuccessful decodes served as a first selection criterion to filter out large transcription errors. The number of utterances that were discarded for the Aux1 and Aux2 datasets is shown in the failed columns in Table 1.
As was explained in Section 2, PDP scoring matched the free phone decode and forced phone label sequences. It is possible to adjust the PDP algorithm using a cost matrix so that string edit operations (substitution, deletion, and insertion) contribute differently for the various phone labels [41]. A flat phone matrix was chosen where the contributions of the edit operations are the same for all phone labels. Insertions and deletions contributed half as much to the score as substitutions and correctly-recognised labels.
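The general idea of scoring two phone sequences with dynamic programming and a flat cost matrix can be sketched as follows. This is an illustrative approximation and not the PDP implementation of [18,19]: the specific score values (+1 for a correct label, −1 for a substitution, −0.5 for an insertion or deletion) and the normalization by the reference length are assumptions, chosen so that insertions and deletions carry half the weight of the other operations and so that identical sequences score 1.00.

```python
def pdp_like_score(reference, hypothesis, match=1.0, substitution=-1.0, indel=-0.5):
    """Normalized global alignment score between two phone label sequences.

    `reference` is the forced (expected) phone sequence and `hypothesis`
    the free phone decode; all score values are assumed, not the authors'.
    """
    if not reference:
        raise ValueError("reference must contain at least one phone label")
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    best = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = i * indel  # only deletions
    for j in range(1, cols):
        best[0][j] = j * indel  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            same = reference[i - 1] == hypothesis[j - 1]
            diagonal = best[i - 1][j - 1] + (match if same else substitution)
            best[i][j] = max(diagonal, best[i - 1][j] + indel, best[i][j - 1] + indel)
    # Normalize so that a perfect match yields 1.0.
    return best[-1][-1] / (len(reference) * match)
```

A non-flat variant would replace the constant `substitution` value with a lookup in a phone-by-phone cost matrix, which is the adjustment referred to in [41].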

4.5. Data Selection

The first data augmentation experiment was conducted with only the Aux1 data added to the NCHLT training sets. This meant that the speaker labels agreed with those of NCHLT-clean and secondly that the vocabulary of the augmentation data would be similarly diverse. To select suitable subsets of additional training data, we estimated local PERs for 400 utterances at a time.
Figure 1 depicts graphs of the local PERs. These values were computed for non-overlapping subsets of utterances, ordered according to PDP scores. Figure 1 reveals a large range of PER scores for different subsets of utterances. In a few estimations, PERs of higher than 100% occur, which can be explained in terms of the PER estimation formula. PERs of higher than 100% can occur due to, for example, runaway insertions during free phone recognition. At an operating point of 50% PER, more than 20 h and for some languages even more than 60 h of additional data can be selected. In [19], it was decided to use a conservative estimate of 30% PER. This selection strategy resulted in some improvement given the TDNN-BLSTM baseline.
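The PER is conventionally computed as the number of substitutions, deletions, and insertions divided by the number of reference phone labels, so a hypothesis with many inserted labels can exceed 100%. The sketch below illustrates this, together with the block-wise (local) estimation over a ranked list of utterances. It is a simplified stand-in for the Kaldi scoring used in the study; the function names are hypothetical, and the input is assumed to be already ordered by PDP score.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def per(pairs):
    """Phone error rate (%) over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    ref_len = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / ref_len


def local_pers(ranked_pairs, block_size=400):
    """PER for consecutive non-overlapping blocks of a ranked utterance list."""
    return [per(ranked_pairs[i:i + block_size])
            for i in range(0, len(ranked_pairs), block_size)]
```

For example, a two-phone reference decoded as five phones with three spurious insertions gives a PER of 150%. In this sketch, the final block may contain fewer than `block_size` utterances.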
Repeating the experiment for more languages, using the new TDNN-F models, generated the set of results given in Table 8. For each language, the amount of augmentation data (in hours) for the 30% selection criterion is displayed. With the exception of English, results for seven other languages with more than 20 h of acceptable Aux1 data were obtained. Comparing the TDNN-F baseline (base) results with those obtained for the systems trained on the augmented data (30%) showed that using the additional data did not result in improved system performance. The within-corpus modeling capability of the TDNN-F models remained similar.

4.6. Cross-Corpus Validation

Measuring recognition performance on a well-matched test set provides an indication of modeling efficiency, but in practice, ASR systems have much more utility if generalization to speech databases from other domains can be achieved. To simulate this scenario for acoustic modeling based on NCHLT data, the performance of the Afr models was evaluated on a different test set, the radio news data introduced at the beginning of the section.
Table 9 provides an overview of different acoustic models created by augmenting the NCHLT Afr training data with various selections of Aux1 and Aux2 data. The first entry (NCHLT-clean) corresponds to the TDNN-F baseline results for both recipes (see Table 7) and adds the new PERs that were obtained when validating on broadcast data (Radio). Here, the baseline result for the TDNN-F 1c recipe showed approximate agreement with the earlier findings in [19]. However, cross-corpus results for the TDNN-F 1d recipe improved even further.
To obtain the next three results, all auxiliary data passing the original alignment (cf. Table 1) was simply added to the NCHLT training data. Adding another 10 h of data increased the total amount of Aux1 augmentation data to 39.90 h, but did not further improve PERs. The same holds true when validating these models with RSG data (Aux1). Similarly, an attempt to augment training data with the entire set of Aux2 data did not improve recognition performance. Finally, more than doubling the training data by adding all 79.04 h of auxiliary data (Aux1 + Aux2) reduced the PER on cross-corpus data.
The bottom part of Table 9 shows the results of four experiments based on more refined selection efforts. Firstly, we added the radio data validation results for the Aux1 (30%) experiment. Interestingly, augmenting with this selection of higher quality Aux1 data did improve PER for the TDNN-F 1c recipe. The 30% selection criterion for the Aux1 Afr data approximately corresponded to selecting all harvested utterances with a PDP score higher than 0.53. With our setup, the PDP scoring resulted in a value of 1.00 for matching transcriptions. Applying an even stricter threshold (a PDP score of 0.85) selected only 8.17 h of Aux1 data, but provided another indication of improved PER for the TDNN-F 1d model. In fact, this effect holds when augmenting with 19.74 h of Aux2 data and the 0.85 PDP score threshold. For this configuration, both TDNN-F 1c and TDNN-F 1d model validations achieved lower PERs. The trend continued when these high confidence-based selections of auxiliary data were combined in experiment Aux1 + Aux2 (0.85 PDP).
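Threshold-based selection of this kind amounts to filtering utterances by their PDP score and accumulating the duration of what remains. The following minimal sketch assumes each utterance is described by a hypothetical (identifier, PDP score, duration in seconds) tuple; it is meant to illustrate the selection step, not to reproduce the scripts used for the experiments.

```python
def select_by_pdp(utterances, threshold):
    """Keep utterances whose PDP score exceeds `threshold`.

    `utterances` is an iterable of (utt_id, pdp_score, duration_seconds).
    Returns the selected IDs and their total duration in hours.
    """
    selected = [u for u in utterances if u[1] > threshold]
    hours = sum(duration for _, _, duration in selected) / 3600.0
    return [utt_id for utt_id, _, _ in selected], hours
```

Raising the threshold trades quantity for quality: fewer hours are retained, but the retained recordings match their prompts more closely.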

4.7. Discussion

As mentioned in Section 2, BLSTM training with six or more layers requires at least 100 h of speech data. However, each language component of the NCHLT corpus of the South African languages contains only about half this amount (56 h). We did not experiment with a pure BLSTM model since improved TDNN-BLSTM networks were available. Previously, these TDNN-BLSTM networks were successfully applied to all language components, resulting in significant improvement, even with the limited data [19].
With standard parameters, the more recent TDNN-F acoustic model recipes produced models capable of modeling NCHLT speech data even better than TDNN-BLSTMs. It was verified that the latest Kaldi example TDNN-F recipes, employing deeper networks, a smaller bottleneck-dimension, and higher cell-dimensions, outperformed previous baselines. Overall, the TDNN-F 1d recipe seemed to produce more consistent results with improved generalization to different datasets. This may be due not only to the deeper network and parameter settings, but also to the use of drop-out during training. Furthermore, drop-out combined with the deeper architecture of the TDNN-F 1d recipe seems to generate significantly improved results for all cross-corpus experiments.
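The effect of a smaller bottleneck dimension can be illustrated with simple parameter counting. The sketch below assumes a single square weight matrix replaced by two factors that meet at a narrow bottleneck, which is the basic idea behind factorized layers [32]. The dimensions are hypothetical and are not taken from the Kaldi recipes used in this study; real TDNN-F layers additionally splice temporal context and constrain one factor to be semi-orthogonal.

```python
def dense_params(in_dim, out_dim):
    """Weights in one unfactorized in_dim x out_dim layer (biases ignored)."""
    return in_dim * out_dim


def factorized_params(in_dim, out_dim, bottleneck_dim):
    """Weights when the layer is split as in_dim -> bottleneck -> out_dim."""
    return in_dim * bottleneck_dim + bottleneck_dim * out_dim


# Hypothetical sizes chosen only to make the arithmetic easy to follow.
hidden_dim = 1024
bottleneck_dim = 128

full = dense_params(hidden_dim, hidden_dim)
factored = factorized_params(hidden_dim, hidden_dim, bottleneck_dim)
ratio = factored / full

print(f"unfactorized: {full} weights")
print(f"factorized:   {factored} weights ({ratio:.0%} of unfactorized)")
```

With these illustrative sizes the factorized layer holds a quarter of the weights, which is one reason deeper networks remain trainable on roughly 56 h of speech per language.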
Unfortunately, with the limited auxiliary data, it is clear that the modeling capacity of the TDNN-F models did not increase beyond that of training on NCHLT-clean data only. In fact, within-corpus variability seemed to increase slightly: the Aux1 data augmentation experiment based on the 30% selection criterion did not consistently produce lower PERs across all languages. Possibly, more data of a comparable quality may be required for further improvement, since adding all 79.04 h of auxiliary data (Aux1 + Aux2) to the training data generalized better to the broadcast news data. An absolute PER reduction of 0.49 was achieved, which might not be statistically significant.
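For readers less familiar with the metric, the sketch below shows one conventional way to compute a PER and to distinguish absolute from relative reductions. It assumes that PER is the Levenshtein distance between reference and decoded phone sequences divided by the number of reference phones; this section does not spell out the scoring procedure, so the function names and toy phone strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences (lists of symbols)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phone
                curr[j - 1] + 1,         # insertion of a hypothesis phone
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(pairs):
    """PER in percent over (reference, hypothesis) phone-sequence pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / total


def absolute_reduction(baseline_per, new_per):
    """Difference in PER points."""
    return baseline_per - new_per


def relative_reduction(baseline_per, new_per):
    """Reduction as a percentage of the baseline PER."""
    return 100.0 * (baseline_per - new_per) / baseline_per


# Toy example: one substitution in eight reference phones.
pairs = [
    (["h", "a", "l", "o"], ["h", "e", "l", "o"]),
    (["k", "a", "t", "s"], ["k", "a", "t", "s"]),
]
toy_per = phone_error_rate(pairs)

# Invented PER values, used only to contrast the two kinds of reduction.
abs_red = absolute_reduction(20.0, 19.5)
rel_red = relative_reduction(20.0, 19.5)
```

A reduction of half a PER point on a baseline near 20% is only a few percent in relative terms, which is why small absolute gains on a single test set should be interpreted cautiously.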
Using a stricter threshold (0.85 PDP) did improve the TDNN-F 1d model’s generalization in all three experiments: Aux1, Aux2, and Aux1 + Aux2. Interestingly, the Aux2 data did show utility even though this data contains high numbers of repeated prompts and therefore only represents a limited vocabulary.

5. Conclusions

The aim of the study presented in this paper was to determine whether imperfect speech data could be used to improve the performance of ASR systems in under-resourced languages. The specific case considered involved data that was collected but not released because it did not meet project requirements. Given the severe lack of data in the languages under consideration, it was crucial to determine if acoustic modeling accuracy could be improved by adding this data to existing resources.
Results indicate that the additional data added very little to modeling capacity when the acoustic models were evaluated on matched test sets. In fact, recognition rates decreased slightly for some languages when the augmented datasets were used for training. In contrast, results obtained for a test set from a different corpus showed that the additional data did improve the models’ ability to maintain performance across different datasets. However, it remains clear that substantially more high quality data is required to improve ASR for South Africa’s 11 official languages.

Author Contributions

The individual contributions of the authors were as follows: conceptualisation, F.d.W. and J.B.; data curation, J.B.; formal analysis, J.B. and F.d.W.; funding acquisition, F.d.W.; investigation, J.B.; methodology, J.B. and F.d.W.; project administration, J.B. and F.d.W.; resources, F.d.W.; software, J.B.; supervision, F.d.W.; validation, F.d.W.; visualisation, F.d.W. and J.B.; writing-original, F.d.W. and J.B.; writing-review and editing, F.d.W. and J.B.


Funding

This research was funded by the South African Centre for Digital Language Resources (SADiLaR).


Acknowledgments

The authors are indebted to Andrew Gill of the Centre for High Performance Computing for providing technical support.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.


Abbreviations

The following abbreviations are used in this manuscript:
AST: African Speech Technology
ASR: automatic speech recognition
BLSTM: bi-directional LSTM
DAC: Department of Arts and Culture
DACST: Department of Arts, Culture, Science and Technology
DNN: deep neural network
DST: Department of Science and Technology
HTK: Hidden Markov Model Toolkit
LDA: linear discriminant analysis
LSTM: long short-term memory
MLLT: maximum likelihood linear transform
NCHLT: National Centre for Human Language Technology
PDP: phone-based dynamic programming
PER: phone error rate
SABC: South African Broadcasting Corporation
RCRL: Resources for Closely Related Languages
RSG: Radio Sonder Grense
SADiLaR: South African Centre for Digital Language Resources
SAMPA: Speech Assessment Methods Phonetic Alphabet
SAT: speaker adaptive training
SGMM: subspace Gaussian mixture models
TDNN: time delay neural networks
TDNN-F: factorized time delay neural networks
WER: word error rate
WSJ: Wall Street Journal


References

  1. Roux, J.C.; Louw, P.H.; Niesler, T. The African Speech Technology Project: An Assessment. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal, 1 January 2004; pp. 93–96.
  2. Badenhorst, J.; van Heerden, C.; Davel, M.; Barnard, E. Collecting and evaluating speech recognition corpora for 11 South African languages. Lang. Resour. Eval. 2011, 3, 289–309.
  3. Calteaux, K.; de Wet, F.; Moors, C.; van Niekerk, D.; McAlister, B.; Sharma-Grover, A.; Reid, T.; Davel, M.; Barnard, E.; van Heerden, C. Lwazi II Final Report: Increasing the Impact of Speech Technologies in South Africa; Technical Report; CSIR: Pretoria, South Africa, 2013.
  4. Barnard, E.; Davel, M.H.; van Heerden, C.; de Wet, F.; Badenhorst, J. The NCHLT speech corpus of the South African languages. In Proceedings of the 4th Workshop on Spoken Language Technologies for Under-Resourced Languages, St. Petersburg, Russia, 14–16 May 2014; pp. 194–200.
  5. De Wet, F.; Badenhorst, J.; Modipa, T. Developing speech resources from parliamentary data for South African English. Procedia Comput. Sci. 2016, 81, 45–52.
  6. Eiselen, R.; Puttkammer, M.J. Developing Text Resources for Ten South African Languages. In Proceedings of the Language Resource and Evaluation, Reykjavik, Iceland, 28 May 2014; pp. 3698–3703.
  7. Camelin, N.; Damnati, G.; Bouchekif, A.; Landeau, A.; Charlet, D.; Estève, Y. FrNewsLink: A corpus linking TV Broadcast News Segments and Press Articles. In Proceedings of the Language Resource and Evaluation, Miyazaki, Japan, 22 May 2018; pp. 2087–2092.
  8. Takamichi, S.; Saruwatari, H. CPJD corpus: Crowdsourced parallel speech corpus of japanese dialects. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan, 7–12 May 2018; pp. 434–437.
  9. Salimbajevs, A. Creating Lithuanian and Latvian speech corpora from inaccurately annotated web data. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan, 7–12 May 2018; pp. 2871–2875.
  10. Baumann, T.; Köhn, A.; Hennig, F. The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening. Lang. Resour. Eval. 2018, 1–27.
  11. de Vries, N.J.; Davel, M.H.; Badenhorst, J.; Basson, W.D.; de Wet, F.; Barnard, E.; de Waal, A. A smartphone-based ASR data collection tool for under-resourced languages. Speech Commun. 2014, 56, 119–131.
  12. Jones, K.S.; Strassel, S.; Walker, K.; Graff, D.; Wright, J. Multi-language speech collection for NIST LRE. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 23–28 May 2016; pp. 4253–4258.
  13. Ide, N.; Reppen, R.; Suderman, K. The American National Corpus: More Than the Web Can Provide. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Spain, 29–31 May 2002; pp. 840–844.
  14. Schalkwyk, J.; Beeferman, D.; Beaufays, F.; Byrne, B.; Chelba, C.; Cohen, M.; Kamvar, M.; Strope, B. “Your Word is my Command”: Google search by voice: A case study. In Advances in Speech Recognition; Springer: Boston, MA, USA, 2010; pp. 61–90.
  15. Cieri, C.; Miller, D.; Walker, K. Research Methodologies, Observations and Outcomes in (Conversational) Speech Data Collection. In Proceedings of the Second International Conference on Human Language Technology Research, San Diego, CA, USA, 24–27 May 2002; pp. 206–211.
  16. De Wet, F.; Louw, P.; Niesler, T. The design, collection and annotation of speech databases in South Africa. In Proceedings of the Pattern Recognition Association of South Africa (PRASA 2006), Bloemfontein, South Africa, 29 November–1 December 2006; pp. 1–5.
  17. Brümmer, N.; Garcia-Romero, D. Generative modeling for unsupervised score calibration. arXiv 2014, arXiv:1311.0707.
  18. Davel, M.H.; van Heerden, C.; Barnard, E. Validating Smartphone-Collected Speech Corpora. In Proceedings of the Third Workshop on Spoken Language Technologies for Under-resourced Languages, Cape Town, South Africa, 7–9 May 2012; pp. 68–75.
  19. Badenhorst, J.; Martinus, L.; De Wet, F. BLSTM harvesting of auxiliary NCHLT speech data. In Proceedings of the 2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA), Bloemfontein, South Africa, 28–30 January 2019; pp. 123–128.
  20. Young, S.; Evermann, G.; Gales, M.; Hain, T.; Kershaw, D.; Liu, X.; Moore, G.; Odell, J.; Ollason, D.; Povey, D.; et al. The HTK Book. Revised for HTK Version 3.4. 2009. Available online: (accessed on 27 June 2019).
  21. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hilton Waikoloa Village, Big Island, HI, USA, 11–15 December 2011.
  22. Badenhorst, J.; de Wet, F. The limitations of data perturbation for ASR of learner data in under-resourced languages. In Proceedings of the 2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech), Bloemfontein, South Africa, 30 November–1 December 2017; pp. 44–49.
  23. van Heerden, C.; Kleynhans, N.; Davel, M. Improving the Lwazi ASR baseline. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 3534–3538.
  24. Peddinti, V.; Povey, D.; Khudanpur, S. A time delay neural network architecture for efficient modeling of long temporal contexts. In Proceedings of the INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 3214–3218.
  25. Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K.J. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 328–339.
  26. Sak, H.; Senior, A.; Beaufays, F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv 2014, arXiv:1402.1128.
  27. Yu, Z.; Ramanarayanan, V.; Suendermann-Oeft, D.; Wang, X.; Zechner, K.; Chen, L.; Tao, J.; Ivanou, A.; Qian, Y. Using bidirectional LSTM recurrent neural networks to learn high-level abstractions of sequential features for automated scoring of non-native spontaneous speech. In Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 13–17 December 2015; pp. 338–345.
  28. Peddinti, V.; Wang, Y.; Povey, D.; Khudanpur, S. Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process. Lett. 2018, 25, 373–377.
  29. Karafiat, M.; Baskar, M.K.; Vesely, K.; Grezl, F.; Burget, L.; Černocký, J.C. Analysis of multilingual BLSTM acoustic model on low and high resource languages. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5789–5793.
  30. Huang, X.; Zhang, W.; Xu, X.; Yin, R.; Chen, D. Deeper Time Delay Neural Networks for Effective Acoustic Modeling. J. Phys. Conf. Ser. 2019, 1229, 012076.
  31. Chen, K.; Zhang, W.; Chen, D.; Huang, X.; Liu, B.; Xu, X. Gated Time Delay Neural Network for Speech Recognition. J. Phys. Conf. Ser. 2019, 1229, 012077.
  32. Povey, D.; Cheng, G.; Wang, Y.; Li, K.; Xu, H.; Yarmohammadi, M.; Khudanpur, S. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3743–3747.
  33. van der Westhuizen, E.; Niesler, T.R. An Analysis of the NCHLT Speech Corpora; Technical Report SU-EE-1501; Stellenbosch University, Department of Electrical and Electronic Engineering: Stellenbosch, South Africa, 2015.
  34. Loots, L.; Davel, M.; Barnard, E.; Niesler, T. Comparing manually-developed and data-driven rules for P2P learning. In Proceedings of the 20th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa, 30 November–1 December 2009; pp. 35–40.
  35. de Wet, F.; de Waal, A.; van Huyssteen, G.B. Developing a broadband automatic speech recognition system for Afrikaans. In Proceedings of the INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 27–31 August 2011; pp. 3185–3188.
  36. Peddinti, V.; Chen, G.; Povey, D.; Khudanpur, S. Reverberation robust acoustic modeling using i-vectors with time delay neural networks. In Proceedings of the INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 2440–2444.
  37. Ko, T.; Peddinti, V.; Povey, D.; Khudanpur, S. Audio augmentation for speech recognition. In Proceedings of the INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 3586–3589.
  38. Povey, D. Kaldi Librispeech TDNN-F 1c Chain Model Example Recipe. Available online: (accessed on 27 June 2019).
  39. Povey, D. Kaldi Librispeech TDNN-F 1d Chain Model Example Recipe. Available online: (accessed on 27 June 2019).
  40. Cheng, G.; Peddinti, V.; Povey, D.; Manohar, V.; Khudanpur, S.; Yan, Y. An exploration of dropout with LSTMs. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1586–1590.
  41. Jurafsky, D.; Martin, J. Speech and Language Processing; Prentice Hall: Upper Saddle River, NJ, USA, 2000; pp. 153–199.
Figure 1. Local phone error rates (PERs) for 400 utterance subsets of the Aux1 data.
Table 1. Total number of initial (Init) auxiliary recordings (Aux1 and Aux2), number of failed phone alignments (failed) and duration (dur) in hours of additional data per language.
Lang | Aux1 | Aux2
Table 2. Type and token counts for prompts only in NCHLT_TRN and only in NCHLT_TST. Aux1, Aux2: Type and token counts for prompts repeated in auxiliary data.
Ven14,18837,45613,08549,008673834,0374364394341 52700
Table 3. Type and token counts for prompts in both NCHLT_TRN and NCHLT_TST. Aux1, Aux2: Type and token counts for prompts repeated in auxiliary data. New unique: Type and token counts for new prompts in Aux1 and Aux2.
NCHLT_TRN_TST | Aux1 | Aux2 | New Unique Aux1 | New Unique Aux2
Table 4. Type and token counts for words only in NCHLT_TRN and only in NCHLT_TST. Aux1, Aux2: Type and token counts for words repeated in auxiliary data.
Table 5. Type and token counts for words in both NCHLT_TRN and NCHLT_TST. Aux1, Aux2: Type and token counts for words repeated in auxiliary data. New unique: Type and token counts for new words in auxiliary data.
NCHLT_TRN_TST | Aux1 | Aux2 | New Unique Aux1 | New Unique Aux2
Table 6. Claimed speaker overlap for matching and close matching metadata fields (names, ID, and telephone numbers) of speakers in the predefined NCHLT development (dev) and test (tst) sets.
Language | Aux2 | Match | Close Match
Table 7. PERs for TDNN-BLSTM and TDNN-F baseline systems per language (lowest PERs in bold).
Table 8. PERs for TDNN-F baseline (base) and 30% PER selection criterion systems per language (lowest PERs in bold).
Lang | Aux dur (h) | base | 30% | base | 30%
Table 9. PERs for two Afr test sets and TDNN-F systems trained on different augmented training sets (lowest PERs in bold).
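System | Aux dur (h) | TDNN-F 1c NCHLT | TDNN-F 1c Radio | TDNN-F 1d NCHLT | TDNN-F 1d Radio
Aux1 + Aux2 | 79.04 | 4.93 | 26.57 | 3.98 | 22.80
Aux1 (30%) | 27.79 | 4.87 | 25.55 | 4.59 | 23.68
Aux1 (0.85 PDP) | 8.17 | 4.64 | 28.09 | 4.48 | 22.87
Aux2 (0.85 PDP) | 19.74 | 5.06 | 25.47 | 4.22 | 22.65
Aux1 + Aux2 (0.85 PDP) | 28.56 | 4.89 | 27.04 | 4.23 | 22.06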

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.