First, in Section 5.1, we present the language modelling and N-best list rescoring results achieved by optimising the monolingual pretraining of the LSTM language models as described in Section 4.4. In Section 5.2, we show the improvements afforded by optimising the augmented n-gram language model, as well as the additional performance gained by using both the augmented n-grams and the pretrained LSTM for N-best rescoring, as described in Section 4.5. Finally, in Section 5.3, we present the results of applying fine-tuned, publicly available pretrained models, as described in Section 4.6, in N-best rescoring experiments.
5.1. Optimisation of LSTM Pretraining
In this section we present the results of the experiments that investigate the best pretraining strategy for the LSTM language model presented in Section 4.4. These results are reported both in terms of perplexity and in terms of word error rate after N-best rescoring. We compare the performance of the pretrained models with that of the baseline LSTM language model (N-LM_B), which is trained solely on the soap opera training data as described in Section 4.3. In each experiment, the LSTM is pretrained for between one and five epochs using data interleaved at either the batch or the sequence level, as described in Table 4.
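To make the two interleaving strategies concrete, the sketch below shows one way the two monolingual sets could be combined; the function names, batch size, and shuffling details are illustrative and are not taken from our implementation.

```python
import random

def interleave_sequences(corpus_a, corpus_b):
    """Sequence-level interleaving: individual sentences from both
    monolingual corpora are shuffled into a single training stream."""
    pooled = list(corpus_a) + list(corpus_b)
    random.shuffle(pooled)
    return pooled

def interleave_batches(corpus_a, corpus_b, batch_size=32):
    """Batch-level interleaving: each mini-batch contains sentences from
    only one corpus, and batches from the two corpora alternate."""
    batches_a = [corpus_a[i:i + batch_size] for i in range(0, len(corpus_a), batch_size)]
    batches_b = [corpus_b[i:i + batch_size] for i in range(0, len(corpus_b), batch_size)]
    interleaved = []
    # zip truncates to the shorter corpus; a fuller implementation would
    # also consume the remaining batches of the longer corpus.
    for a, b in zip(batches_a, batches_b):
        interleaved.extend([a, b])
    return interleaved
```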
Table 5 shows that, on average over all four language pairs and five training epochs, the model pretrained using only the out-of-domain monolingual data (N-LM_M) affords the largest improvement in perplexity, 42.87% relative to the baseline (N-LM_B). Additionally, we find that interleaving at the sequence level is better than interleaving at the batch level, affording an average relative perplexity improvement of 45.4% over the baseline (N-LM_B), compared to 40.33% for batch-level interleaving.
When these same language models are applied to 50-best list rescoring, similar trends are apparent in terms of speech recognition performance. In Table 6, we find that interleaving the monolingual sets at the sequence level again affords the largest improvement, both in overall word error rate and in code-switched bigram error. On average over all four language pairs and all pretraining epochs, this strategy yields absolute improvements in the average test set word error rate and code-switched bigram error of 1.65% and 1.26%, respectively, compared to the baseline (ASR_B), outperforming batch-level interleaving, which affords average improvements of 1.56% and 1.13%.
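A minimal sketch of the 50-best rescoring step is shown below, assuming each hypothesis carries an acoustic score and an n-gram LM score expressed as log-probabilities; the interpolation weight and the exact score combination are placeholders rather than the values used in our experiments.

```python
def rescore_nbest(nbest, neural_lm_score, lm_weight=0.5):
    """Rescore an N-best list by interpolating the original n-gram LM
    score with the neural LM score of each hypothesis.

    nbest:           list of (words, acoustic_score, ngram_lm_score)
                     tuples, with scores as log-probabilities.
    neural_lm_score: callable returning the neural LM log-probability
                     of a word sequence (e.g. the pretrained LSTM).
    """
    rescored = []
    for words, ac_score, ngram_score in nbest:
        lm_score = (1.0 - lm_weight) * ngram_score + lm_weight * neural_lm_score(words)
        rescored.append((words, ac_score + lm_score))
    # Return the hypothesis with the highest combined score.
    return max(rescored, key=lambda h: h[1])[0]
```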
When pretraining on the synthetic code-switched data, the results in both Table 5 and Figure 2 indicate that subsequent fine-tuning on the soap opera data is not successful. In fact, columns S and S_M of the figure make it clear that, for all four languages, the code-switched losses immediately diverge when fine-tuning begins. We believe this may be caused by over-fitting on the synthetic data, and we therefore investigated pretraining for fewer training batches (1, 100, 500, and 1000). However, we found that models incorporating the synthetic data are still outperformed by those pretrained on the monolingual data. When pretraining on the monolingual data, performance over code-switches is clearly poor, but it improves during fine-tuning. In fact, these models (M) exhibit better performance over code-switches than the models that are exposed to synthetic code-switched data during pretraining (S and S_M).
We therefore conclude that the best pretraining strategy in terms of speech recognition performance is to mix the monolingual datasets at the sequence level and to pretrain for three or four epochs. Given this strategy, selecting the model with the best development set word error rate over the five pretraining epochs (highlighted in Table 6) affords absolute test set word error rate improvements of 3.17%, 0.42%, 1.16%, and 2.47% for isiZulu, isiXhosa, Sesotho, and Setswana, respectively, compared to the baseline speech recognition system (ASR_B) outlined in Section 4.2. We note a deterioration in speech recognition at code-switches (CSBG) of 1.74% and 1.13% for isiXhosa and Sesotho, respectively, while isiZulu and Setswana improve by 4.03% and 3.32%.
5.2. N-gram Augmentation
In Table 7, we present both the language model perplexities and the speech recognition word error rates obtained with the interpolated n-gram language models trained on the respective corpora outlined in Table 4. We find that the interpolated models that incorporate n-grams from the soap opera data, the synthetic code-switched data, and the monolingual data (LM_B+S+M) on average afford the largest improvements in development set word error rate as well as perplexity, and improve the test set word error rate by between 1.97% and 3.24% absolute compared to the baseline (ASR_B). Additionally, absolute improvements in code-switched bigram error of 1.23% and 3.19% compared to the baseline are achieved for isiZulu and Setswana, respectively, while isiXhosa and Sesotho deteriorate by 0.29% and 0.75%, respectively. By incorporating the additional monolingual and synthetic data when training an interpolated n-gram model, we are able to consistently improve recognition accuracy, by 2.53% absolute on average, compared to the baseline model.
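The interpolation itself is a standard linear mixture of the component n-gram models; the sketch below illustrates the idea, with mixture weights that would in practice be tuned on the development set rather than supplied as the placeholder values shown here.

```python
def interpolated_ngram_prob(word, history, models, weights):
    """Linearly interpolate component n-gram models, e.g. the soap opera
    (B), synthetic code-switched (S) and monolingual (M) models.

    models:  list of objects exposing prob(word, history)
    weights: mixture weights summing to one, tuned on the development set
    """
    return sum(w * m.prob(word, history) for m, w in zip(models, weights))
```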
Table 7 also shows the results of rescoring the N-best lists with the best-performing pretraining strategy (N-LM_M) identified in Section 5.1. Specifically, the table presents the test set results corresponding to the best development set word error rate among the five pretrained and fine-tuned models in Table 6. We rescore the N-best lists generated both by the n-gram models trained using the soap opera and monolingual data (LM_B+M) and by those that incorporate the soap opera, monolingual, and synthetic data (LM_B+S+M). On average over the four language pairs, we find that rescoring the hypotheses generated using the n-gram incorporating the soap opera, monolingual, and synthetic data (LM_B+S+M) leads to the largest improvements in the development set word error rate compared to the baseline (ASR_B). These language models achieve corresponding improvements in the test set word error rate of 3.5% on average compared to the baseline, and of 0.98% compared to the n-gram trained using the soap opera, synthetic, and monolingual data. We find that including the synthetic data improves both language modelling and speech recognition when it is used for n-gram augmentation. This is in contrast to the lack of improvement seen when the same data is used for LSTM pretraining. This is consistent with our expectation, since the data was optimised specifically to improve speech recognition when used for n-gram augmentation.
It is also clear that the improvements afforded by rerunning the speech recognition experiments after n-gram augmentation are similar to those afforded by the rescoring experiments. This suggests that, especially in computationally constrained settings, augmenting only the n-gram models used for lattice generation should be favoured, since it is far less computationally expensive to implement. More specifically, when comparing the results achieved when the optimally pretrained model is used to rescore the baseline N-best hypotheses (ASR_B + N-LM_M in Table 7) with those achieved by utilising the same data for n-gram augmentation (LM_B+M), we find that, on average over the four language pairs, the n-gram augmentation outperforms the rescoring by 0.69% absolute in overall speech recognition accuracy. However, the rescoring method outperforms the n-gram augmentation in terms of code-switched recognition accuracy by 1.38% absolute on average. This suggests that the neural language models are better able to model the code-switching phenomenon, while the augmented n-grams improve the modelling of monolingual stretches of speech.
Overall, rescoring the N-best lists produced by the augmented n-gram LM_B+S+M produces the lowest development set word error rate for isiZulu and Setswana, with corresponding test set improvements of 4.23% and 4.45% absolute compared to the baseline (ASR_B). Additionally, we improve the speech recognition accuracy over code-switches for the same languages by 2.05% and 3.13% absolute compared to the baseline. The best development set word error rate for isiXhosa and Sesotho is achieved by rescoring the N-best lists produced by n-gram LM_B+M, leading to test set improvements of 1.81% and 2.69% absolute compared to the baseline. We find, however, that speech recognition at code-switches is worse than the baseline for both of these languages. We conclude that utilising the additional data for both n-gram augmentation (LM_B+S+M) and LSTM pretraining (N-LM_M) for N-best rescoring offers the largest consistent improvements in speech recognition accuracy, and outperforms either strategy employed alone.
5.3. Large Pretrained Language Models
In Table 8 we present the overall test set speech recognition error rate (WER) as well as the speech recognition error rate specifically over code-switches (CSBG) for the five considered pretrained architectures, as discussed in Section 4.6. Each pretrained model is used in rescoring experiments either in a zero-shot setting or after fine-tuning for between one and ten epochs on either the respective bilingual corpus or the pooled data from all four sub-corpora. After fine-tuning, the model that affords the largest development set word error rate improvement over the ten fine-tuning epochs is selected and used to rescore the corresponding test set N-best list. The resulting test set word error rate is listed in the table.
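As a sketch of how a pretrained causal model can score N-best hypotheses for rescoring, the snippet below uses the Hugging Face transformers library; the distilgpt2 checkpoint stands in for the distilled GPT-2 model of Section 4.6, the fine-tuning step is omitted, and bidirectional models such as M-BERT would instead require a pseudo-log-likelihood formulation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the experiments consider several pretrained
# architectures, as described in Section 4.6.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

def hypothesis_log_prob(sentence):
    """Log-probability of a hypothesis under a causal LM, recovered from
    the model's mean per-token cross-entropy loss."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    # The loss is averaged over the predicted tokens (sequence length - 1).
    return -loss.item() * (ids.size(1) - 1)
```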
It is clear from Table 8 that marginal average speech recognition improvements in zero-shot rescoring are possible for all the bidirectional models, but not for GPT-2. On average over the four language pairs, an absolute improvement of 0.4% is achieved compared to the baseline (ASR_B). While better results were achieved by rescoring with the LSTMs trained only on the in-domain data (N-LM_B), it is nevertheless interesting that these models are able to improve the recognition accuracy even in zero-shot settings. We believe that this may be due to the language-agnostic sub-word encoding strategy used by all the large models in Table 8. These encodings allow the models to learn rich embeddings across languages. Unlike our own LSTM language model, which distinguishes between words in different languages by means of appended language tags, the multilingual models use sub-word encodings that make no such distinction. We hypothesise that this allows the models to benefit from the additional training data in related target languages. We aim to explore this hypothesis in further research.
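To make this contrast concrete, the snippet below shows how a multilingual sub-word tokeniser segments a code-switched utterance without any language labels, whereas our LSTM receives word-level tokens with appended language tags; the example sentence and the tag format are illustrative only.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Hypothetical English-isiZulu code-switched utterance.
utterance = "uyazi that is not fair"

# Language-agnostic sub-word pieces, shared across languages:
print(tokenizer.tokenize(utterance))

# Word-level tokens with appended language tags, as seen by our LSTM
# (tag format is illustrative):
print(["uyazi_zul", "that_eng", "is_eng", "not_eng", "fair_eng"])
```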
When fine-tuning the pretrained models on the pooled bilingual sets of in-domain data, we find that absolute speech recognition accuracy is improved by 1.79% on average compared to the baseline, over all the models and the four language pairs, while models fine-tuned on only the bilingual data afford improvements of 1.23% compared to the baseline.
Interestingly, we note that the BERT model trained on 11 African languages (afriBERTa-S) is outperformed by the BERT model (M-BERT) trained on much more text in unrelated languages. Preliminary experiments using the larger afriBERTa-base, which is comparable in model size to M-BERT, indicated even higher word error rates than those achieved with the smaller afriBERTa-S model.
Over all the large transformer models and language pairs, we find that distilled GPT-2 and multilingual BERT, both fine-tuned on the pooled soap opera training data, afford the largest improvements in development set speech recognition, both overall and over language switches. In fact, the best fine-tuned GPT-2 model outperforms our own LSTM rescoring model pretrained on the out-of-domain bilingual corpora, as outlined in Section 4, by an average of 0.60% absolute on the test set over the four language pairs. Similar trends are seen in terms of the code-switched speech recognition error rate, where GPT-2 outperforms the LSTM model by 1.34% absolute on average over the four language pairs. These results are remarkable, given that the data on which GPT-2 was pretrained did not include any of the target languages considered in this work, and that the model uses a vocabulary that is not optimised to represent those languages.
Furthermore, the multilingual BERT model also affords consistent improvements over all four language pairs in both overall speech recognition and recognition across code-switches. On average over the four language pairs, this model improves the test set word error rate by 2.18% and the code-switched bigram error by 2.67% compared to the baseline system (ASR_B).
When rescoring the N-best lists produced using the augmented n-grams (LM_B+S+M), the performance of M-BERT surpasses that of GPT-2. As shown in Table 7, the multilingual BERT model affords, on average over the four language pairs, improvements of 4.66% and 4.45% in development and test set speech recognition accuracy, respectively. The improvement in code-switched performance achieved by the bidirectional model is further strengthened, improving the baseline test set recognition performance by 3.52% and outperforming the GPT-2 model by 1.41%.
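Since a bidirectional model such as M-BERT does not define a left-to-right probability over a hypothesis, rescoring requires a different scoring rule; one common choice, shown as a sketch below under the assumption that a pseudo-log-likelihood is used, masks each token in turn and sums the log-probability of the original token.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def pseudo_log_likelihood(sentence):
    """Score a hypothesis with a masked LM by masking each token in turn
    and accumulating the log-probability assigned to the original token."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[ids[i]].item()
    return total
```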
We conclude that fine-tuning large transformer models pretrained on unrelated languages can improve speech recognition accuracy more effectively than carefully fine-tuned LSTM models pretrained on data in the target languages. In terms of the effective use of computational resources, this is an encouraging result: it shows that under-resourced languages can benefit from large models pretrained on well-resourced languages, even when the under-resourced languages are completely unrelated to those used to train the larger models.
5.4. LSTM + BERT Rescoring
In a final set of experiments, we interpolate the N-best scores obtained by the best LSTM (Section 5.1) and the best M-BERT (Section 5.3) models. We optimise the interpolation weight over the range 0.05 to 0.95, as shown in Figure 3. In the figure, interpolation weights closer to one assign more weight to the BERT scores, while weights closer to zero assign more weight to the LSTM scores.
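A minimal sketch of this score interpolation and weight sweep is given below, assuming the per-hypothesis log scores from both models are already available; the selection and WER computation callables are placeholders for illustration.

```python
import numpy as np

def combine_scores(lstm_scores, bert_scores, weight):
    """Interpolate per-hypothesis scores; weights near one favour BERT,
    weights near zero favour the LSTM."""
    return (1.0 - weight) * np.asarray(lstm_scores) + weight * np.asarray(bert_scores)

def sweep_weights(dev_nbest, wer_of_selection, step=0.05):
    """Select the interpolation weight that minimises development set WER.

    dev_nbest:        list of (lstm_scores, bert_scores) pairs per utterance.
    wer_of_selection: callable mapping the chosen hypothesis indices to a WER.
    """
    best = None
    for weight in np.arange(0.05, 0.95 + 1e-9, step):
        selection = [int(np.argmax(combine_scores(l, b, weight))) for l, b in dev_nbest]
        wer = wer_of_selection(selection)
        if best is None or wer < best[1]:
            best = (weight, wer)
    return best  # (best weight, corresponding development set WER)
```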
It is clear from Table 7 that utilising a combination of both architectures for rescoring marginally improves overall speech recognition performance (by 0.4% absolute) for all language pairs except English-isiXhosa. However, recognition performance over code-switches is not improved. Additionally, utilising both models incurs the severe computational overhead of training both architectures, as well as requiring each model to rescore the N-best lists.
In future work, we aim to train a large multilingual transformer (comparable to M-BERT) using our South African language data, in order to better assess the performance of the fine-tuned transformer architectures explored here. The LSTM models we have considered receive word-level tokens with language-dependent, closed vocabularies, while the transformer models utilise language-agnostic sub-word encoding strategies. By closing this gap, we hope to achieve further benefits.