1. Introduction
Automatic speech recognition (ASR) has long been one of the goals of artificial intelligence research. Its history dates to around the 1950s, when there was an attempt to recognise spoken digits from a single speaker [1]. Subsequent developments have continuously improved the recognition performance of ASR systems, in some settings reaching human-level accuracy [2,3]. Its use can therefore be witnessed in several application areas, ranging from home appliances [4] and automobiles [5] to health care [6] and tourism [7], to name just a few.
Generally, there are two major approaches to designing an ASR system: a classical approach and an end-to-end deep learning approach. In the classical ASR setting, the vital aspects of speech recognition, namely acoustics, phonetics and linguistics, are modelled separately using expertise drawn from signal processing and natural language processing. An ASR system is then built from those basic components. Meanwhile, an end-to-end ASR compresses all the essential components into a single neural network that tries to capture the relationship between the input speech signal and its transcription.
Considering the widespread presence of ASR systems as part of today’s AI-based systems, it is tempting to conclude that the speech recognition task is, to some extent, a solved problem. This might be true for high-resource languages, such as English, but we argue that it is not necessarily true for other minority languages, such as Thai, let alone its dialects, such as the Northern dialect. To the best of our knowledge, Suwanbandit et al. [8] introduced a large-scale Thai dialect speech corpus and demonstrated the effectiveness of ASR systems across Thai-Central and regional dialects. The proposed work differs from previous studies mainly in that we develop an ASR system specifically for the Northern dialect to enhance the robustness and accuracy of dialect-specific ASR.
Nonetheless, there exist a few ASR systems for Standard Thai speech that are closely related to our task. Most Standard Thai ASR systems are based on the traditional approach. For example, an HMM-based isolated-word speech recognition system was proposed in [9], where a tone detection function was added to detect tone levels. The study reported that the added component improved the accuracy of ASR for the Thai language. A more recent HMM-based ASR model for Thai was developed and embedded in the scoring system of MoCA’s language fluency assessment [10]. The authors proposed a hybrid time-delay neural network hidden Markov model (TDNN-HMM) as the acoustic model, which was reported to improve the system’s robustness to noise and variation in voice quality.
There are several challenges in designing an ASR system for a minority, regional or dialectal language. Firstly, for such a low-resource language it is not immediately clear which of the two major approaches to ASR would be better. Although end-to-end systems tend to outperform the classical approach [11,12], they are known to be data intensive compared to the traditional approach and often fail to generalise well with limited training examples, as in our case where only a 50 h corpus is available. Furthermore, scaling the dataset beyond 50 h presents both financial and logistical challenges. It is then natural to ask whether pre-training the deep learning model on readily available corpora can alleviate the problem. Moreover, it remains unclear which language should be used to pre-train the model. From a transfer learning perspective, pre-training on languages from the same family is preferable, but would this still hold true in this case? Secondly, it takes time and resources to curate a good quality speech corpus, and whether a perfectly transcribed, albeit small, corpus directly translates to superior ASR performance remains debatable. This is because, with a limited sample size, learning from a perfectly transcribed speech corpus may lead to overfitting. On the other hand, allowing some sloppiness in the training transcriptions could act as a regularisation regime that mitigates the overfitting problem.
There have been a few recent attempts to address the problem of constructing an ASR system for low-resource languages using the state-of-the-art end-to-end approach. Interestingly, none of the recent work directly modifies the neural architecture of the existing end-to-end models. This suggests the generality of the existing end-to-end models and that only language-specific customisation is needed to improve recognition performance. The previously proposed strategies can be categorised into two streams: model pre-training from closely related languages, and manipulation of the input/output to/from the end-to-end model. Model pre-training from resource-rich languages is among the most common approaches employed when training data is scarce, for example in the case of the low-resource Tibetan language [13] and Turkic languages [14]. In terms of input/output manipulation, the work in [15] put forward the idea of representing related alphabets in the low-resource language using a common alphabet and reconstructing the original alphabets from the output of the end-to-end model with the help of an external module. Meanwhile, the work in [16] manipulated the output of the end-to-end model by incorporating a more complex recurrent neural network-based language model instead of the widely used N-gram model. Despite this recent progress, all of the work on low-resource languages relies on language-specific knowledge, and the previously presented solutions might not be directly applicable to our case.
Motivated by these research questions and challenges, we set out to investigate two major approaches to building an ASR system for the Northern Thai dialect. The effects of corpus quality and of auxiliary information, in the form of tonal information, model pre-training and language models, on the performance of both approaches shall also be studied. We consider the following to be our main contributions.
We identified key linguistic features of the Northern Thai dialect and proposed to incorporate such features in the traditional HMM-based ASR and the end-to-end deep learning ASR models.
We studied the effects of transcription quality, tonal information and auxiliary information, in the form of model pre-training and language models, on the recognition performance of the ASR systems.
We extensively validated both approaches on a manually curated 50 h Northern Thai language dialect speech corpus.
The empirical results from the studies are expected to shed light on the essential components required for developing an ASR system for the Northern Thai language. Additionally, our findings help confirm that some previously established outcomes, such as the impact of tonal information and language models demonstrated for various languages, still hold true in the case of the Northern Thai dialect.
The rest of the paper is organised as follows. Section 2 gives an overview of the current approaches to ASR. Section 3 introduces the Northern Thai dialect and identifies its linguistic features. Corpus collection and the details of the ASR models employed in this study are given in Section 4. Section 5 presents the experimental results and discussion, while Section 6 concludes the study.
2. Background and Related Work
Suppose we are given a sequence of windowed speech signals X := x_1, …, x_T of length T, where x_i is a vector of signal amplitudes. We are also provided with the transcription Y := y_1, …, y_N, where N denotes the number of words in the sequence. We note that T need not equal N and that the sequences X and Y are not necessarily aligned. ASR can be seen as the task of devising a function f: X → Y, which maps a speech signal to its corresponding transcription with high accuracy. In the case of the traditional ASR, we can view the function f as a composite function, e.g., f(·) = (d ∘ h ∘ g)(·), where g(·) is a function that extracts features from the input signal, h(·) is a function that captures acoustic relationships and d(·) acts as a decoding function. This is illustrated in Figure 1. Alternatively, the function could be modelled using a deep neural network, as depicted in Figure 2.
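To make the composition concrete, the following minimal sketch (an illustration only, with placeholder component functions) spells out how the classical pipeline factors into the three stages described in Section 2.1.

```python
# Conceptual sketch of the classical pipeline as the composite f = d(h(g(x))).
# The component bodies are placeholders; Sections 2.1.1-2.1.3 describe what
# real implementations of g, h and d compute.

def g(signal):
    """Feature extraction, e.g., MFCC plus delta features (Section 2.1.1)."""
    raise NotImplementedError

def h(features):
    """Acoustic model scores, e.g., GMM-HMM likelihoods P(X|Y) (Section 2.1.2)."""
    raise NotImplementedError

def d(acoustic_scores):
    """Decoding with a language model P(Y) into a word sequence (Section 2.1.3)."""
    raise NotImplementedError

def f(signal):
    """The classical ASR system as the composition f = d ∘ h ∘ g."""
    return d(h(g(signal)))
```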
To this end, we shall remind the readers of the two most common approaches to ASR, namely the classical approach and the end-to-end approach. We shall then give an overview of studies concerning speech recognition of dialects that are closely related to the Northern Thai language.
2.1. Classical ASR
We regard the traditional chain of isolated tasks, namely feature extraction, acoustic modelling, language modelling and finally state decoding, as the classical approach. Probabilistically, the most probable transcription Y* can be estimated by maximising P(Y|X) over all possible word sequences Y [11] that could be formed using the words from a finite vocabulary:

Y* = argmax_Y P(Y|X)  (1)
   = argmax_Y P(X|Y) P(Y)  (2)

where the second equation is obtained using Bayes’ rule [17,18] and the evidence term P(X), which does not depend on Y, is dropped from the maximisation. Traditionally, the conditional probability term P(X|Y) is modelled using an acoustic model, while the prior probability P(Y) is taken care of by a language model [19].
2.1.1. Feature Extraction
The first step in the classical ASR is to extract features from the raw sound input. Mel-frequency cepstral coefficients (MFCCs) [20] are usually employed at this stage. The cepstral coefficient features may be subjected to de-correlation and supervised dimensionality reduction, such as linear discriminant analysis (LDA) [21], where the classes are taken to be the HMM states, of which there are around 2000–3000 in general. A maximum likelihood linear transform (MLLT) [22,23,24] is then used to further de-correlate the features; the reason is to facilitate state modelling using diagonal covariance matrices. The next step is to normalise the features by taking into account speaker variability using feature-space maximum likelihood linear regression (fMLLR) [24]. We note that the MFCC features consider only the information within each phonetic frame without considering the relationship between frames. Since the underlying signal is continuous, recognition performance can be improved by incorporating the dynamics between phonetic frames. Here, delta features, which approximate the time derivatives of the cepstral coefficient trajectories, can be employed to capture these dynamics.
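As an illustration, the following sketch (not the authors’ exact front-end, which additionally applies LDA, MLLT and fMLLR) extracts 13 MFCCs plus delta and delta-delta features with the librosa library; the file name is a placeholder.

```python
# Minimal MFCC + delta front-end sketch using librosa.
import librosa
import numpy as np

# Hypothetical file name; the corpus audio is mono WAV, resampled here to 16 kHz.
signal, sr = librosa.load("utterance_0001.wav", sr=16000)

# 13 static cepstral coefficients per 25 ms frame with a 10 ms hop,
# a conventional front-end configuration for HMM-based ASR.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

# Delta and delta-delta features approximate the local time derivatives of the
# cepstral trajectories, capturing the dynamics between phonetic frames.
delta = librosa.feature.delta(mfcc, order=1)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])  # shape: (39, num_frames)
print(features.shape)
```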
2.1.2. Acoustic Modelling
Acoustic modelling is essentially the process of estimating the probability that a segment of acoustic features comes from a class of phonemes. Basically, it is the task of modelling the class-conditional probability P(X|Y) in Equation (2). Traditionally, an HMM [25,26,27] is used for this purpose. In HMMs for speech recognition, the states correspond to phonemes and the observations are the acoustic feature vectors, e.g., MFCC features. A Gaussian mixture model (GMM) can be incorporated to estimate the class-conditional probabilities of the observed acoustic feature vectors given the phonemes. The mixture model and the HMM are trained using the EM algorithm. Later, the HMM-GMM acoustic model [28] could be replaced with a deep neural network (DNN) in the hope of capturing more complex relationships. We refer to the work in [29] for more detail.
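For illustration, a toy monophone-style acoustic model in this spirit can be built with the hmmlearn library, as sketched below; the feature matrices are synthetic placeholders, and a full Kaldi recipe additionally ties triphone states through a phonetic decision tree.

```python
# Toy GMM-HMM acoustic model sketch with hmmlearn (illustrative only).
import numpy as np
from hmmlearn.hmm import GMMHMM

# Hypothetical training data: (num_frames, 39) feature matrices for one phoneme class.
utterances = [np.random.randn(120, 39), np.random.randn(95, 39)]
X = np.concatenate(utterances)
lengths = [u.shape[0] for u in utterances]

# Three emitting states, diagonal-covariance Gaussian mixtures,
# trained with the EM (Baum-Welch) algorithm.
model = GMMHMM(n_components=3, n_mix=4, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

# The per-class log-likelihood acts as the acoustic score for P(X | phoneme class).
print(model.score(utterances[0]))
```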
2.1.3. Language Modelling
The primary function of the language model is to represent how likely we are to observe the sequence Y in the absence of any observed acoustic features, i.e., P(Y). In other words, the language model is a probabilistic model of word sequences. One of the most widely used language models is the N-gram model [30], which essentially estimates the probability of N words appearing together, in a specific order, by looking at the co-occurrence of those N words. A transcription obtained from an ASR system that considers both the acoustic model and the language model tends to be more accurate.
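The following toy sketch (hypothetical word-segmented sentences, maximum-likelihood estimates without smoothing) illustrates the co-occurrence counting behind a bigram model.

```python
# Toy bigram estimation: P(word | history) = count(history, word) / count(history).
# Real systems add smoothing, e.g., the modified Kneser-Ney discounting used later in this paper.
from collections import Counter

# Hypothetical word-segmented training sentences.
corpus = [["<s>", "ไป", "ไหน", "</s>"],
          ["<s>", "ไป", "กาด", "</s>"]]

bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(w for s in corpus for w in s)

def p_bigram(word, history):
    # Maximum-likelihood conditional probability from co-occurrence counts.
    return bigrams[(history, word)] / unigrams[history]

print(p_bigram("ไป", "<s>"))   # 1.0
print(p_bigram("กาด", "ไป"))   # 0.5
```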
2.2. Deep Learning-Based End-to-End ASR
Differing from the classical ASR, an end-to-end ASR uses a single model to directly map the speech waveform with pitch information into the target word sequence. Recent studies demonstrated that the end-to-end approach outperformed the classical HMM-based ASR in terms of word error rate. One of the most widely used end-to-end models is the Wav2Vec 2.0 model [31]. The model showed promising performance for several downstream tasks, such as speech recognition for Brazilian Portuguese [32] and Bengali [33]. The model has also been found useful for other speech-related tasks, such as emotion recognition from speech [34]. Wav2Vec 2.0 is a self-supervised learning framework for finding representations of a raw audio input. It utilises a multi-layer convolutional neural network and a Transformer network to transform an input acoustic signal into latent speech representations and contextualised representations, respectively. The contextualised representations are then used to generate an output sequence.
To overcome the need for a large collection of labelled speech for training, Wav2Vec 2.0 is often trained in two phases. The first phase involves self-supervised learning, which allows the model to be pre-trained on unlabelled data. The model is then fine-tuned on data from a particular language for a specific downstream task. XLSR [35] and XLS-R [36] are other promising models based on Wav2Vec 2.0. These models rely on self-supervised learning on an extensive set of languages, ranging from 53 languages (XLSR) to 128 languages (XLS-R). Moreover, XLS-R is pre-trained on tonal languages, including Mandarin and Vietnamese, which are similar to Thai in their use of tone. Like Wav2Vec 2.0, XLSR and XLS-R are usually employed by fine-tuning the recognition part of the model on a new downstream language task while keeping the representation learning part unchanged. This assumes that spoken languages share the majority of their acoustic features [37,38,39], similar to the implicit hypothesis used in transfer learning in computer vision [40].
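As a concrete illustration of how such a fine-tuned model is typically used, the sketch below loads a CTC-head checkpoint with the Hugging Face transformers library and decodes one utterance greedily; the checkpoint path and file name are placeholders rather than published artefacts.

```python
# Greedy CTC decoding with a fine-tuned Wav2Vec 2.0 / XLS-R checkpoint (sketch only).
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_name = "path/to/finetuned-xlsr-checkpoint"  # hypothetical checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

speech, sr = librosa.load("utterance_0001.wav", sr=16000)  # model expects 16 kHz input
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

# Greedy decoding: best token per frame, then collapse repeats and blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```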
2.3. Speech Recognition for Dialects
Since our goal is to build an ASR system for Northern Thai, a dialect of the Thai language, it is necessary to review the existing work on dialect speech recognition, especially for Asian languages. The majority of dialect speech recognition work is centred around Indian and Chinese languages, since India has 22 official languages while China has eight main dialect groups. For Indian languages, ASR systems were developed for Kannada [41], Malayalam [42] and Punjabi [43] using the Kaldi toolkit [44]. In [42], an ASR system for Malayalam was developed following the classical HMM-based approach, and MFCC, LDA and MLLT features were extracted and used as acoustic features. The results showed that the triphone model with MLLT features yields the best accuracy. In [43], the acoustic model was created using a GMM-HMM. The results showed that LDA + MLLT features outperformed Delta + Delta-Delta features, and that LDA + MLLT + SAT features worked better than LDA + MLLT features. Furthermore, the triphone model gives the best accuracy for both Malayalam and Punjabi speech recognition. In [41], the acoustic model was created using a GMM and a subspace GMM (SGMM). In order to capture all possible pronunciations, the speech data were collected from different regions, and ASR systems were developed separately for each district. The results showed that a WER of 9.34–11.25% was obtained. Moreover, pitch features were considered when building the ASR systems, and the results showed that the performance of ASR on this tonal language improves with pitch features. Most Indian languages are low-resource languages; this is similar to Northern Thai, and the reported configurations could therefore be useful in our case.
Chinese also has various dialects, such as Standard Mandarin, Shanghainese, Hokkien, Sichuanese or Minnan, to name a few. The work in [45] studied an end-to-end neural network using the Oriental Language Recognition (OLR) dataset to build an ASR system for Hokkien, Sichuanese, Shanghainese and Cantonese, which are low-resource languages. They used the speech processing toolkit ESPnet, which integrates the Transformer, advanced data augmentation and the Conformer. To overcome the lack of dialect data resources, they proposed adding unrelated languages for joint training and adding a Chinese language model for joint decoding in dialect speech recognition. The results showed that the WER was improved by 12% compared to the baseline system.
3. Northern Thai Dialect
In this Section, we introduce the readers to the Northern Thai dialect. We shall highlight the linguistic features of the dialect and compare it with Standard Thai. Northern Thai is a Thai dialect used for communication between speakers in Northern Thailand, including Chiang Rai, Chiang Mai, Phayao, Lamphun, Lampang, Mae Hong Son, Phrae, and Nan provinces. It is recognised by a variety of names in both popular and academic circles, including Kham Muang, Lanna, and Tai Yuan. In terms of dialectology, dialect differences are the result of two important factors: language change and the expansion of speech communities [46]. Hence, as a regional dialect of Thai, it is well known that Northern Thai differs from Standard Thai in terms of its sound system and lexical items.
Although Northern Thai and Standard Thai share common linguistic characteristics, their systematic differences have important implications for speech recognition system design. Previous research has shown that dialectal speech recognition requires careful consideration of phonological variation, lexical differences, and prosodic patterns, particularly when working with low-resource languages [1,2,3]. Understanding the linguistic differences between Northern Thai and Standard Thai is therefore essential for making informed decisions about feature selection and model architecture.
3.1. Sound System
Compared with Standard Thai, the sound system of Northern Thai is the linguistic characteristic that shows the most obvious difference. This difference can be observed in terms of consonants, vowels, and tones. The consonants of Northern Thai and Standard Thai differ in terms of their initial consonants and consonant clusters. Northern Thai has one fewer consonant phoneme than Central Thai: based on IPA-based phonological descriptions, the Central Thai consonant inventory consists of 21 phonemes, whereas Northern Thai consists of 20 consonant phonemes [47,48,49].
The phoneme that appears in Standard Thai but not in Northern Thai is /r/. The variation from /r/ in Standard Thai to /h/ and /l/ in Northern Thai is a phenomenon generally recognised by most Thais.
In the opposite direction, Northern Thai preserves the initial consonant /ɲ/ as a distinct phoneme, whereas in Standard Thai it has largely merged with /j/ due to sound change. However, this distinction is increasingly threatened in Northern Thai due to the influence of Standard Thai [50]. In addition, another notable difference between Northern Thai and Standard Thai is that most of the unaspirated plosives in Northern Thai correspond to aspirated plosives in Standard Thai. The consonant clusters are another characteristic that differs between the two languages. Northern Thai has a slightly smaller number of phoneme clusters than Standard Thai. Northern Thai has 11 phoneme clusters, i.e., /tw/, /cw/, /kw/, /kʰw/, /sw/, /ɲw/, /ŋw/, /lw/, /jw/, /ʔw/, and /hw/, whereas Standard Thai has 12 phoneme clusters, i.e., /pr/, /pʰr/, /tr/, /tʰr/, /kr/, /kʰr/, /pl/, /pʰl/, /kl/, /kʰl/, /kw/, and /kʰw/. However, some consonant clusters correspond to each other; for example, /pr/ in Standard Thai corresponds to /ph/ in Northern Thai.
In terms of vowels, there is no difference in the number of vowel phonemes between Northern Thai and Standard Thai. Both languages consist of the same 24 vowel phonemes. They can be divided into 18 single vowel phonemes and six diphthongs. The single vowels are divided into nine short vowels and nine long vowels that contrast with each other. In terms of diphthongs, both languages have diphthongs consisting of a high vowel followed by /a/, i.e., /ia/, /iːa/, /ɯa/, /ɯːa/, /ua/, and /uːa/.
Although Northern Thai and Standard Thai are both tonal languages, they have different tonal systems. They differ in the number of phonologically contrastive tones: Northern Thai has six tones while Standard Thai has only five [51,52,53]. Among these differences, there is a tone that appears only in Northern Thai, namely a high-level tone with glottal closure.
3.2. Lexical Items
Although Northern Thai and Standard Thai are both Tai languages, each has developed a large number of its own lexical items. As a result, there are lexical differences between the two languages. Considering the form and meaning of words, the lexical differences between Northern Thai and Standard Thai can be divided into two categories: lexical items that sound the same but have different meanings, and lexical items that have the same meaning but are pronounced differently [53].
Lexical items in the first category are homonyms: they have the same consonants, vowels and tones in both languages but have different meanings, or the meaning in Northern Thai may be broader than in Standard Thai. Examples include:
หัน [hǎn]: ‘to see’ (NT); ‘to turn away’ (CT)
แก่น [kɛ̀n]: ‘core’ (both); also ‘classifier for egg/fruit’ (NT only)
For lexical items that have the same meaning but are pronounced differently, they are usually nouns used to refer to plants, animals, organs, things, or tools:
‘peppermint’: NT หอมด่วน [hǒ:m.dù:an], CT สะระแหน่ [sa.ra.nɛ̀:]
‘parrot’: NT นกแล [nók.lɛ:], CT นกแก้ว [nók.kɛ̂:w]
‘nose’: NT ดัง [daŋ], CT จมูก [tɕa.mù:k]
‘belt’: NT สายฮ้าง [sǎ:j.há:ŋ], CT เข็มขัด [khěm.khàt]
Considering the aforementioned differences, it is obvious that an ASR system built for Standard Thai might not work optimally for Northern Thai. This prompts the development of the Northern Thai dialect speech recognition system in this study. We expect that incorporating the key linguistic features in the form of the tonal information and Northern Thai language model could improve recognition performance of the existing ASR systems that we are building upon.
4. Methodology
In this Section, we introduce the workflow of the ASR systems under investigation. Figure 3 summarises the two configurations, from input stage to output stage, that we tested in the experiments. In the input stage, two versions of the corpus are possible: original and corrected. Next, auxiliary information in the form of tone can be incorporated into the HMM-based model. In the modelling stage, an HMM-based model (Kaldi) and an end-to-end model (XLS-R) are considered. A language model can be employed in the decoding stage before we obtain the predicted transcription, from which the recognition performance of any ASR configuration can be measured. We now delve into the details of the input stage (Section 4.1), the HMM-based model (Section 4.2) and the end-to-end model (Section 4.3).
4.1. Northern Thai Dialect Corpus
The Northern Thai dialect speech corpus, which is part of PaSCoNT [54], was collected from two hundred native speakers who have been living in Chiang Mai province for more than 18 years. The corpus is composed of speech data from daily dialogues containing commands, requests, questions, and narratives about daily talk, weather, travel, making merit, product prices, route inquiries, requests for help, general storytelling (such as songs, dramas, movies, singers and actors), and news [54]. The speakers are 100 males and 100 females of different age groups between 18 and 75 years old. Each speaker was assigned 100 Northern Thai dialect sentences. In total, 50 h of speech in 20,000 files were recorded. The audio was recorded at a 44 kHz sample rate, 16-bit mono, without any sound compression, using Rode microphones. The microphone was kept 3 to 4 inches from the speaker’s mouth. In this study, voices were recorded in both open and closed environments.
After the speech recording was finished, all WAV files were carefully listened to by a native speaker of Northern Thai with linguistic expertise in the dialect. Annotation was performed by the same linguist, who checked the correspondence between the text transcription and the speech of each utterance and edited the text transcription to align with the recorded speech. This is necessary because some words could be accidentally omitted or inserted by the speakers during recording.
We note that the annotator was blinded to the identity of the speakers. The transcriptions were then randomly sampled and verified by the authors. All text transcriptions of the utterances were transcribed in the International Phonetic Alphabet (IPA) to express their pronunciation. In the phonetic transcription of the utterances, the tonal annotation was integrated into the words, and a hyphen was inserted between syllables in words containing more than one syllable to indicate syllable boundaries. Further details regarding data annotation and transcription are provided in [54]. Ultimately, we ended up with two speech corpora differing in their transcriptions: original versus corrected. The transcription of the first corpus is taken directly from the text script given to the speakers; we refer to this corpus as original. The transcription of the second corpus was manually checked and corrected by the experts; we refer to this corpus as corrected. These two corpora will be used to investigate the impact of corpus quality on ASR.
It is worth noting that Thai is an unsegmented language, with no boundary markers between words. Moreover, the word segmentation in Northern Thai and Standard Thai differs for some words; for example, “ya-pai” in Standard Thai corresponds to “ja-pai-pai” in Northern Thai. Therefore, the corrected transcriptions described above were manually segmented by the expert. The segmented transcriptions can then be used for constructing language models. We also note that IPA symbols were used as the standard phoneme set. From this, we created the pronunciation dictionary (lexicon) of the Northern Thai dialect, which consists of 450,674 words and 3329 vocabulary items.
4.2. Hidden Markov Model-Based ASR
The first ASR system is implemented using the Kaldi toolkit [44], which can accommodate both language modelling and acoustic modelling. The decoder is based on finite-state transducers (FSTs) and incorporates the language model, the pronunciation dictionary (lexicon), and context dependency into a single decoding graph.

In this study, we developed two types of acoustic models: one using a regular hidden Markov model (HMM) and the other using a DNN-HMM based on a time-delay neural network (TDNN) architecture [55,56,57]. This architecture constructs larger networks from sub-components in a modular and progressive manner and has been demonstrated to efficiently learn the temporal dynamics of the signal from short-term feature representations. As a starting point, we employed a Gaussian mixture model (GMM) with a tri-state HMM topology, with probability density functions tied through a phonetic decision tree. The 13 MFCCs plus pitch information serve as the input feature basis for the acoustic models. The details of the training and the hyperparameter settings are described in Appendix A.1.
We utilised the SRILM toolkit [58] with modified Kneser–Ney (KN) discounting [59] to train 3-, 4-, and 5-gram language models (LMs). KN smoothing is based on absolute-discounting interpolation, which reallocates some probability mass from the 4-gram or 3-gram models to simpler unigram models. This smoothing technique constructs a lower-order distribution that is consistent with the smoothed higher-order distribution. The training data for the LMs contains 450,674 words from the Northern Thai dialect speech corpus.
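For readers who prefer a programmatic illustration, the sketch below trains a small interpolated Kneser–Ney trigram model with NLTK on hypothetical word-segmented sentences; it is only a stand-in for the SRILM recipe used in this study.

```python
# Interpolated Kneser-Ney trigram LM sketch with NLTK (illustrative only).
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical word-segmented training sentences.
sentences = [["ไป", "กาด", "ก่อน"],
             ["ไป", "ไหน", "มา"]]

order = 3
train_ngrams, vocab = padded_everygram_pipeline(order, sentences)

lm = KneserNeyInterpolated(order)
lm.fit(train_ngrams, vocab)

# Probability of a word given its two-word (bigram) history.
print(lm.score("กาด", ["<s>", "ไป"]))
```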
Figure 4 presents the overview of the HMM-based model as implemented using the Kaldi toolkit.
4.3. Deep Neural Network-Based ASR
We employed five variants of deep neural networks with varying numbers of parameters and sizes of pre-training data. All of the models were trained in an end-to-end manner: that is, the model receives a raw speech signal as input and outputs character-based transcriptions. The models under consideration are all based on the current state-of-the-art XLSR and XLS-R models. We note that XLSR and XLS-R share a similar network architecture; that is, they have the same settings for the number of blocks, the number of hidden states, the inner dimension of the feed-forward blocks, the number of attention heads, and the total number of parameters [36]. The difference between XLSR and XLS-R lies in the size and the number of languages in the training corpus. XLSR was trained on roughly 56K hours of public speech data in 53 languages, while XLS-R was trained on 436K hours of speech audio in 128 languages.
As such, we shall refer to both models as XLS-R. We experiment with XLS-R models of 300 million and 1 billion parameters. Larger models generally yield better performance, provided that there is sufficient data to train the model. With limited data, as in our case, the effect of model size on recognition performance remains to be investigated.

To alleviate the negative effect of small data, we consider XLS-R models pre-trained on publicly available speech corpora of 53 languages and 128 languages. We also consider an XLS-R model pre-trained on a closely related language, namely Standard Thai. This is to examine the degree to which pre-training can alleviate the data requirement of deep learning models.
Lastly, we also incorporate word-level N-gram language models of order 3 to 5 to further improve the transcriptions from the character-based end-to-end models. For a fair comparison, the language models are the same as those used in the HMM-based ASR. To this end, we summarise all the end-to-end deep learning models in Table 1.
The details of fine-tuning the models and the framework of the proposed end-to-end model (XLS-R) for the Northern Thai dialect are presented in Appendix A.2. Figure 5 illustrates the XLS-R model employed in this study.
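As a rough illustration of the fine-tuning setup (not the exact recipe in Appendix A.2), the sketch below attaches a character-level CTC head to the 300M-parameter XLS-R checkpoint with the Hugging Face transformers library; the vocabulary file, output directory and hyperparameters are placeholders.

```python
# Fine-tuning sketch: a character-level CTC head on the XLS-R 300M encoder,
# with the convolutional feature extractor frozen, as is common for small corpora.
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC,
                          TrainingArguments, Trainer)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json",  # hypothetical character vocabulary
                                 unk_token="[UNK]", pad_token="[PAD]",
                                 word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the CNN front-end fixed during fine-tuning

args = TrainingArguments(output_dir="xlsr-northern-thai",  # hypothetical
                         per_device_train_batch_size=8, num_train_epochs=30,
                         learning_rate=3e-4, fp16=True)
# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=...)
# trainer.train()
```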
6. Conclusions
We studied two major approaches to ASR for the low-resource Northern Thai dialect. The traditional approach was implemented using the Kaldi toolkit, while we adopted the XLS-R model for the end-to-end approach. We investigated the ingredients that contribute to the performance of the system, namely corpus quality, tonal information, model pre-training and the augmentation with a language model. From the experimental results, we found that (1) transcription quality improved the performance of the HMM-based systems but not that of the end-to-end systems; (2) tonal information can be employed to improve the recognition performance of the HMM-based system; (3) pre-training the end-to-end model on closely related languages generally helps the performance of the end-to-end systems; and (4) auxiliary information in the form of a language model also improves the performance of the end-to-end systems. Ultimately, we learned that an XLS-R-based end-to-end system pre-trained on a Standard Thai speech corpus with a 5-gram LM, trained on the mildly noisy corpus, yields the best performance of 0.94 WER.
In future work, we plan to pursue research in the direction of personalised ASR systems. This can be achieved using a dynamic language model that adapts to the context and the user’s personality. Auxiliary information from speech signals, such as entropy-based features [62], will be useful for differentiating a speaker’s personality. Emotion embedded in the speech, which can be detected based on a wavelet [63] representation of the signal, can also be employed to make the language model more adaptive.