Article

The Development of Northern Thai Dialect Speech Recognition System

by
Jakramate Bootkrajang
1,2,
Papangkorn Inkeaw
1,2,
Jeerayut Chaijaruwanich
1,
Supawat Taerungruang
3,
Adisorn Boonyawisit
2,
Bak Jong Min Sutawong
2,
Vataya Chunwijitra
4 and
Phimphaka Taninpong
2,5,*
1
Department of Computer Science, Faculty of Science, Chiang Mai University, 239, Muang, Chiang Mai 50200, Thailand
2
Data Science Research Center, Faculty of Science, Chiang Mai University, 239, Muang, Chiang Mai 50200, Thailand
3
Department of Thai, Faculty of Humanities, Chiang Mai University, 239, Muang, Chiang Mai 50200, Thailand
4
NECTEC, National Science and Technology Development Agency (NSTDA), 111 Thailand Science Park Phahonyothin Road, Khlong Nueng, Khlong Luang, Pathum Thani 12120, Thailand
5
Department of Statistics, Faculty of Science, Chiang Mai University, 239, Muang, Chiang Mai 50200, Thailand
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 160; https://doi.org/10.3390/app16010160
Submission received: 24 November 2025 / Revised: 18 December 2025 / Accepted: 20 December 2025 / Published: 23 December 2025
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)

Abstract

This study investigated the necessary ingredients for the development of an automatic speech recognition (ASR) system for the Northern Thai language. Building an ASR model for such an arguably low-resource language poses challenges in terms of both the quantity and the quality of the corpus. The experimental results demonstrated that the current state-of-the-art deep neural network, trained in an end-to-end manner and pre-trained on a closely related language such as Standard Thai, often outperformed its traditional HMM-based counterparts. The results also suggested that incorporating Northern Thai-specific tonal information and augmenting the character-based end-to-end model with an n-gram language model further improved the recognition performance. Surprisingly, the quality of the transcription of the speech corpus was not found to positively correlate with the recognition performance in the case of the end-to-end system. The results show that the end-to-end ASR system achieved the best word error rate (WER) of 0.94 on out-of-sample data. This is equivalent to 77.02% and 60.34% relative word error rate reductions over the 4.09 and 2.37 WERs of the traditional TDNN-HMM and the vanilla deep neural network baselines.

1. Introduction

Automatic speech recognition has long been one of the goals of artificial intelligence research. Its history dates back to around the 1950s, when there was an attempt to recognise spoken digits from a single speaker [1]. Subsequent developments have continuously improved the recognition performance of ASR systems up to human-level accuracy [2,3]. ASR can therefore be found in several application areas including home appliances [4], automobiles [5], health care [6] and tourism [7], to name just a few.
Generally, there are two major approaches to designing an ASR system: a classical approach and an end-to-end deep learning approach. In the classical ASR setting, vital aspects of speech recognition namely acoustics, phonetics and linguistics are to be modelled separately using expertise drawn from signal processing and natural language processing. An ASR system is then built from those basic components. Meanwhile, an end-to-end ASR compresses all the essential components into a neural network that tries to capture the relationship between the input speech signal and its transcription.
Considering the widespread presence of ASR systems as part of today’s AI-based systems, it is tempting to conclude that the speech recognition task is, to some extent, a solved problem. This might be true for high-resource languages, such as English, but we argue that it is not necessarily true for other minority languages, such as Thai, let alone its dialects such as the Northern dialect. To the best of our knowledge, Suwanbandit et al. [8] introduced a large-scale Thai dialect speech corpus and demonstrated the effectiveness of ASR systems across Central Thai and regional dialects. The proposed work differs from previous studies mainly in that we develop an ASR system specifically for the Northern dialect to enhance the robustness and accuracy of dialect-specific ASR.
Nonetheless, there exist a few ASR systems for Standard Thai speech that are closely related to our task. Most Standard Thai ASR systems are based on the traditional approach. For example, an HMM-based isolated-word speech recognition system was proposed in [9], where a tone detection function was added to detect tone levels. The study reported that the added component can improve the accuracy of ASR for the Thai language. A recent HMM-based ASR model for the Thai language was developed and embedded in a scoring system for the MoCA language fluency assessment [10]. The authors proposed a hybrid time-delay neural network hidden Markov model (TDNN-HMM) as the acoustic model, which was reported to improve the system’s robustness to noise and variation in voice quality.
There are several challenges in designing an ASR system for a minority, regional or dialectal language. Firstly, for such a low-resource language it is not immediately clear which of the two major approaches to ASR would be better. Although end-to-end systems tend to outperform the classical approach [11,12], they are known to be data intensive compared to the traditional approach and are often unable to generalise well under limited training examples, as in our case where only a 50 h corpus is available. Furthermore, scaling the dataset size beyond 50 h presents both financial and logistical challenges. It is then natural to ask whether pre-training the deep learning model using readily available corpora can alleviate the problem. Moreover, it also remains unclear which language we should use to pre-train the model. From a transfer learning perspective, pre-training using languages from the same family is preferable, but would this still hold true in this case? Secondly, it takes time and resources to curate a good quality speech corpus, and whether a perfectly transcribed, albeit small, corpus directly translates to superior ASR performance remains debatable. This is because, with a limited sample size, learning from a perfectly tagged speech corpus may lead to overfitting. On the other hand, allowing some sloppiness in the training transcriptions could act as a regularisation regime that mitigates the overfitting problem.
There have been a few recent attempts to address the problem of constructing an ASR system for low-resource languages using the state-of-the-art end-to-end approach. Interestingly, none of the recent work directly modifies the neural architecture of the existing end-to-end models. This suggests the generality of the existing end-to-end models, and that only language-specific customisation is needed to improve the recognition performance. The previously proposed strategies can be categorised into two streams: model pre-training from closely related languages, and manipulation of the input/output to/from the end-to-end model. Model pre-training from resource-rich languages is among the most common approaches employed when training data is scarce, for example in the case of the low-resource Tibetan language [13] and Turkic languages [14]. In terms of input/output manipulation, the work in [15] put forward the idea of representing related alphabets in the low-resource language using a common alphabet and reconstructing the original alphabets from the output of the end-to-end model with the help of an external module. Meanwhile, the work in [16] manipulated the output of the end-to-end model by incorporating a more complex recurrent neural network-based language model instead of the widely used n-gram model. Despite the recent progress, all of the work on low-resource languages relies on language-specific knowledge, and the previously presented solutions might not be directly applicable to our case.
Motivated by these research questions and challenges, we set out to investigate two major approaches to building an ASR system for the Northern Thai dialect. The effects of corpus quality and of auxiliary information, in the form of tonal information, model pre-training and language models, on the performance of both approaches shall also be studied. We consider the following to be our main contributions.
  • We identified key linguistic features of the Northern Thai dialect and proposed to incorporate such features in the traditional HMM-based ASR and the end-to-end deep learning ASR models.
  • We studied the effects of transcription quality, tonal information and auxiliary information, in the form of model pre-training and language models, on the recognition performance of the ASR systems.
  • We extensively validated both approaches on a manually curated 50 h Northern Thai language dialect speech corpus.
The empirical results from these studies are expected to shed light on the essential components required for developing an ASR system for the Northern Thai dialect. Additionally, our findings help confirm that some previously established outcomes, such as the impact of tonal information and language models demonstrated in various languages, still hold true in the case of the Northern Thai dialect.
The rest of the paper is organised as follows. Section 2 gives an overview of the current approaches to ASR. Section 3 introduces the Northern Thai dialect from which its linguistic features are identified. Corpus collection and the details of the ASR models employed in this study are given in Section 4. Section 5 presents the experimental results and discussions while Section 6 concludes the study.

2. Background and Related Work

Suppose we are given a sequence of windowed speech signals X := x1, …, xT of length T, where xi is a vector of signal amplitudes. We are also provided with the transcription Y := y1, …, yN, where N denotes the number of words in the sequence. We note that T need not equal N and the sequences X, Y are not necessarily aligned. ASR can be seen as the task of devising a function f: X → Y, which maps a speech signal to its corresponding transcription with high accuracy. In the case of traditional ASR, we can view the function f as a composite function, i.e., f(·) = (d∘h∘g)(·), where g(·) is a function that extracts features from the input signal, h(·) is a function that captures acoustic relationships and d(·) acts as a decoding function. This is illustrated in Figure 1. Alternatively, the function could be modelled using a deep neural network, as depicted in Figure 2.
To this end, we shall remind the readers of the two most common approaches to ASR, namely the classical approach and the end-to-end approach. We shall then give an overview of studies on dialect speech recognition that are closely related to our task.

2.1. Classical ASR

We regard the traditional chain of isolated tasks, namely feature extraction, acoustic modelling, language modelling and finally state decoding, as the classical approach. Probabilistically, the most probable transcription Y* can be estimated by maximising P(Y|X) over the set W of all possible word sequences [11] that could be formed using the words from a finite vocabulary:
$$
\begin{aligned}
Y^{*} &= \underset{Y \in W}{\arg\max}\; P(Y \mid X) \\
      &= \underset{Y \in W}{\arg\max}\; P(X \mid Y)\, P(Y)
\end{aligned}
$$
where the second equality is obtained using Bayes’ rule [17,18]. Traditionally, the conditional probability term P(X|Y) is modelled using an acoustic model, while the prior probability P(Y) is taken care of by a language model [19].

2.1.1. Feature Extraction

The first step in the classical ASR is to extract features from the raw sound input. Mel-frequency cepstral coefficients (MFCCs) [20] are usually employed at this stage. The cepstral coefficient features may be subjected to de-correlation and supervised dimensionality reduction, such as linear discriminant analysis (LDA) [21], where the classes are taken to be the HMM states, of which there are generally around 2000–3000. A maximum likelihood linear transform (MLLT) [22,23,24] is then used to further de-correlate the features, which facilitates state modelling with diagonal covariance matrices. The next step is to normalise the features to account for speaker variability using feature-space maximum likelihood linear regression (fMLLR) [24]. We note that the MFCC features describe each phonetic frame in isolation without considering the relationship between consecutive frames. Since the underlying signal is continuous, we can improve recognition performance by incorporating the dynamics between phonetic frames. Here, delta features, which approximate the temporal derivatives of the cepstral coefficients across neighbouring frames, can be employed to capture these dynamics.
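To make this feature pipeline concrete, the following is a minimal sketch of MFCC and delta computation using the librosa library; the file name, sampling rate and frame settings are illustrative assumptions rather than the exact configuration of our Kaldi recipe.

```python
import librosa
import numpy as np

# Load a speech file (file name and sampling rate are placeholders).
signal, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per 25 ms frame with a 10 ms shift (400/160 samples at 16 kHz).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Delta and delta-delta features approximate the temporal derivatives of the
# cepstral coefficients across neighbouring frames.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Stack into a 39-dimensional feature vector per frame.
features = np.vstack([mfcc, delta, delta2]).T
print(features.shape)  # (num_frames, 39)
```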

2.1.2. Acoustic Modelling

Acoustic modelling is essentially the process of estimating the probability of a segment of acoustic features coming from a class of phonemes. Basically, it is the task of modelling the class-conditional probability P(X|Y) in Equation (2). Traditionally, an HMM [25,26,27] is used for this purpose. In an HMM for speech recognition, the states are phonemes and the observations are the acoustic feature vectors, e.g., MFCC features. A Gaussian mixture model (GMM) can be incorporated to estimate the class-conditional probabilities of the observed acoustic feature vectors given the phonemes. This mixture model and the HMM are trained using the EM algorithm. More recently, the GMM component of the HMM-GMM acoustic model [28] has been replaced with a deep neural network (DNN) in the hope of capturing more complex relationships. We refer to the work in [29] for more detail.
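As an illustration of the GMM-HMM idea (not the actual Kaldi triphone system described later), a per-phoneme model can be sketched with the hmmlearn library; the tri-state topology, eight mixture components and random placeholder features below are assumptions for the example only.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# One GMM-HMM per phoneme class: 3 emitting states with 8 diagonal-covariance
# Gaussian mixtures per state, mirroring a common tri-state topology.
model = GMMHMM(n_components=3, n_mix=8, covariance_type="diag", n_iter=20)

# X: frames of acoustic features (e.g., 39-dim MFCC+deltas) for this phoneme,
# concatenated over utterances; lengths gives the number of frames per utterance.
X = np.random.randn(500, 39)   # placeholder features
lengths = [250, 250]           # placeholder utterance lengths
model.fit(X, lengths)

# Score a new segment: log P(X | phoneme), the class-conditional term.
segment = np.random.randn(40, 39)
print(model.score(segment))
```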

2.1.3. Language Modelling

The primary function of the language model is to represent how likely we are to observe the sequence Y in the absence of any observed acoustic features, i.e., P(Y). In other words, the language model is a probabilistic model of word sequences. One of the most widely used language models is the n-gram model [30], which estimates the probability of N words appearing together, in a specific order, from the co-occurrence counts of those N words. A transcription obtained from an ASR system that considers both the acoustic model and the language model tends to be more accurate.
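The following is a minimal sketch of how an n-gram model estimates sequence probabilities from co-occurrence counts (maximum-likelihood estimates without the smoothing used in practice); the toy corpus is purely illustrative.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-gram histories over tokenised sentences."""
    ngrams, histories = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngrams[gram] += 1
            histories[gram[:-1]] += 1
    return ngrams, histories

def ngram_prob(ngrams, histories, gram):
    """Maximum-likelihood P(w_n | w_1..w_{n-1}); real systems add smoothing."""
    return ngrams[gram] / histories[gram[:-1]] if histories[gram[:-1]] else 0.0

# Toy tokenised corpus (word segmentation is assumed to be done already).
corpus = [["ไป", "ตลาด"], ["ไป", "วัด"], ["ไป", "ตลาด"]]
bigrams, hist = ngram_counts(corpus, 2)
print(ngram_prob(bigrams, hist, ("ไป", "ตลาด")))  # 2/3
```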

2.2. Deep Learning-Based End-to-End ASR

Differing from the classical ASR, an end-to-end ASR uses a single model to directly map the speech waveform, together with its pitch information, to the target word sequence. Recent studies have demonstrated that the end-to-end approach outperforms the classical HMM-based ASR in terms of word error rate. One of the most widely used end-to-end models is the Wav2Vec 2.0 model [31]. The model has shown promising performance on several downstream tasks, such as speech recognition for Brazilian Portuguese [32] and for Bengali [33]. The model has also been found to be useful for related speech tasks, such as emotion recognition from speech [34]. Wav2Vec 2.0 is a self-supervised learning framework for finding representations of a raw audio input. It utilises a multi-layer convolutional neural network and a Transformer network for transforming an input acoustic signal into latent speech representations and contextualised representations, respectively. The contextualised representations are then used to generate an output sequence.
To overcome the need for a large collection of labelled speech for training the model, Wav2Vec 2.0 is often trained in two phases. The first phase involves self-supervised learning, which allows us to pre-train the model on unlabelled data. The model is then fine-tuned on data from a particular language for a specific downstream task. XLSR [35] and XLS-R [36] are other promising models based on Wav2Vec 2.0. These models rely on self-supervised learning on an extensive set of languages, ranging from 53 languages (XLSR) to 128 languages (XLS-R). Moreover, XLS-R is pre-trained on tonal languages, including Mandarin and Vietnamese, which are similar to Thai in their use of tone. Like Wav2Vec 2.0, XLSR and XLS-R are usually employed by fine-tuning the recognition part of the model on a new downstream language task, while keeping the representation learning part unchanged. This assumes that spoken languages share the majority of their acoustic features [37,38,39], similar to the implicit hypothesis used in transfer learning in computer vision [40].
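As a concrete illustration of this fine-tuning setup, the sketch below loads a pre-trained XLS-R checkpoint with the Hugging Face transformers library and attaches a fresh CTC head for a new character vocabulary. The checkpoint name, the vocab.json file and the freezing call reflect our assumptions about a typical configuration rather than the exact recipe used in this study.

```python
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

# Character-level tokenizer built from a Northern Thai vocabulary file
# (vocab.json is assumed to map characters, including tone marks, to ids).
tokenizer = Wav2Vec2CTCTokenizer("vocab.json",
                                 unk_token="[UNK]", pad_token="[PAD]",
                                 word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# Load the 300M-parameter XLS-R encoder and size the CTC output layer to the
# Northern Thai character set.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

# Keep the convolutional feature encoder frozen during fine-tuning.
model.freeze_feature_encoder()
```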

2.3. Speech Recognition for Dialects

Since our goal is to build an ASR system for Northern Thai, a dialect of the Thai language, it is necessary to review the existing work on dialect speech recognition, especially for Asian languages. The majority of dialect speech recognition work is centred around Indian and Chinese languages, as India has 22 official languages while China has eight main dialect groups. For Indian languages, ASR systems have been developed for Kannada [41], Malayalam [42] and Punjabi [43] using the Kaldi toolkit [44]. In [42], an ASR system for Malayalam was developed following the classical HMM-based approach, and MFCC, LDA and MLLT features were extracted and used as acoustic features. The results showed that using the triphone model with MLLT features yields the best accuracy. In [43], the acoustic model was created using GMM-HMM. The results showed that LDA + MLLT features obtained superior performance to Delta + Delta-Delta features, and LDA + MLLT + SAT features worked better than LDA + MLLT features. Furthermore, the triphone model gives the best accuracy for both Malayalam and Punjabi speech recognition. In [41], the acoustic model was created using GMM and Subspace GMM (SGMM). In order to capture all possible pronunciations, the speech data was collected from different regions, and ASR systems were developed separately for each district, yielding WERs of 9.34–11.25%. Moreover, pitch features were considered when building the ASR systems, and the results showed that the performance of ASR improves with pitch features for a tonal language. Most Indian languages are low-resource languages, which is similar to Northern Thai, and so the reported configurations could be useful in our case.
Chinese also has various dialects, such as Standard Mandarin, Shanghainese, Hokkien, Sichuanese and Minnan, to name a few. The work in [45] studied an end-to-end neural network using the Oriental Language Recognition (OLR) dataset to build an ASR system for Hokkien, Sichuanese, Shanghainese and Cantonese, which are low-resource languages. The authors used the speech processing toolkit ESPnet, which integrates the Transformer, advanced data augmentation and the Conformer. To overcome the lack of dialect data resources, they proposed adding unrelated languages for joint training and adding a Chinese language model for joint decoding in dialect speech recognition. The results showed that the WER was improved by 12% compared to the baseline system.

3. Northern Thai Dialect

In this Section, we introduce the readers to the Northern Thai dialect. We shall highlight linguistic features of the dialect and compare it with Standard Thai. Northern Thai is a Thai dialect used for communication between speakers in Northern Thailand, including Chiang Rai, Chiang Mai, Phayao, Lamphun, Lampang, Mae Hong Son, Phrae, and Nan provinces. It is recognised by a variety of names in both popular and academic circles, including Kham Muang, Lanna, and Tai Yuan. In terms of dialectology, dialect differences are the result of two important factors: language change and the expansion of speech communities [46]. Hence, it is well known that Northern Thai, as a regional dialect of Thai, differs from Standard Thai in terms of its sound system and lexical items.
Although Northern Thai and Standard Thai share common linguistic characteristics, their systematic differences have important implications for speech recognition system design. Previous research has shown that dialectal speech recognition requires careful consideration of phonological variation, lexical differences, and prosodic patterns, particularly when working with low-resource languages [1,2,3]. Understanding the linguistic differences between Northern Thai and Standard Thai is therefore essential for making informed decisions about feature selection and model architecture.

3.1. Sound System

The sound system is the linguistic characteristic in which Northern Thai differs most obviously from Standard Thai. This difference can be observed in terms of consonants, vowels, and tones. The consonants of Northern Thai and Standard Thai differ in their initial consonants and consonant clusters. It was found that Northern Thai has one fewer consonant phoneme than Central Thai. Based on IPA-based phonological descriptions, the Central Thai consonant inventory consists of 21 phonemes whereas Northern Thai consists of 20 consonant phonemes [47,48,49].
Among these differences, the phoneme that appears in Standard Thai but not in Northern Thai is /r/. The correspondence of /r/ in Standard Thai to /h/ and /l/ in Northern Thai is a phenomenon generally recognised by most Thais.
In the opposite direction, Northern Thai preserves the initial consonant /ɲ/ as a distinct phoneme, while in Standard Thai it has largely merged with /j/ due to sound change. However, this distinction is increasingly threatened in Northern Thai due to the influence of Standard Thai [50]. In addition, another notable difference between Northern Thai and Standard Thai is that most of the unaspirated plosive sounds in Northern Thai correspond to aspirated plosive sounds in Standard Thai. The consonant clusters are another characteristic that differs between the two languages. Northern Thai has a slightly smaller number of phoneme clusters than Standard Thai. Northern Thai has 11 phoneme clusters, i.e., /tw/, /cw/, /kw/, /khw/, /sw/, /ɲw/, /ŋw/, /lw/, /jw/, /ʔw/, and /hw/, whereas Standard Thai has 12 phoneme clusters, i.e., /pr/, /phr/, /tr/, /thr/, /kr/, /khr/, /pl/, /phl/, /kl/, /khl/, /kw/, and /khw/. However, some consonant clusters correspond to each other; for example, /pr/ in Standard Thai corresponds to /ph/ in Northern Thai.
In terms of vowels, there is no difference in the number of vowel phonemes between Northern Thai and Standard Thai. Both languages consist of the same 24 vowel phonemes. They can be divided into 18 single vowel phonemes and six diphthongs. The single vowels are divided into nine short vowels and nine long vowels that contrast with each other. In terms of diphthongs, both languages have diphthongs consisting of a high vowel followed by /a/, i.e., /ia/, /iːa/, /ɯa/, /ɯːa/, /ua/, and /uːa/.
Although Northern Thai and Standard Thai are tonal languages, they have different tonal systems. There are different phonologically contrasting tones in which Northern Thai has six tones while Standard Thai has only five tones [51,52,53]. Among these differences, there is a tone that appears only in Northern Thai, which is a high-level tone with glottal closure.

3.2. Lexical Items

Although Northern Thai and Standard Thai are Tai languages, both languages have developed a large number of their own lexical items. As a result, there are differences in lexical characteristics in both languages. Considering the form and meaning of words, the lexical differences between Northern Thai and Standard Thai can be divided into two categories: lexical items that sound the same but have different meanings and lexical items that have the same meaning but are pronounced differently [53].
Lexical items in the first category are understood as homonyms. They have the same consonants, vowels and tones in both languages but have different meanings, or, in Northern Thai, may have a broader meaning than in Standard Thai. Examples include:
หัน [hǎn]: ‘to see’ (NT), ‘to turn away’ (CT)
แก่น [kɛ̀n]: ‘core’ (both); also ‘classifier for egg/fruit’ (NT only)
For lexical items that have the same meaning but are pronounced differently, they are usually nouns used to refer to plants, animals, organs, things, or tools:
‘peppermint’: NT หอมด่วน [hǒ:m.dù:an], CT สะระแหน่ [sa.ra.nɛ̀:]
‘parrot’: NT นกแล [nók.lɛː], CT นกแก้ว [nók.kɛ̂:w]
‘nose’: NT ดัง [daŋ], CT จมูก [tɕ.mù:k]
‘belt’: NT สายฮ้าง [sǎ:j.há:ŋ], CT เข็มขัด [khěm.khàt]
Considering the aforementioned differences, it is obvious that an ASR system built for Standard Thai might not work optimally for Northern Thai. This prompts the development of the Northern Thai dialect speech recognition system in this study. We expect that incorporating the key linguistic features in the form of the tonal information and Northern Thai language model could improve recognition performance of the existing ASR systems that we are building upon.

4. Methodology

In this Section, we introduce the workflow of the ASR systems under investigation. Figure 3 summarises the two configurations, from input stage to output stage, that we tested in the experiments. In the input stage, two corpus options are possible: original and corrected. Next, auxiliary information in the form of tone can be incorporated into the HMM-based model. In the modelling stage, an HMM-based model (Kaldi) and an end-to-end model (XLS-R) are considered. A language model can be employed in the decoding stage before we obtain the predicted transcription, from which the recognition performance of any ASR configuration can be measured. We now delve into the details of the input stage (Section 4.1), the HMM-based model (Section 4.2) and the end-to-end model (Section 4.3).

4.1. Northern Thai Dialect Corpus

The Northern Thai dialect speech corpus, which is a part of PaSCoNT [54], was collected from two hundred native speakers who have been living in Chiang Mai province for more than 18 years. The corpus is composed of speech data from daily dialogues containing commands, requests, questions, and narratives about daily life, weather, travel, making merit, product prices, route inquiries, requests for help, general storytelling (such as songs, dramas, movies, singers and actors), and news [54]. The speakers are 100 males and 100 females of different age groups between 18 and 75 years old. Each speaker was assigned 100 Northern Thai dialect sentences. In total, 50 h of speech across 20,000 files were recorded. The audio recordings were made at a 44 kHz sample rate, 16-bit mono, without any sound compression, using Rode microphones. The microphone was kept 3 to 4 inches from the speaker’s mouth. In this study, voices were recorded in both open and closed environments.
After the speech recording was finished, all WAV files were carefully listened to by a native speaker of Northern Thai with linguistic training in Northern Thai. Annotation was performed by the same linguist, who checked the correspondence between the text transcription and the speech of each utterance and edited the text transcription to align with the recorded speech. This is necessary as some words could be accidentally omitted or inserted by the speakers during recording.
We note that the annotator was blinded to the identity of the speakers. The transcriptions were then randomly sampled and verified by the authors. All text transcriptions of utterances were transcribed with the International Phonetic Alphabet (IPA) to express their pronunciation. In the phonetic transcription of utterances, the tonal annotation was integrated into words, and a hyphen was inserted between syllables in words containing more than one syllable to indicate syllable boundaries. Further details regarding data annotation and transcription are provided in [54]. Ultimately, we ended up with two speech corpora differing only in their transcriptions. The transcription of the first corpus is taken directly from the text script given to the speakers; we refer to this corpus as original. The transcription of the second corpus was manually checked and corrected by the experts; we refer to this corpus as corrected. These two corpora will be used to investigate the impact of corpus quality on the ASR.
It is worth noting that Thai is an unsegmented language, which has no boundary markers between words. Moreover, word segmentation in Northern Thai and Standard Thai differs for some words; for example, “ya-pai” in Standard Thai corresponds to “ja-pai-pai” in Northern Thai. Therefore, the corrected transcriptions described above were manually segmented by the expert. The segmented transcriptions can then be used for constructing language models. We also note that the IPA symbols were used as the standard phoneme set. From this, we created the pronunciation dictionary (lexicon) of the Northern Thai dialect; the segmented corpus consists of 450,674 word tokens and 3329 distinct vocabulary items.
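To illustrate the lexicon format, the sketch below writes a Kaldi-style pronunciation dictionary from word-to-IPA mappings in which syllables are hyphen-separated, as in our transcriptions; the two entries and the file name are placeholders rather than items from the actual 3329-word lexicon, and each hyphen-separated syllable is treated as one pronunciation token for brevity.

```python
# Map each Northern Thai word to its IPA pronunciation; syllables in
# multi-syllable words are hyphen-separated, as in the corpus transcriptions.
lexicon = {
    "หัน": "hǎn",
    "หอมด่วน": "hǒːm-dùːan",
}

# A Kaldi lexicon expects one "word pron-token pron-token ..." line per entry.
with open("lexicon.txt", "w", encoding="utf-8") as f:
    for word, pron in sorted(lexicon.items()):
        tokens = pron.replace("-", " ")   # split syllables into separate tokens
        f.write(f"{word} {tokens}\n")
```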

4.2. Hidden Markov Model-Based ASR

The first ASR system is implemented using the Kaldi toolkit [44], which can accommodate both language modelling and acoustic modelling. The decoder is based on finite-state transducers (FSTs) and incorporates the language model, the pronunciation dictionary (lexicon), and context dependency into a single decoding graph.
In this study, we developed two types of acoustic models: one using a regular hidden Markov model (HMM) and the other a DNN-HMM based on a time-delay neural network (TDNN) architecture [55,56,57]. This architecture constructs larger networks from sub-components in a modular and progressive manner and has been demonstrated to efficiently learn the temporal dynamics of the signal from short-term feature representations. As a starting point, we employed a Gaussian mixture model (GMM) with a tri-state HMM topology whose probability density functions are tied through a phonetic decision tree. The 13 MFCCs plus pitch information serve as the input feature basis for the acoustic models. The details of the training and the hyperparameter settings are described in Appendix A, Section Appendix A.1.
We utilised the SRILM toolkit [58] with modified Kneser–Ney (KN) discounting [59] to train 3-, 4-, and 5-gram language models (LMs). KN is based on absolute-discounting interpolation, which reallocates some probability mass from higher-order n-grams (e.g., 4-grams or 3-grams) to lower-order distributions. This smoothing technique constructs a lower-order distribution that is consistent with the smoothed higher-order distribution. The training data for the LMs contains 450,674 words from the Northern Thai dialect speech corpus. Figure 4 presents an overview of the HMM-based model as implemented using the Kaldi toolkit.
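For reference, the sketch below shows how n-gram LMs of this kind can be trained by invoking SRILM’s ngram-count tool from Python; the file names are placeholders and we assume SRILM is installed and on the PATH.

```python
import subprocess

# Train 3-, 4- and 5-gram LMs with modified Kneser-Ney discounting.
# corpus.txt is assumed to contain one segmented Northern Thai sentence per line.
for order in (3, 4, 5):
    subprocess.run(
        ["ngram-count",
         "-order", str(order),
         "-kndiscount", "-interpolate",     # modified Kneser-Ney smoothing
         "-text", "corpus.txt",
         "-lm", f"northern_thai_{order}gram.arpa"],
        check=True,
    )
```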

4.3. Deep Neural Network-Based ASR

We employed five variations of deep neural networks with varying numbers of parameters and sizes of pre-training data. All of the models were trained in an end-to-end manner: that is, the model receives a raw speech signal as input and outputs character-based transcriptions. The models under consideration are all based on the current state-of-the-art XLSR and XLS-R models. We note that XLSR and XLS-R share a similar network architecture; that is, they have the same settings for the number of blocks, the number of hidden states, the inner dimension of the feed-forward blocks, the number of attention heads, and the total number of parameters [36]. The difference between XLSR and XLS-R lies in the size and the number of languages in the pre-training corpus. XLSR was trained on 50 h of public speech data in 53 languages, while XLS-R was trained on 436K hours of speech audio in 128 languages.
As such, we shall refer to both models as XLS-R. We experiment with XLS-R models of 300 million and 1 billion parameters. Larger models generally yield better performance, provided that there is sufficient data to train them. With limited data, as in our case, the effect of model size on recognition performance remains to be investigated.
To alleviate the negative effect of small data, we consider XLS-R models pre-trained on publicly available speech corpora covering 53 and 128 languages. We also consider an XLS-R pre-trained on a closely related language, namely Standard Thai. This is to examine the degree to which pre-training can alleviate the data requirement of deep learning models.
Lastly, we also incorporate word-level n-gram language models of order 3 to 5 to further improve the transcriptions from the character-based end-to-end models. For a fair comparison, the language models are the same as those used in the HMM-based ASR. We summarise all the end-to-end deep learning models in Table 1.
The details of fine-tuning the models and the framework of the proposed end-to-end model (XLS-R) for the Northern Thai dialect are presented in Appendix A, Section Appendix A.2. Figure 5 illustrates the XLS-R model employed in this study.

5. Experiments

5.1. Experimental Setup

5.1.1. Datasets

The experiment utilises the two variations of the Northern Thai corpus described in Section 4.1, namely the original and the corrected datasets. Each dataset contains 50 h of audio (.wav files) and the corresponding transcriptions. The corpus was randomly divided into two subsets, a training set and a test set, resulting in 18,000 speech files for training and 2000 files for testing.

5.1.2. Evaluation Metric

We compare and evaluate the recognition performance of the ASR systems using the word error rate (WER) metric. The WER is defined as
$$\mathrm{WER} = \frac{D + S + I}{N} \times 100$$
where N is the number of words in the test sentence, D is the number of deletion errors, S is the number of substitution errors, and I is the number of insertion errors in the test sentence. To compute the WER, the reference sentences (ground truth) and the output text transcriptions from the ASR are first tokenised into lists of tokens. By comparing the tokens of the two strings, the numbers of deletions (D), substitutions (S), and insertions (I) for each sentence can be computed. (We employed the jiwer package, available at https://jitsi.github.io/jiwer/ (accessed on 27 October 2023), to facilitate the calculation of the WER.)
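The computation can be sketched as follows with the jiwer package mentioned above; the reference and hypothesis strings are toy examples and are assumed to be already word-segmented with spaces.

```python
import jiwer

# Reference (ground truth) and ASR output, already word-segmented with spaces.
reference = "ไป ตลาด วัน นี้"
hypothesis = "ไป กาด วัน นี้"

# jiwer tokenises on whitespace and computes (D + S + I) / N.
wer = jiwer.wer(reference, hypothesis)
print(f"WER = {wer * 100:.2f}%")  # one substitution out of four words -> 25.00%
```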
We designed the experiments to investigate (1) the effect of the correctness of the transcription on the performance of the ASR systems, (2) the effect of auxiliary information, in the form of pitch information for the classical system and cross-language pre-training for the end-to-end system, and (3) the effect of the language model for the end-to-end system. The results from these experiments should enable us to pick the most effective ASR configuration for our purpose.

5.2. Results

5.2.1. Effect of Transcription Quality

First, we study the effect of the quality of the transcription on the recognition performance of the ASR systems. As mentioned earlier, it was rather difficult to ensure that all of the volunteers read the provided scripts perfectly, without inserting extra utterances or omitting some. As a result, the original script may not correspond one-to-one to the recorded acoustic signal. This can be regarded as mildly noisy supervision in the supervised learning literature. To see the effect of possibly noisy transcriptions, we trained all of the ASR models using two sets of data. The first one, named original, takes the original script as the transcription, while the second one, named corrected, uses the manually transcribed script as the transcription. The numbers of words inserted, deleted, and substituted during the manual transcription process were 182, 185, and 21,262, respectively.
Table 2 summarises the results obtained from various configurations of the HMM-based systems. Setting aside the effect of noisy transcription, we first observe that the results agree with the general understanding of classical ASR models in the sense that the more complex acoustic model, e.g., TDNN-HMM, often yields a better WER. Likewise, the more elaborate the acoustic features are, the better the WER we can expect. This is already a known result, but it serves well as a check that our implementation of the recipe is in the right direction. We can also observe that using higher-order language models contributes to a better WER, e.g., 5.98 for TDNN-HMM with a 3-gram LM compared to 5.81 with a 5-gram LM.
To test whether the performance difference between the group of models trained on the original data and those trained on the corrected data is statistically significant, we employed a two-way analysis of variance (two-way ANOVA) to examine the impact of transcription quality on ASR performance. A p-value of less than 0.001 from the ANOVA test indicates that the average word error rate difference between the two groups was statistically significant. It is thus evident that the quality of the corpus has an impact on the WER of the classical HMM-based systems. We can see that the WERs of the systems trained on the original script are almost always higher than those trained on the manually corrected transcriptions. This is not surprising, as the phenomenon resembles learning from noisy labels in other supervised learning tasks [60]. We may therefore conclude that the quality of the transcription does affect the performance of the HMM-based ASR models.
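A test of this kind can be reproduced along the lines of the sketch below using statsmodels; the data frame layout (one WER per model configuration and transcription version) and the WER values are placeholders illustrating the procedure, not the numbers from Table 2.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per (configuration, transcription version); placeholder WER values.
df = pd.DataFrame({
    "wer":     [5.98, 5.81, 6.40, 6.21, 7.10, 6.95, 7.52, 7.30],
    "quality": ["corrected", "corrected", "original", "original"] * 2,
    "config":  ["tdnn_3g", "tdnn_5g"] * 4,
})

# Two-way ANOVA with transcription quality and model configuration as factors
# (main effects only in this toy example).
model = ols("wer ~ C(quality) + C(config)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```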
Interestingly, the same conclusion cannot be drawn in the case of the end-to-end systems. Table 3 summarises the results from some of the end-to-end models. A one-way analysis of variance was employed to test the null hypothesis that training on either the original or the corrected data yields similar results. The one-way ANOVA returned a p-value of 0.328, indicating that the null hypothesis cannot be rejected and that the average word error rate is not affected by the mildly noisy transcription in the case of the end-to-end model. From the results we can see that the quality of the corpus did not have an impact on the XLS-R models in all cases except for XLS-R-300-53, which is the smallest model in our study. We speculate that the mildly noisy transcription plays a regularisation role in the training of the deep learning model and improves the generalisation performance. A similar observation has been made for general classification tasks as well [61].

5.2.2. Effect of Additional Tonal Information

We have seen from Table 2 that the set of acoustic features does affect the performance of the HMM-based ASR systems. Since Northern Thai is a tonal language with six different tones, we proceed to investigate whether incorporating tonal information (or pitch information) into the model improves the overall recognition performance. Again, a p-value of less than 0.001 from the two-way ANOVA test indicates that the average word error rate difference between the HMM-based models with and without pitch information was statistically significant. The comparative results of the HMM-based models with and without pitch information are summarised in Table 4. Note that some of the results are quoted from Table 2.
We notice that the HMM-based ASR models that considered the pitch information produced a lower WER than those that did not in all configurations. Overall, we may conclude that the traditional approach, as represented by the TDNN-HMM model using pitch information, trained on the transcription data with a 5-gram LM, produced the best WER of 4.09. We note that the analysis in this Section is only relevant to the HMM-based ASR and not to the end-to-end ASR. This is because the writing system of the Northern Thai language explicitly includes tonal information in the form of tone characters. The end-to-end ASR is therefore, by default, trained with tonal information embedded in the transcriptions, i.e., the output labels.

5.2.3. Effect of Model Pre-Training

We now consider a more modern approach to ASR using deep neural networks trained in an end-to-end fashion. It is generally agreed that deep learning models are data intensive and, when applied to low-resource ASR tasks, might not work as well as the traditional HMM-based models. One way to overcome the limitation on corpus size is to rely on pre-training, which pre-trains the model on other languages so that the network learns latent representations of acoustic signals shared across a wide variety of languages. It is then reasonable to ask how the size and language coverage of the pre-training corpus affect the recognition performance on the downstream task, i.e., the Northern Thai dialect recognition task.
To quantify the effect, we tested a set of XLS-R-based models, each pre-trained on a different corpus varying in size and language coverage. The results are summarised in Table 5.
The results showed that the XLS-R with 300 million parameters (XLS-R-300-53-TH) that was pre-trained on the aithai corpus, a 255 h Thai speech corpus embedded within the Common Voice corpus 7.0 (available at https://commonvoice.mozilla.org/en/datasets, accessed on 29 May 2024), produced a lower WER than the other XLS-R models of similar size. It is better than XLS-R-300-53, which was pre-trained on 53 languages not including Thai. It is also superior to XLS-R-300-128, which was pre-trained on 128 languages, of which only 57 h were Thai speech. This supports the view that pre-training the model using languages from the same family is preferable.

5.2.4. Effect of Language Model

Our preliminary results suggest that the output of XLS-R, which works in a character-based manner, i.e., it outputs one character at a time, can be further improved with the help of a language model. For this purpose, we employed n-gram LMs of different orders and present the results in Table 6.
The results showed that the LM helps improve the recognition performance of the ASR model compared to the model without an LM. However, the order of the n-gram model did not have a significant impact on the WER. Nonetheless, the XLS-R-1000-128 with a 5-gram LM reduces the WER from 1.53 to 0.94. This suggests that applying the n-gram model on top of XLS-R during CTC beam search decoding is beneficial. Similar results can also be observed in the case of the HMM-based ASR, as shown in Table 4.
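One way to realise this combination in practice (not necessarily the exact decoder used in this study) is the pyctcdecode library, which fuses a KenLM/ARPA-format n-gram LM into CTC beam search over the model’s character logits; the label list, LM path, random logits and LM weights below are assumptions for illustration.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Output labels of the fine-tuned CTC model, in logit order; "<pad>" serves as
# the CTC blank and "|" as the word delimiter (placeholder character set).
labels = ["<pad>", "<s>", "</s>", "<unk>", "|", "ไ", "ป", "ก", "า", "ด"]

# Fuse a word-level n-gram LM into CTC beam search; alpha weights the LM score
# and beta the word insertion bonus.
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="northern_thai_5gram.arpa",  # assumed LM file
    alpha=0.5,
    beta=1.0,
)

# logits: per-frame CTC scores from the end-to-end model, shape (time, vocab).
logits = np.random.randn(50, len(labels))  # placeholder logits
print(decoder.decode(logits))
```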

5.2.5. Discussion

By incorporating all of the ingredients that contribute to the better performance of each approach, we are in a good position to compare the two approaches to ASR for Northern Thai dialect speech recognition. It is clear from the presented empirical results that both approaches achieve their best WER with a 5-gram LM. However, the WER of the HMM-based approach is still higher than that of the end-to-end approach (4.09 versus 0.94). Since the performance difference is quite large, it is worth inspecting the actual errors, namely the numbers of deletion, substitution, and insertion errors in the test sentences incurred by the two approaches, shown in Table 7.
The results showed that the HMM-based model made more substitution and deletion mistakes than the end-to-end approach. We found that the TDNN-HMM tended to transcribe the phoneme precisely as it was pronounced; when the pronunciation did not match the ground truth transcription, this was counted as an error. We speculate that this is because the acoustic model in the classical ASR works in isolation from the other components during testing, and the acoustic model seems to have a greater impact on the prediction as it sits at the first step of the ASR pipeline. On the other hand, the XLS-R model transcribed the phoneme by considering the acoustic features and language features simultaneously. We present some illustrative example sentences from the corrected corpus in Figure 6.
Moreover, the classical ASR tends to omit words from the sentence (deletion errors). This could stem from the size of the training corpus, such that the resulting LM was not extensive enough to cover the language patterns in the test data. Meanwhile, the XLS-R incurs fewer whole-word deletion errors. We conducted a further analysis of the substitution errors presented in Table 7 to identify specific error patterns, including tone misrecognition, vowel confusion, character confusion, and word misrecognition. We found that the most common error patterns of the XLS-R model are tone misrecognitions and character confusions, at 40.29% and 32.58%, while the error patterns of the HMM-based model are mostly word misrecognitions and vowel confusions, at 38.01% and 25.98%.
In summary, XLS-R still suffers from character-level transcription errors. This is expected, since XLS-R predicts words at the character level, so some characters can be missing from the transcribed word.
As a final remark, the end-to-end ASR demonstrated significantly better recognition performance under the tested configurations, and we see that the end-to-end approach is less sensitive to noisy transcription. This is desirable from a data collection standpoint in the sense that we can quickly collect large amounts of training data without having to ensure a perfect match between the speech signal and its transcription. Further, the ability to attain better WER by pre-training the model from closely related higher-resource languages is another attractive feature of the end-to-end ASR system from a low-resource language perspective.
The primary advantage lies in XLS-R’s self-supervised pre-training approach. By learning from massive amounts of multilingual speech data, including tonal languages like Mandarin, XLS-R develops acoustic representations that naturally capture the subtle phonetic variations present in dialectal speech. This data-driven feature learning proves far more effective than the hand-crafted acoustic features used in our TDNN-HMM baseline, even when augmented with carefully engineered i-vectors and pitch features. The model essentially learns to represent the acoustic space in a way that is inherently robust to the systematic variations that characterise different dialects.

6. Conclusions

We studied two major approaches to ASR for the low-resource Northern Thai dialect. The traditional approach was implemented using the Kaldi toolkit, while we adopted the XLS-R model for the end-to-end approach. We investigated the ingredients that contribute to the performance of the system, namely corpus quality, tonal information, model pre-training and the augmentation with a language model. From the experimental results, we found that (1) transcription quality did improve the performance of the HMM-based systems but not that of the end-to-end systems; (2) tonal information can be employed to improve the recognition performance of the HMM-based system; (3) pre-training the end-to-end model on closely related languages generally helps the performance of the end-to-end systems; and (4) auxiliary information in the form of a language model also improves the performance of the end-to-end systems. Ultimately, we learned that an XLS-R-based end-to-end system pre-trained on a Standard Thai speech corpus with a 5-gram LM, trained on the mildly noisy corpus, yields the best performance of 0.94 WER.
In future work, we plan to pursue research in the direction of personalised ASR systems. This can be achieved using a dynamic language model, which adapts to the context and the user’s personality. Auxiliary information from the speech signal, such as entropy-based features [62], will be useful for differentiating a speaker’s personality. Emotion embedded in the speech, which can be detected based on a wavelet representation of the signal [63], can also be employed to make the language model more adaptive.

Author Contributions

Conceptualisation, J.B., P.T., J.C., S.T. and V.C.; data collection and corpus design, P.T. and S.T.; methodology, V.C., J.B. and P.T.; data curation, P.T.; implementation, experiments and data analysis, J.B. and P.T.; software, P.I., A.B. and B.J.M.S.; writing—original draft preparation, J.B., P.T., S.T. and V.C.; writing—review and editing, J.B., P.T. and P.I.; supervision: P.T.; project administration and funding acquisition, P.T. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the NSRF via the Program Management Unit for Human Resources and Institutional Development, Research and Innovation [grant number B04G640073].

Institutional Review Board Statement

This project was reviewed and approved by the Chiang Mai University Research Ethics Committee.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The Northern Thai Dialect speech corpus will be made available upon request.

Acknowledgments

This research has received funding support from the NSRF via the Program Management Unit for Human Resources and Institutional Development, Research and Innovation [grant number B04G640073]. This research was also supported by Chiang Mai University and the National Science and Technology Development Agency (NSTDA).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

In the HMM-based ASR, the recipe starts with a monophone model to acquire the monophone alignments for training the first ∆ + ∆∆ triphone model. The second triphone model splices together seven feature vectors through an LDA transform as its input and adjusts the model through an MLLT. The LDA distinguishes feature vectors from various classes by reducing the dimensionality. This reduced feature space is used to build a transform specific to each speaker. The MLLT takes the reduced feature space from the LDA and derives a unique transformation for each speaker. MLLT can be viewed as a feature-space transform, i.e., a model state transformation technique for estimating a global covariance matrix. It allows a minimal number of full covariance matrices to be shared across GMM distributions [24]. In this study, we employ 35 Viterbi training iterations while periodically updating the MLLT transformation matrix four times. The third triphone model integrates speaker adaptive training (SAT) [64] with LDA and MLLT to perform speaker and noise normalisation by adapting to each specific speaker with a particular data transform. SAT aims to compensate for speaker differences in the training set by normalising the impact of variation in the acoustic properties of different speakers when training the acoustic models. It essentially ignores inter-speaker variability and instead models intra-speaker variability. Individual speaker characteristics are modelled by linear transformations of the mean parameters of the acoustic models.
We used lattice-free maximum mutual information (LF-MMI) training [65] to build an acoustic model based on a TDNN baseline as part of the DNN-HMM. LF-MMI trains the neural network using a sentence-level posterior based on alignments from a DNN-HMM model. The DNN is used to represent the HMM’s observation likelihood. Given that we aim to construct context-dependent phone recognition, we first perform an alignment pass with a GMM-HMM system to generate output labels for the neural network. To extract acoustic features, spectral analysis was applied to the speech waveform using a 25-millisecond frame length with a 10-millisecond window shift. Non-adapted, non-normalised 40-dimensional MFCCs are spliced over a window of three frames to the left and right, plus pitch information and a 100-dimensional i-vector [66] representing speaker characteristics. The network is made up of nine TDNN layers, each having 768 nodes. All layers employ a ReLU activation function with batch normalisation, except for the final layer, which employs the softmax activation function. The splicing indexes used are {−1, 0, 1}, {−1, 0, 1}, {−3, 0, 3}, {−3, 0, 3} and {−6, −3, 0}.

Appendix A.2

For XLS-R, we fine-tuned each model for 10,000 steps with a 500-step warm-up, by which point the model had sufficiently converged. The feature extractor was frozen throughout the fine-tuning phase. The AdamW optimiser [67] was employed to optimise the CTC loss with the empirically chosen parameters β1 = 0.9, β2 = 0.999, and ϵ = 1 × 10−8. The learning rate was set to 1 × 10−5 and the dropout rate was 0.1. All other hyper-parameters were left at their default values. We used the above hyper-parameter settings for all of the experiments.
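These settings correspond roughly to the Hugging Face TrainingArguments shown in the sketch below; the output directory and batch size are not specified in the paper and are assumptions, and the 0.1 dropout is set on the model rather than here.

```python
from transformers import TrainingArguments

# Hyper-parameters mirroring Appendix A.2: 10,000 fine-tuning steps with a
# 500-step warm-up, AdamW with lr = 1e-5 and the stated beta/epsilon values.
training_args = TrainingArguments(
    output_dir="xlsr-northern-thai",   # assumed output location
    max_steps=10_000,
    warmup_steps=500,
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    per_device_train_batch_size=8,     # assumed; not reported in the paper
    evaluation_strategy="steps",
    save_steps=1000,
    logging_steps=100,
)
```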

References

  1. Davis, K.; Biddulph, R.; Balashek, S. Automatic recognition of spoken digits. J. Acoust. Soc. Am. 1952, 24, 637–642. [Google Scholar] [CrossRef]
  2. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
  3. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar] [CrossRef]
  4. Lecouteux, B.; Vacher, M.; Portet, F. Distant speech recognition in a smart home: Comparison of several multisource ASRs in realistic conditions. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (ISCA), Florence, Italy, 27–31 August 2011; pp. 2273–2276. [Google Scholar]
  5. Beh, J.; Baran, R.H.; Ko, H. Dual channel-based speech enhancement using novelty filter for robust speech recognition in automobile environment. IEEE Trans. Consum. Electron. 2006, 52, 583–589. [Google Scholar] [CrossRef]
  6. Johnson, M.; Lapkin, S.; Long, V.; Sanchez, P.; Suominen, H.; Basilakis, J.; Dawson, L. A systematic review of speech recognition technology in health care. BMC Med. Inform. Decis. Mak. 2014, 14, 94. [Google Scholar] [CrossRef] [PubMed]
  7. Yamazaki, T.; Yoshikawa, K.; Kawamoto, T.; Ohagi, M.; Mizumoto, T.; Ichimura, S.; Kida, Y.; Sato, T. Tourist Guidance Robot Based on HyperCLOVA. arXiv 2022, arXiv:2210.10400. [Google Scholar]
  8. Suwanbandit, A.; Naowarat, B.; Sangpetch, O.; Chuangsuwanich, E. Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 4069–4073. [Google Scholar] [CrossRef]
  9. Chaiwongsai, J.; Chiracharit, W.; Chamnongthai, K.; Miyanaga, Y. An architecture of HMM-based isolated-word speech recognition with tone detection function. In Proceedings of the 2008 International Symposium on Intelligent Signal Processing and Communications Systems, Bangkok, Thailand, 15–17 August 2009; pp. 1–4. [Google Scholar]
  10. Kantithammakorn, P.; Punyabukkana, P.; Pratanwanich, P.N.; Hemrungrojn, S.; Chunharas, C.; Wanvarie, D. Using Automatic Speech Recognition to Assess Thai Speech Language Fluency in the Montreal Cognitive Assessment (MoCA). Sensors 2022, 22, 1583. [Google Scholar] [CrossRef] [PubMed]
  11. Kamath, U.; Liu, J.; Whitaker, J. Deep Learning for NLP and Speech Recognition; Springer: Cham, Switzerland, 2019; Volume 84. [Google Scholar]
  12. Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech recognition using deep neural networks: A systematic review. IEEE Access 2019, 7, 19143–19165. [Google Scholar] [CrossRef]
  13. Qin, S.; Wang, L.; Li, S.; Dang, J.; Pan, L. Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling. J. Audio Speech Music. Process. 2022, 2022, 2. [Google Scholar] [CrossRef]
  14. Bekarystankyzy, A.; Mamyrbayev, O.; Mendes, M.; Fazylzhanova, A.; Assam, M. Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets. Sci. Rep. 2024, 14, 13835. [Google Scholar] [CrossRef]
  15. Diwan, A.; Jyothi, P. Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 3445–3449. [Google Scholar]
  16. Yue, X.; Lee, G.; Yılmaz, E.; Deng, F.; Li, H. End-to-End Code-Switching ASR for Low Resourced Language Pairs. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 972–979. [Google Scholar]
17. Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; John Wiley & Sons: New York, NY, USA, 1973. [Google Scholar]
  18. Jelinek, F. Statistical Methods for Speech Recognition; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
19. Rabiner, L.R.; Juang, B.-H.; Lee, C.-H. An Overview of Automatic Speech Recognition. In Automatic Speech and Speaker Recognition: Advanced Topics; Lee, C.-H., Soong, F.K., Paliwal, K.K., Eds.; Springer: Boston, MA, USA, 1996; pp. 1–30. [Google Scholar]
20. Davis, S.; Mermelstein, P. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
  21. Haeb-Umbach, R.; Ney, H. Linear discriminant analysis for improved large vocabulary continuous speech recognition. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech and Signal Processing, San Francisco, CA, USA, 23–26 March 1992; pp. 13–16. [Google Scholar]
  22. Gopinath, R.A. Maximum likelihood modeling with Gaussian distributions for classification. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98, Seattle, WA, USA, 12–15 May 1998; pp. 661–664. [Google Scholar]
  23. Gales, M.J.F. Semi-tied covariance matrices for hidden Markov models. IEEE Trans. Speech Audio Process. 1999, 7, 272–281. [Google Scholar] [CrossRef]
  24. Gales, M.J.F. Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 1998, 12, 75–98. [Google Scholar] [CrossRef]
  25. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
26. Gales, M.; Young, S. The Application of Hidden Markov Models in Speech Recognition. Found. Trends Signal Process. 2008, 1, 195–304. [Google Scholar] [CrossRef]
  27. Bourlard, H.A.; Morgan, N. Connectionist Speech Recognition: A Hybrid Approach; Kluwer Academic Publishers: New York, NY, USA, 1993. [Google Scholar]
  28. Bansal, P.; Kant, A.; Kumar, S.; Sharda, A.; Gupta, S. Improved hybrid model of HMM/GMM for speech recognition. Intell. Technol. Appl. 2008, 2, 69–74. [Google Scholar]
  29. Rasipuram, R.; Magimai-Doss, M. Acoustic and lexical resource constrained ASR using language-independent acoustic model and language-dependent probabilistic lexical model. Speech Commun. 2015, 68, 23–40. [Google Scholar] [CrossRef]
  30. Clarkson, P.; Rosenfeld, R. Statistical language modeling using the CMU-cambridge toolkit. In Proceedings of the 5th European Conference on Speech Communication and Technology, Rhodes, Greece, 22–25 September 1997; pp. 2707–2710. [Google Scholar]
  31. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 6–12 December 2020; pp. 12449–12460. [Google Scholar]
32. Stefanel Gris, L.R.; Casanova, E.; de Oliveira, F.S.; da Silva Soares, A.; Candido Junior, A. Brazilian Portuguese speech recognition using wav2vec 2.0. In Proceedings of the International Conference on Computational Processing of the Portuguese Language, Fortaleza, Brazil, 21–23 March 2022; pp. 333–343. [Google Scholar]
  33. Showrav, T.T. An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning. arXiv 2022, arXiv:2209.08119. [Google Scholar] [CrossRef]
  34. Wongpatikaseree, K.; Singkul, S.; Hnoohom, N.; Yuenyong, S. Real Time End-to-End Speech Emotion Recognition with Cross-Domain Adaptation. Big Data Cogn. Comput. 2022, 6, 79. [Google Scholar] [CrossRef]
  35. Conneau, A.; Baevski, A.; Collobert, R.; Mohamed, A.; Auli, M. Unsupervised cross-lingual representation learning for speech recognition. arXiv 2020, arXiv:2006.13979. [Google Scholar] [CrossRef]
36. Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv 2021, arXiv:2111.09296. [Google Scholar]
  37. Wang, D.; Zheng, T.F. Transfer learning for speech and language processing. In Proceedings of the 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hong Kong, China, 16–19 December 2015; pp. 1225–1237. [Google Scholar]
  38. Kunze, J.; Kirsch, L.; Kurenkov, I.; Krug, A.; Johannsmeier, J.; Stober, S. Transfer learning for speech recognition on a budget. arXiv 2017, arXiv:1706.00290. [Google Scholar] [CrossRef]
  39. Qin, C.-X.; Qu, D.; Zhang, L.-H. Towards end-to-end speech recognition with transfer learning. Eurasip J. Audio Speech Music Process. 2018, 2018, 18. [Google Scholar] [CrossRef]
  40. Kornblith, S.; Shlens, J.; Le, Q.V. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2661–2671. [Google Scholar]
  41. Thimmaraja, Y.G.; Jayanna, H.S. Creating language and acoustic models using Kaldi to build an automatic speech recognition system for Kannada language. In Proceedings of the 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, India, 19–20 May 2017; pp. 161–165. [Google Scholar]
  42. Babu, L.B.; George, A.; Sreelakshmi, K.R.; Mary, L. Continuous Speech Recognition System for Malayalam Language Using Kaldi. In Proceedings of the 2018 International Conference on Emerging Trends and Innovations in Engineering and Technological Research (ICETIETR), Ernakulam, India, 11–13 July 2018; pp. 1–4. [Google Scholar]
  43. Guglani, J.; Mishra, A. Continuous Punjabi speech recognition model based on Kaldi ASR toolkit. Int. J. Speech Technol. 2018, 21, 211–216. [Google Scholar] [CrossRef]
  44. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlíček, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011. [Google Scholar]
  45. Zhang, F.; Xie, X.; Quan, X. Chinese Dialect Speech Recognition Based on End-to-end Machine Learning. In Proceedings of the 2022 International Conference on Machine Learning, Control, and Robotics (MLCR), Suzhou, China, 29–31 October 2022; pp. 14–18. [Google Scholar]
  46. Boberg, C.; Nerbonne, J.; Watt, D. The Handbook of Dialectology; John Wiley & Sons: Hoboken, NJ, USA, 2018. [Google Scholar]
  47. L-Thongkum, T. Thai Sounds: An Acoustic Study; Centenary academic development project; Chulalongkorn University: Bangkok, Thailand, 2011. [Google Scholar]
  48. Naksakul, K. Thai Sound System, 8th ed.; Academic publishing project; Faculty of Arts, Chulalongkorn University: Bangkok, Thailand, 2016. [Google Scholar]
  49. Tingsabadh, M.R.K.; Abramson, A.S. Thai. J. Int. Phon. Assoc. 1993, 23, 24–28. [Google Scholar] [CrossRef]
  50. Nanthakanok, K. Variation of Pronunciation of the Initial Consonant [n] in Northern Thai Dialect by Social Factors. J. Lang. Cult. Relig. 2019, 8, 16–32. [Google Scholar]
51. Saekho, Y. The Northern Thai Dictionary; Chiang Mai Rajabhat Institute: Chiang Mai, Thailand, 1996; pp. 1–6. [Google Scholar]
52. Teekachunhatean, R. Thai Dialects; Department of Thai, Faculty of Humanities, Chiang Mai University: Chiang Mai, Thailand, 2006. [Google Scholar]
  53. Kannika, V. Northern Thai Dialect, 3rd ed.; Department of Oriental Languages, Faculty of Archeology, Silpakorn University: Bangkok, Thailand, 2012. [Google Scholar]
54. Taerungruang, S.; Taninpong, P.; Chunwijitra, V.; Thatphithakkul, S.; Kasuriya, S.; Inthanon, V.; Paksaranuwat, P.; Thumronglaohapun, S.; Inkeaw, N.N.P.; Bootkrajang, J. PaSCoNT - Parallel Speech Corpus of Northern-central Thai for automatic speech recognition. Comput. Speech Lang. 2025, 89, 101692. [Google Scholar] [CrossRef]
  55. Waibel, A.H. Modular Construction of Time-Delay Neural Networks for Speech Recognition. Neural Comput. 1989, 1, 39–46. [Google Scholar] [CrossRef]
  56. Waibel, A.H.; Hanazawa, T.; Hinton, G.E.; Shikano, K.; Lang, K.J. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 328–339. [Google Scholar] [CrossRef]
  57. Peddinti, V.; Povey, D.; Khudanpur, S. A time delay neural network architecture for efficient modeling of long temporal contexts. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (ISCA), Dresden, Germany, 6–10 September 2015; pp. 3214–3218. [Google Scholar]
58. Stolcke, A. SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), Denver, CO, USA, 16–20 September 2002; pp. 901–904. [Google Scholar]
59. James, F.A. Modified Kneser-Ney Smoothing of n-Gram Models; Research Institute for Advanced Computer Science: Moffett Field, CA, USA, 2000. [Google Scholar]
  60. Bootkrajang, J.; Chaijaruwanich, J. Towards an improved label noise proportion estimation in small data: A Bayesian approach. Int. J. Mach. Learn. Cybern. 2022, 13, 851–867. [Google Scholar] [CrossRef]
  61. Wei, H.; Tao, L.; Xie, R.; An, B. Open-set label noise can improve robustness against inherent label noise. Adv. Neural Inf. Process. Syst. 2021, 34, 7978–7992. [Google Scholar]
  62. Guido, R.C. A tutorial review on entropy-based handcrafted feature extraction for information fusion. Inf. Fusion 2018, 41, 161–175. [Google Scholar] [CrossRef]
  63. Guido, R.C. Wavelets behind the scenes: Practical aspects, insights, and perspectives. Phys. Rep. 2022, 985, 1–23. [Google Scholar] [CrossRef]
  64. Anastasakos, T.; McDonough, J.; Makhoul, J. Speaker adaptive training: A maximum likelihood approach to speaker normalization. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 21–24 April 1997; pp. 1043–1046. [Google Scholar]
  65. Povey, D.; Peddinti, V.; Galvez, D.; Ghahremani, P.; Manohar, V.; Na, X.; Wang, Y.; Khudanpur, S. Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI. In Proceedings of the 17th annual conference of the International Speech Communication Association (ISCA), San Francisco, CA, USA, 8–12 September 2016; pp. 2751–2755. [Google Scholar]
  66. Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front End Factor Analysis for Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
  67. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Figure 1. The pipeline of the classical automatic speech recognition approach.
Figure 2. The deep learning (end-to-end) approach to automatic speech recognition.
Figure 3. A graph representing the workflow of this study. There are five stages: input, auxiliary information, model, decoding and output. A path corresponds to one possible configuration of the ASR systems investigated in this study.
Figure 4. The overview of the HMM-based model for Northern Thai dialect ASR. In our setting, auxiliary information in the form of tone (or pitch) was used to improve the recognition performance. The DNN-HMM is based on a TDNN architecture.
Figure 5. The overview of the proposed end-to-end model (XLS-R) for the Northern Thai dialect. The model is pre-trained from Standard Thai and is augmented with a language model in the decoding stage.
Figure 6. Examples of transcribed results from the two best-performing configurations of the two approaches, together with the ground truths (GT).
Table 1. The variations in XLS-R models employed in this study.
Model Name | # of Parameters | Pre-Training Corpus | Thai Corpus
XLS-R-300-53 | 300 million | 56k h of 53 languages | 0 h
XLS-R-300-53-TH | 300 million | 56k h of 53 languages | 255 h
XLS-R-300-128 | 300 million | 436k h of 128 languages | 57 h
XLS-R-1000-128 | 1000 million | 436k h of 128 languages | 57 h
XLS-R-1000-128-LM | 1000 million | 436k h of 128 languages | 57 h
Table 2. Comparison of word error rates (WER) of the classical HMM-based ASR learned from noisy corpus (original) and manually transcribed corpus (corrected).
Acoustic Model | Features | Original (3-grams / 4-grams / 5-grams) | Corrected (3-grams / 4-grams / 5-grams)
GMM-HMM | ∆ + ∆ triphone | 11.96 / 11.86 / 11.87 | 8.66 / 8.06 / 8.01
GMM-HMM | LDA + MLLT | 10.31 / 10.22 / 10.24 | 7.54 / 7.05 / 7.08
GMM-HMM | LDA + MLLT + SAT | 8.24 / 8.17 / 8.18 | 6.12 / 5.70 / 5.73
TDNN-HMM | MFCC + iVector | 5.98 / 5.98 / 5.81 | 4.98 / 4.49 / 4.53
Table 3. Comparison of word error rates (WER) of variations in XLS-R models trained on the original (possibly noisy) corpus and the manually corrected corpus.
Model | Original | Corrected
XLS-R-300-53 [17] | 4.83 | 4.67
XLS-R-300-128 [18] | 3.36 | 3.77
XLS-R-1000-128 [18] | 1.49 | 2.37
Table 4. Comparison of WER of the classical HMM-based ASR learned with and without pitch information. We employed the corrected transcription in this experiment.
Acoustic Model | Features | Without Pitch (3-grams / 4-grams / 5-grams) | With Pitch (3-grams / 4-grams / 5-grams)
GMM-HMM | ∆ + ∆ triphone | 8.66 / 8.06 / 8.01 | 8.09 / 7.60 / 7.57
GMM-HMM | LDA + MLLT | 7.54 / 7.05 / 7.08 | 7.34 / 6.89 / 6.89
GMM-HMM | LDA + MLLT + SAT | 6.12 / 5.70 / 5.73 | 6.01 / 5.61 / 5.64
TDNN-HMM | MFCC + iVector | 4.98 / 4.49 / 4.53 | 4.38 / 4.15 / 4.09
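The pitch-augmented systems in Table 4 append a frame-level fundamental-frequency (tone-related) track to the MFCC vectors before acoustic model training; in practice this is done inside the Kaldi feature pipeline. The fragment below is only a simplified Python approximation of the idea using librosa; the file name, frame shift and F0 search range are assumptions, not the paper's exact settings.

```python
# Illustrative only: approximate "MFCC + pitch" features with librosa.
# The actual systems use Kaldi's MFCC and pitch extractors; the 10 ms frame
# shift, 13 cepstral coefficients and 50-400 Hz F0 range are assumptions.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # placeholder audio file

hop = 160  # 10 ms frame shift at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)   # (13, T)
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr, hop_length=hop)        # (T,)

# Align lengths (the two trackers may differ by a frame) and stack so each
# frame carries spectral shape plus a tone-related pitch value.
T = min(mfcc.shape[1], f0.shape[0])
features = np.vstack([mfcc[:, :T], f0[np.newaxis, :T]])              # (14, T)
```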
Table 5. Comparison of WER of variations in XLS-R models pre-trained on different corpora.
Model | Pre-Training Languages | Thai Corpus Size | WER
XLS-R-300-53 [17] | 53 languages | 0 h | 4.83
XLS-R-300-128 [18] | 128 languages | 57 h | 3.36
XLS-R-300-53-TH | 53 languages + aithai | 255 h | 2.45
Table 6. Comparison of word error rates (WER) of variations in XLS-R models with and without n-gram LM.
Model | Language Model | WER
XLS-R-1000-128 | No LM | 1.53
XLS-R-1000-128 | 3-grams | 0.96
XLS-R-1000-128 | 4-grams | 0.95
XLS-R-1000-128 | 5-grams | 0.94
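The LM-augmented rows in Table 6 correspond to shallow fusion: during CTC beam search, candidate transcriptions are additionally scored by a Kneser-Ney smoothed n-gram model. A minimal sketch of such decoding is given below, assuming the pyctcdecode library and an ARPA-format LM file named lm_5gram.arpa; the toy label set, the random log-probabilities and the fusion weights are placeholders, not the authors' setup.

```python
# A minimal sketch of CTC beam-search decoding with n-gram shallow fusion
# via pyctcdecode. The label set, the random log-probability matrix and the
# LM file name ("lm_5gram.arpa") are assumed placeholders.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = ["", " ", "a", "b", "c"]   # index 0 is the CTC blank; " " separates words

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm_5gram.arpa",  # Kneser-Ney smoothed ARPA language model
    alpha=0.5,                         # language model weight
    beta=1.0,                          # word insertion bonus
)

# (time, vocab) matrix of frame-level log-probabilities from the acoustic model;
# replaced here by random values purely so the snippet runs end to end.
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(len(labels)), size=100)).astype(np.float32)

print(decoder.decode(log_probs))
```

The weights alpha and beta are normally tuned on held-out data; larger alpha leans more heavily on the n-gram model relative to the acoustic scores.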
Table 7. The number of errors that the end-to-end model and the HMM-based model with 3-grams LM made during transcription of the test data (corrected).
Type of Error | TDNN-HMM | XLS-R
Deletion | 327 | 142
Substitution | 1505 | 623
Insertion | 155 | 114
Total | 1987 | 879
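The error categories in Table 7 come from the Levenshtein alignment underlying WER: reference words that are dropped count as deletions, wrongly recognised words as substitutions, and spurious extra words as insertions. The self-contained sketch below shows a generic way to obtain such counts; it is illustrative and not the scoring tool used in the experiments.

```python
# Generic word-level scoring: counts the substitutions, deletions and
# insertions needed to align a hypothesis with its reference, as in Table 7.
def wer_counts(reference: str, hypothesis: str):
    ref, hyp = reference.split(), hypothesis.split()

    # dp[i][j] = minimum number of edits aligning ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Backtrack to attribute each edit to a substitution, deletion or insertion.
    i, j, subs, dels, ins = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += int(ref[i - 1] != hyp[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1   # a reference word is missing from the hypothesis
            i -= 1
        else:
            ins += 1    # the hypothesis contains an extra word
            j -= 1
    wer = (subs + dels + ins) / max(len(ref), 1)
    return subs, dels, ins, wer


print(wer_counts("the cat sat on the mat", "the cat sat mat here"))
```

On the toy pair in the example, the total number of edits is 3, giving a WER of 0.5; how the three edits split between the categories can depend on how alignment ties are broken.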