Lithuanian Speech Recognition Using Purely Phonetic Deep Learning

Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in computer science. A large majority of research in this area focuses on widely spoken languages such as English. The problems of automatic Lithuanian speech recognition have attracted little attention so far. Due to the complicated language structure and scarcity of data, models proposed for other languages such as English cannot be directly adopted for Lithuanian. In this paper we propose an ASR system for the Lithuanian language, which is based on deep learning methods and can identify spoken words purely from their phoneme sequences. Two encoder-decoder models are used to solve the ASR task: a traditional encoder-decoder model and a model with an attention mechanism. The performance of these models is evaluated on an isolated speech recognition task (with an accuracy of 0.993) and a long phrase recognition task (with an accuracy of 0.992).


Introduction
Endangered languages are an important part of the world's cultural heritage. Preservation of endangered languages may require speech recording, processing, and automatic recognition. Since the early days of modern computer science, automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field that requires huge volumes of speech data [1]. Over the years, a large majority of research conducted in this area focused on the most widely spoken languages, such as English, French, Mandarin Chinese, etc. [2,3]. Large corpora of speech data such as Librispeech [4] for English and AISHELL-1 [5] for Mandarin are available. End-to-end platforms for ASR and other speech-related tasks (speech-to-text apps, voice command interfaces, etc.) have been developed [6,7].
However, the so-called under-resourced languages have fewer speakers and, correspondingly, fewer spoken language resources are available for research [8]. One example is Lithuanian, one of the Baltic languages, with about 3.2 million speakers, which has received very little attention from researchers. This is partially due to its complicated language structure, the free order of words in a sentence, the scarcity of linguistic data, as well as historical reasons. Therefore, the methods and models developed for other languages such as English cannot be directly adopted for Lithuanian, and the demand for an ASR system capable of processing the Lithuanian language still exists.
Recent developments in the automatic speech recognition field, mainly fueled by the advancements made by deep learning methods, have shown impressive results even for under-resourced languages [9,10]. Based on these results we believe that the state-of-the-art deep learning methods can be used to create a high accuracy ASR system for the Lithuanian language. feature concatenation and model fusion with MFCC features and the hidden Markov model (HMM) modeling technique for Polish speech recognition. Pakoci et al. [38] used the n-gram model for tuning sequence-trained long short-term memory (LSTM)-based deep neural network for Serbian.
Sequences of graphemes and phonemes can be treated as categorical data. The set of unique phoneme (and grapheme) values is known in advance: it is defined while extracting the phoneme sequences and depends on the language of the dataset. Before passing categorical data to a statistical model, it is common to process it using one-hot encoding. Sutskever et al. [39] introduced the sequence-to-sequence model for machine translation, in which LSTM cells were used for the encoder and decoder parts of the model. Luong et al. [40] proposed another encoder-decoder model, which adds an attention mechanism and uses bidirectional LSTM layers in the encoder and decoder parts. This type of model is called the attention model.
To our knowledge, no prior studies have examined the task of Lithuanian ASR directly from phoneme sequences. This problem poses an interesting challenge, as the same phoneme (or a group of phonemes) can represent different graphemes. Furthermore, processing long phoneme sequences is a complicated task, because the pauses between uttered words depend solely on the talking speed of the speakers. In regular ASR systems, this task would be carried out by pronunciation and language models. However, the creation of these models is complicated work that requires a lot of linguistic resources. A phoneme-based ASR system would not only be much simpler, it would also allow existing acoustic models to be reused. We also believe that such a system would be appropriate for the Lithuanian language, as there is a strong connection between the phonological and morphological forms of words in Lithuanian.

Methods
In this section, we describe how current state-of-the-art deep learning methods, more specifically encoder-decoder models, can be modified and applied to perform automatic Lithuanian speech recognition from phoneme sequences. A comparison review conducted by Fayek et al. [41] illustrates the feasibility of feed-forward and recurrent neural network (RNN) architectures to be employed on languages with low linguistic digital resources. Finding a mapping between phoneme and grapheme sequences is not an easy task, as some phoneme sequences are shorter than their graphemic counterparts. This prevents us from using connectionist temporal classification (CTC) models, as CTC models work under the assumption that the length of the output sequence cannot be greater than the length of the input sequence [42].

Sequence-To-Sequence (Seq2Seq) Encoder-Decoder Models
First, we employed the idea of Sutskever et al. [39], who showed that a straightforward approach using two connected RNNs can solve general sequence-to-sequence problems. The main idea behind their encoder-decoder model is to use one RNN to read the input sequence, one step at a time, and create a fixed-dimensional representation (context vector), which a second RNN then uses to generate the output sequence purely from this vector representation. The encoder takes the input sequence $X = (x_1, x_2, \ldots, x_T)$ and converts it to a context vector $c$ using the set of hidden states of the RNN $(h_1, h_2, \ldots, h_T)$:
$$h_t = f(x_t, h_{t-1}) \tag{1}$$
$$c = q(\{h_1, \ldots, h_T\}) \tag{2}$$
where $h_t \in \mathbb{R}^n$ is the hidden state calculated using input sequence element $x_t$, and $q$ and $f$ are non-linear functions, for example RNNs. The purpose of the decoder is to predict the next element $y_t$ of the output sequence using the context vector $c$ and the sequence of previously generated elements $y_1, y_2, \ldots, y_{t-1}$. In other words, the decoder defines a probability over the generated output sequence $Y = (y_1, y_2, \ldots, y_T)$ by decomposing the joint probability $p(Y)$ into the ordered conditional probabilities:
$$p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) \tag{3}$$
Each conditional probability can be evaluated using an RNN:
$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \tag{4}$$
where $c$ is the context vector, $s_t$ is the hidden state of the decoder RNN for element $t$, and $g$ is the non-linear function of the RNN.
The graphical representation of a basic encoder-decoder model is given in Figure 1. The context vector is the last state of the encoder. Because these models are designed to work with variable-length sequences, they use two special characters: the sequence start character <SOS> and the sequence end character <EOS>. The sequence start symbol is used as the start signal for the decoding part, and once the model generates the sequence end symbol <EOS>, the generation of further sequence elements stops. The entire input sequence is encoded into a single context vector, which is then used to generate the output sequence during decoding. With this solution, the models perform reasonably well when the input sequences are short, but the performance of basic encoder-decoder models deteriorates rapidly as the length of the input sequence increases. This issue arises from the necessity to compress all information about the input sequence into one fixed-length context vector. To overcome this issue, we experimented with an extension of the basic encoder-decoder model suggested by Bahdanau et al. [43] that can use all intermediate hidden states of the encoder to construct a context vector.
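The decoding procedure with the <SOS> and <EOS> symbols described above can be illustrated with a minimal greedy decoding loop. This is only a sketch: the `step` callable stands in for one step of a trained decoder RNN, and `toy_step`, `greedy_decode`, and `max_len` are hypothetical names introduced here for illustration, not part of the described system.

```python
SOS, EOS = "<SOS>", "<EOS>"


def greedy_decode(step, initial_state, max_len=50):
    """Generate symbols one at a time until <EOS> or max_len is reached.

    `step(prev_symbol, state)` is assumed to return a pair
    (probabilities, new_state), where `probabilities` maps each candidate
    symbol to its probability. In a real model the initial state would be
    derived from the encoder's context vector.
    """
    output, prev, state = [], SOS, initial_state
    for _ in range(max_len):
        probs, state = step(prev, state)
        prev = max(probs, key=probs.get)  # pick the most probable symbol
        if prev == EOS:
            break  # stop generating once the end symbol appears
        output.append(prev)
    return output


def toy_step(prev_symbol, state):
    """Hypothetical stand-in for a decoder: replays a fixed target string."""
    target, pos = state
    symbol = target[pos] if pos < len(target) else EOS
    return {symbol: 1.0}, (target, pos + 1)


decoded = greedy_decode(toy_step, ("labas", 0))
print("".join(decoded))
```

The `max_len` guard is a design choice for the sketch: it prevents an endless loop if a decoder never emits <EOS>.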

Attention Encoder-Decoder Models
Models with an attention mechanism attempt to solve the problem of working with longer sequences by allowing the decoder to "look back" into the input sequence at each step of the output sequence generation. Figure 2 demonstrates the operation of the attention mechanism. Models with an attention mechanism use weights $a_t$ to calculate the context vector. The number of weights depends on the number of output sequence elements, and the weights are calculated by comparing each decoder state $s_t$ with every hidden state of the encoder $h_i$:
$$a_t(i) = \frac{\exp(\mathrm{score}(s_t, h_i))}{\sum_{j=1}^{T} \exp(\mathrm{score}(s_t, h_j))} \tag{5}$$
where $T$ is the length of the input sequence, $t$ is the index of the element in the output sequence, $i$ is the index of the encoder hidden state, and score is the scoring function. The simplest scoring function is just a dot product between the two vectors:
$$\mathrm{score}(s_t, h_i) = s_t^{\top} h_i \tag{6}$$

Once the scoring vector is known, the context vector $c_t$ can be calculated using Equation (7), where $T$ is the length of the input sequence, $h_i$ is the hidden state generated by processing element $i$ of the input sequence, and $a_t(i)$ is the attention weight assigned to that hidden state at output step $t$:
$$c_t = \sum_{i=1}^{T} a_t(i)\, h_i \tag{7}$$
The resulting context vector is concatenated with the hidden state $s_t$ of the decoder RNN, multiplied by the weight matrix $W_c$, and processed using the hyperbolic tangent function tanh to calculate an updated hidden state $\tilde{h}_t$ (see Equation (8)), which in turn is used to calculate the probability of the next symbol in the output sequence (Equation (9)):
$$\tilde{h}_t = \tanh(W_c [c_t; s_t]) \tag{8}$$
$$p(y_t \mid y_1, \ldots, y_{t-1}, X) = \mathrm{softmax}(W_s \tilde{h}_t) \tag{9}$$
where $W_s$ is the output projection matrix, as in [40]. For training, the following loss function is used:
$$L(X, Y) = -\sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1}, X) \tag{10}$$
where $L$ is the loss function, $X = (x_1, \ldots, x_T)$ is the sequence of phonemes, and $Y = (y_1, \ldots, y_T)$ is the corresponding sequence of graphemes.
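The attention computation described above (dot-product scoring, normalized weights, weighted sum of encoder states, and the tanh-transformed hidden state) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names, the tiny two-dimensional vectors, and the values in `W_c` are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(s_t, encoder_states):
    """Return attention weights and the context vector for decoder state s_t."""
    scores = [dot(s_t, h) for h in encoder_states]  # dot-product scoring
    weights = softmax(scores)                       # normalized weights
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(len(s_t))
    ]                                               # weighted sum of states
    return weights, context


def attentional_state(context, s_t, W_c):
    """tanh(W_c [c_t; s_t]): combine the context with the decoder state."""
    joined = context + s_t  # list concatenation, i.e. [c_t; s_t]
    return [math.tanh(dot(row, joined)) for row in W_c]


# Hypothetical toy values: three encoder states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [1.0, 0.0]
W_c = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]

weights, context = dot_product_attention(s_t, encoder_states)
h_tilde = attentional_state(context, s_t, W_c)
print(weights, context, h_tilde)
```

In this toy case the first and third encoder states receive equal weight, because both have the same dot product with the decoder state.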
During our experiments, we also investigated to what extent strategies proposed by other researchers, more specifically training with reversed sequences [39], using dropout [44], and layer-wise constructions [45], affect the performance of our models. For dropout, we used the probability values proposed in [46]: the probability of keeping the first-layer neurons of the encoder and decoder is 0.8, and for hidden-layer neurons the probability is 0.5.

Computational Complexity Analysis
In practical use, the model prediction is generated by the RNN model. Following the recommendations of Xie et al. [47], the overall computation complexity of the proposed method consists of four component complexities. The computation complexity of the loss function is $O(|C_\varepsilon|)$. Note, however, that the element-wise multiplication, division, and log operations can be implemented in parallel on a graphics processing unit (GPU) in $O(1)$. In contrast, the implementation of CTC [48] based on a forward-backward algorithm has a computation complexity of $O(T \cdot S)$, where $S$ is the sequence length and $T$ is the number of look-ahead steps.
As concerns the attention models, their computation complexity is directly proportional to the number of 'attentions'. The computation complexity of the attention module itself is similar to that of CTC. Meanwhile, the attention mechanism requires the implementation of an additional module. Thus, its memory consumption is larger than that of CTC.

Evaluation of Models
For evaluation of the ASR system, we use accuracy (ACC), Levenshtein distance, and word error rate (WER) metrics.
Accuracy is calculated as:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$
where $TP$ is the count of true positives, $TN$ is the count of true negatives, $FP$ is the count of false positives, and $FN$ is the count of false negatives. Levenshtein distance is used to measure the dissimilarity between two sequences of symbols or words, expressed as the smallest number of edit operations (inserts, deletes, and replacements) required to make both sequences match perfectly:
$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise,} \end{cases} \tag{12}$$
where $\mathrm{lev}_{a,b}(i, j)$ is the edit distance between the first $i$ symbols (words) of sequence $a$ and the first $j$ symbols (words) of sequence $b$, and $1_{(a_i \neq b_j)}$ is a function that evaluates to 0 if $a_i = b_j$ and to 1 if $a_i \neq b_j$. WER is calculated as follows:
$$WER = \frac{S + I + D}{N} \tag{13}$$
where $S$ is the number of words recognized incorrectly, $I$ is the number of words inserted by the ASR system, $D$ is the number of words deleted by the system, and $N$ is the total number of words.
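The Levenshtein distance and WER definitions above can be made concrete with a short sketch. It is a minimal illustration, assuming that WER is obtained from the word-level edit distance (the smallest total of substitutions, insertions, and deletions) divided by the number of reference words; the function names and the example phrases are chosen here for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (strings or lists of words)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # replacement (free if symbols match)
            )
        prev = curr
    return prev[len(b)]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


print(levenshtein("kitten", "sitting"))  # 3
print(word_error_rate("labas rytas visiems", "labas vakaras visiems"))
```

Keeping only two rows of the dynamic-programming table is sufficient here, because each cell depends only on the current and previous rows.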

Dataset
The speech corpus that we used to train and evaluate the models contains 86 h of annotated Lithuanian speech (406 k words) from the Lithuanian speech corpus LIEPA [49]. It is composed of short commands, single-word utterances, and long sentences. There were 248 female and 128 male speakers, 83% of whom were aged 15 to 22 years. The original phoneme system was MBROLA, and all recordings were in 22 kHz, 16-bit mono format. The audio data were automatically processed to extract sequences of phonemes from the raw audio files. The vocabularies of phonemes and graphemes consisted of 91 different phonemes and 61 distinct graphemes, respectively.
The characteristics of the datasets by phoneme sequence length are summarized in Table 1. When measured by their phoneme sequence lengths, most recordings are relatively short (<10 phonemes), although there are some very long recordings (up to 573 phonemes). The data in our corpus exhibit a strong correlation between phoneme and grapheme sequence lengths (Pearson coefficient of 0.997); however, not all sequence pairs have equal length. For our experiments, we split our corpus into 3 datasets: a training dataset (60% of data, ~52 h of speech, 244,860 words), a validation dataset (10% of data, ~9 h of speech, 40,841 words), and a testing dataset (30% of data, ~26 h of speech, 120,387 words). We used the validation dataset to find the most appropriate hyperparameters and evaluated the models on the testing dataset to find how well they generalize to unseen data.
To test to what extent encoder-decoder models can be used to perform automatic Lithuanian speech recognition, we decided to test selected models in isolated speech and long phrase recognition tasks.

Isolated Speech Experiments
The first experiment was done to check if the models can find a mapping between phoneme sequences and their corresponding grapheme sequences. For this experiment we isolated every word (and its phoneme and grapheme sequences) in the training, validation, and testing datasets. We then trained our models using words from the training dataset and used the validation dataset to find the most appropriate hyperparameters. The models were trained using Adam optimization [50] with a learning rate of 0.01. To speed up training, we used the teacher-forcing method and applied a gradient clipping to keep the gradients of artificial neural networks between −5 and +5. We used batches of 32 sequences and made sure that all sequences were roughly the same length by bucketing sequences into buckets by their lengths, so that lengths of sequences in each bucket would not differ by more than 4. Finally, we used one-hot encoding to encode phonemes and graphemes as categorical variables.
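The length bucketing and gradient clipping described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the training code used in the experiments: `bucket_by_length`, `clip_gradients`, and the placeholder data are hypothetical, and fixed-width length bins are only one possible way to keep the lengths within a bucket from differing by more than 4.

```python
from collections import defaultdict


def bucket_by_length(pairs, max_length_gap=4, batch_size=32):
    """Group (phonemes, graphemes) pairs into batches of similar length.

    Sequences are binned by phoneme length into bins of width
    max_length_gap + 1, so lengths inside one bin differ by at most
    max_length_gap. Each bin is then cut into batches of batch_size.
    """
    buckets = defaultdict(list)
    for phonemes, graphemes in pairs:
        buckets[len(phonemes) // (max_length_gap + 1)].append((phonemes, graphemes))
    batches = []
    for key in sorted(buckets):
        items = buckets[key]
        for start in range(0, len(items), batch_size):
            batches.append(items[start:start + batch_size])
    return batches


def clip_gradients(gradients, low=-5.0, high=5.0):
    """Clip every gradient value into the interval [low, high]."""
    return [min(max(g, low), high) for g in gradients]


# Placeholder data: phoneme lists of length 1..12 with dummy symbols.
pairs = [(["p"] * n, "g" * n) for n in range(1, 13)]
batches = bucket_by_length(pairs)
for batch in batches:
    print(sorted(len(phonemes) for phonemes, _ in batch))
print(clip_gradients([-7.0, 0.5, 9.0]))
```

Batching sequences of similar length reduces the amount of padding needed within each batch.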
To find the hyperparameters of the best performing model we tested several different configurations. More specifically, we trained 16 different models, which differed in the number of LSTM layers (from 2 to 5) and number of LSTM cells per layer (32, 64, 128, or 256).
The results of this experiment are presented in Tables 2 and 3. We found that the best performing encoder-decoder model without an attention mechanism had 2 LSTM layers with 128 LSTM cells per layer in the encoder and decoder parts (Seq2Seq-2x128). Our best performing model with an attention mechanism had 2 LSTM layers with 64 bidirectional LSTM cells (Attention-2x64). Furthermore, we found that the models perform slightly better when trained with reversed sequences. We did not find any benefit from using dropout, even with deeper, wider networks. The encoder-decoder models can learn to map phoneme sequences to their corresponding grapheme sequences, i.e., recognize uttered words solely from their phoneme sequences. Furthermore, they can achieve high recognition accuracy (>99%). Interestingly, we found that the sequence-to-sequence and attention models behave similarly on all datasets.
To get a more accurate estimate of model performance and to ensure that our models do not suffer from sampling bias, which may have occurred while splitting the original corpus into 3 datasets, we chose to train and test our models using 10-fold cross-validation. The cross-validation results, as presented in Figure 3, confirm that both the Seq2Seq-2x128 and Attention-2x64 models can attain high accuracy (>99%). The relatively small spread around the median accuracy indicates that the models do not overfit. Therefore, we claim that the encoder-decoder models generalize well and can learn to map phoneme sequences to grapheme sequences.
With longer training of the ANN, the loss function decreases steadily during the learning process, but the error on the validation dataset begins to increase over time. To avoid this, we use an early stopping strategy, which does not select the model parameters obtained at the end of the learning process, but the parameters with which the model reached the minimum value of the loss function on the validation dataset.
We also evaluated the two best models using the Levenshtein distance metric (see Table 4). The low values of the mean and median Levenshtein distance confirm that the Seq2Seq-2x128 and Attention-2x64 models recognize words well and that errors are rare.
Table 4. Comparison of the two best models on the testing dataset (columns: model, loss value, accuracy, mean Levenshtein distance, median Levenshtein distance).
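The early stopping strategy described above, which keeps the parameters from the epoch with the lowest validation loss, can be sketched as follows. This is an illustrative sketch only: the `patience` parameter and the loss values are hypothetical, since the text does not state how long training continued after the validation loss stopped improving.

```python
def early_stopping(validation_losses, patience=3):
    """Return (best_epoch, best_loss, epochs_run) for a stream of losses.

    Training is assumed to stop once `patience` consecutive epochs have
    passed without a new minimum of the validation loss. The parameters
    saved at `best_epoch` would be the ones kept as the final model.
    """
    best_epoch, best_loss = -1, float("inf")
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss  # new minimum: checkpoint here
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss, epochs_run


# Hypothetical validation losses: improvement, then steady degradation.
losses = [0.9, 0.6, 0.5, 0.55, 0.6, 0.7, 0.4]
print(early_stopping(losses))  # (2, 0.5, 6)
```

Note that the last value in the example (0.4) is never reached, which illustrates the trade-off of a small patience value.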

Recognition of Long Phrases
The second experiment was performed to investigate how well these models could be applied to the recognition of long phrases. This is a much more demanding task, as it requires the model not only to learn a mapping between phoneme and grapheme sequences, but also to distinguish which part of a phoneme sequence belongs to which word, i.e., to find the beginnings and endings of words. For this experiment, we sampled sequences from the original datasets by taking every combination of consecutive words in the recording; for example, from an utterance of 3 words (α, β, γ) we created 6 different sequences: (α), (β), (γ), (α, β), (β, γ), (α, β, γ). The phoneme and grapheme sequences corresponding to these words were merged together. Spaces were added to the grapheme sequences to mark the word boundaries. The resulting sequences were filtered by their lengths to create 6 different dataset groups. Each dataset group was bounded by the length of the longest phoneme sequence in that group.
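The sampling of consecutive word combinations described above can be sketched as follows. This is a minimal illustration, assuming each word is stored as a pair of a phoneme list and a grapheme string; the function name, the optional `max_phonemes` filter, and the placeholder phoneme labels are hypothetical.

```python
def build_phrase_samples(words, max_phonemes=None):
    """Create one sample for every run of consecutive words.

    `words` is a list of (phonemes, graphemes) pairs, one per word.
    Phoneme lists are concatenated without any boundary marker, while
    grapheme strings are joined with spaces to mark word boundaries.
    Samples longer than max_phonemes (if given) are dropped.
    """
    samples = []
    n = len(words)
    for length in range(1, n + 1):           # shorter runs first
        for start in range(0, n - length + 1):
            run = words[start:start + length]
            phonemes = [p for word_phonemes, _ in run for p in word_phonemes]
            graphemes = " ".join(word_graphemes for _, word_graphemes in run)
            if max_phonemes is None or len(phonemes) <= max_phonemes:
                samples.append((phonemes, graphemes))
    return samples


# Placeholder utterance of 3 words with dummy phoneme labels.
utterance = [(["p1", "p2"], "ab"), (["p3"], "c"), (["p4", "p5"], "de")]
for phonemes, graphemes in build_phrase_samples(utterance):
    print(phonemes, "->", repr(graphemes))
```

An utterance of n words yields n(n + 1)/2 samples before filtering, which gives 6 samples for the 3-word example.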
We used models from previous experiment (Seq2Seq-2x128 and Attention-2x64) and trained them on each dataset group in order to find (a) whether models can learn to distinguish words when there is more than one word in the utterance; (b) how sequence length affects performance of our models. As in the previous experiment, we used the Adam optimization, however, this time we lowered the learning rate to 0.001. We trained our models using reversed sequences and clipped the gradients using threshold −5 and +5. Each model was trained for 20 epochs with early stopping.
The results of our experiment are presented in Tables 5 and 6. Table 5 shows that the models can achieve high accuracy even when trained with phoneme sequences from multiple words. The best accuracy was achieved by Seq2Seq-2x128 (0.990) and Attention-2x64 (0.992). Table 6 indicates that the models can reach very low WER values. Interestingly, a few configurations (e.g., the Seq2Seq-2x128 model trained on the dataset whose longest phoneme sequence is 28 phonemes, and the Attention-2x64 model trained on the dataset whose longest phoneme sequence is 20 phonemes) show very large differences between the validation and testing accuracy. The same pattern can be observed in the WER results.
Our results indicate that the models can not only map phonemes to the corresponding sequences of graphemes, but also learn to find word boundaries and distinguish words in these sequences. The encoder-decoder models are therefore a viable option for speech recognition from phoneme sequences.
Interestingly, we can observe that the models trained on a dataset, where the longest phoneme sequence is just 8 phonemes, perform much worse than the models which were trained with longer sequences. While it is unclear, what causes models to underperform with short sequences, we speculate that this might be due to a much smaller training dataset.
As with isolated speech experiments, we tested the performance of our models using 10-fold cross-validation. The cross-validation results are presented in Figures 4 and 5.
By comparing Seq2Seq-2x128 and Attention-2x64 results, we can see that the model with an attention mechanism (Attention-2x64) is much more stable with both shorter and longer sequences, as high recognition accuracy was achieved on almost all folds. The basic sequence-sequence encoder-decoder model, on the other hand, does not perform as well as the model with attention mechanism, and the results of each fold have more variability, as indicated by a larger interquartile range (see, e.g., sequences of T = 20 and T = 24).
The WERs that the models achieved during cross-validation are presented in Figures 6 and 7. The Seq2Seq-2x128 model is clearly less stable than the Attention-2x64 model, as indicated by its higher interquartile range.
While in our experiments we limited phoneme sequences to 28 phonemes, we believe that larger encoder-decoder models can be used with even longer sequences.

Discussion
Although the field of automatic Lithuanian language recognition is not new, there are few ASR applications and little related research in the field of Lithuanian language processing. The main reasons are the complexity of the Lithuanian language and the lack of linguistic data. The Lithuanian language has complex pronunciation, a flexible word order, and a rich vocabulary, while its orthography uses the Latin alphabet supplemented with diacritics. Analyzing and comparing existing systems and solutions is not easy: Lithuanian speech recognition systems differ not only in the datasets used for training (phoneme formats, number of recordings, their contents, etc.), but also in their recognition methodologies and purposes (some systems are limited to phoneme recognition in acoustic signals, while others are able to recognize individual words as well).
Our approach considerably simplifies the implementation: the method does not require any heuristic loss functions, explicit phonetic representations, or assumptions of conditional independence, as more traditional approaches do. In classical automatic speech recognition systems, such phoneme sequence recognition is performed using a pronunciation model (dictionary), which allows words to be recognized by their phoneme sequences, and the resulting words are refined by the language model during the decoding step. Building pronunciation and language models is a complex and lengthy process, especially for morphologically rich languages, and the resulting models do not guarantee good results for the ASR system. On the other hand, in end-to-end speech recognition systems, which perform recognition from the beginning to the end of a signal unit, the concept of phonemes does not even exist explicitly, because phonemes are represented only latently in the model.
Another benefit is the lower cost of the resources needed to train the system (especially in comparison to the classic HMM-based solutions), together with a partially simplified and easier to assemble language model (defined by our network structure). This suggests that it is possible to reach a usable recognition accuracy for languages with few publicly available linguistic resources. Our results also suggest that the method could benefit most transcription-based operations, which typically consist of long phrases. ASR systems such as the one proposed in this paper are based on the processing of phoneme sequences, which is simpler than in classical speech recognition systems and offers the flexibility to use pre-existing acoustic models. Learning from phoneme sequences allows the ASR system to master not only the relationship between graphemes and their corresponding phonemes or phoneme groups, but also to learn how to distinguish between individual words, i.e., to form a language model. The close relationship between the phonetic and morphological forms of words in Lithuanian allows the proposed ASR system to achieve high speech recognition results.
Summarizing the results obtained during the experiments, it can be stated that encoder-decoder-type models can effectively learn the relationship between phonemes and their corresponding grapheme sequences and can achieve very good results in the problem of automatic recognition of isolated words from Lithuanian phoneme sequences.

Conclusions
We have demonstrated that the encoder-decoder models can be used to perform automatic Lithuanian speech recognition by learning the mapping between phonemes and corresponding sequences of graphemes. The experimental results indicate that both the basic encoder-decoder model and a more advanced model with an attention mechanism can learn to accurately predict spoken words purely from their phonetic forms with a very high accuracy (up to 0.96).
We demonstrated that these types of models can be applied not only to isolated speech recognition tasks, but also to recognition of long phrases, as they can learn to distinguish words in longer sequences by finding their boundaries (0.998 accuracy achieved with sequences consisting of 28 phonemes). Our results suggest that the model can be successful with even longer sequences.
Compared to the performance of the Lithuanian language Google speech recognizer on the same LIEPA corpora [51], this solution was on average around 30% more efficient.
In the future, we will also apply the proposed approach to other languages that have strong ties between the phonetic and morphological forms of words.