Abstract
We present a semi-automatic framework for transcribing foreign personal names into Lithuanian, aimed at reducing pronunciation errors in text-to-speech systems. Focusing on noisy, web-crawled data, the pipeline combines rule-based filtering, morphological normalization, and manual stress annotation—the only non-automated step—to generate training data for character-level transcription models. We evaluate three approaches: a weighted finite-state transducer (WFST), an LSTM-based sequence-to-sequence model with attention, and a Transformer model optimized for character transduction. Results show that word-pair models outperform single-word models, with the Transformer achieving the best performance (19.04% WER) on a cleaned and augmented dataset. Data augmentation via word order reversal proved effective, while combining single-word and word-pair training offered limited gains. Despite filtering, residual noise persists, with 54% of outputs showing some error, though only 11% were perceptually significant.
Keywords:
practical transcription; character-level transduction; sequence-to-sequence learning; web-crawled data; Lithuanian
MSC:
68T20
1. Introduction
Monolingual text-to-speech (TTS) systems often face difficulties in accurately pronouncing text segments originating from languages other than the system’s target language. Personal names, geographical names, organizations, and other foreign-named entities are major sources of pronunciation errors in TTS. Since foreign words follow different grapheme-to-phoneme (G2P) conversion rules than those of the target language, they must first be identified within the text. Their orthography should then be modified in a way that enables the embedded G2P module to approximate their pronunciation using the phonemic inventory of the target language. This process is referred to as practical transcription [1,2].
This paper focuses on the process of building a data processing pipeline that takes data found on the web as an input and results in the computational models performing automatic transcription of personal names of foreign origin into Lithuanian, which serves as the target language. Several challenges complicate this task: (1) the variability in how initial morphological adaptation of foreign names is already performed in texts, (2) the need to infer both the position and the type of word stress, and (3) the requirement to consider broader linguistic context—extending beyond individual words—during decision-making.
Foreign personal names often undergo primary morphological adaptation to conform to Lithuanian grammatical structures. This is typically achieved by modifying the name and/or appending Lithuanian inflectional suffixes. The State Commission of the Lithuanian Language (State Commission of the Lithuanian Language is the body that issues normative recommendations for the usage of Lithuanian in public, see https://vlkk.lt/vlkk-nutarimai/protokoliniai-nutarimai/rekomendacija-del-autentisku-asmenvardziu-gramatinimo, accessed on 25 November 2024) prescribes different adaptation strategies depending on factors such as the source language, the gender of the person, and the morphological structure of the name. Consequently, texts processed by TTS systems may contain unaltered foreign names (e.g., Silverstone), names appended with Lithuanian suffixes (e.g., Alastairis, Bruce’as), or names transformed by fusion with Lithuanian morphological endings (e.g., Alicios, the genitive form of Alicia). The latter types of names are no longer tokens in the original language but mixed-language tokens. Despite these modifications, we refer to all such names as “original names,” as they appear in their raw form from the perspective of the TTS system.
Word stress is a critical perceptual feature in spoken Lithuanian, as it enhances the prominence of the stressed syllable. Lithuanian distinguishes between acute and circumflex accents, marked with specific diacritics over the stressed sound. However, such stress markings are generally absent from standard orthography and are only used in specialized texts where correct pronunciation is essential. This presents an additional challenge for transcription models, which must not only rewrite the text to align with Lithuanian G2P rules but also predict the correct stress position and accent type.
A further complication arises from the fact that transcription is often ambiguous or exhibits one-to-many mappings, particularly when the origin language of a name is unknown. For example, the name Charles may be transcribed as Čárlzas if the individual is English, or Šárlis if French. Similarly, Michael may become Máiklas (English) or Michaèlis (German). One approach to resolving such ambiguities is to incorporate a broader linguistic context—potentially involving multiple adjacent foreign words—into the transcription process.
Several approaches can be considered to address the multilingual proper name transcription problem, which we categorize as follows: (1) the multi-stage approach, (2) the Generative AI approach, and (3) the end-to-end machine learning approach.
1.1. Multi-Stage Approach
In the multi-stage approach, the transcription task is divided into several sequential steps: (i) identification of the source language of loanwords, (ii) text rewriting based on the identified source language, and (iii) stress placement. This approach has certain advantages. First, the initial step can be framed as a variant of the well-established language identification (LangID) task, for which numerous methods exist, including Naïve Bayes trained on n-grams [3], using sub-word features and vector quantization [4], neural networks [5], transformers [6], and LSTM-based models [7,8]. Second, the rewriting step can potentially achieve high accuracy if it leverages a comprehensive set of deterministic rules for adapting foreign names in the target language.
However, this approach also has significant limitations:
- Proper names are very short text segments, and LangID performance is known to degrade on short text segments [9,10].
- The TTS system needs to identify and to transcribe mixed-language tokens. This poses a challenge, as no off-the-shelf LangID tools are designed to handle such inputs. Addressing this requires developing a specialized LangID system, for which neither pre-defined rules nor training data currently exist.
- For many languages, including Lithuanian, no complete and explicit rule set exists for adapting foreign names—only broad linguistic guidelines are available. Constructing such a rule set would require substantial linguistic expertise and manual effort.
- Error propagation is inherent to multi-stage approaches: mistakes in earlier stages (e.g., misidentified language) are likely to impact subsequent steps, thereby degrading the overall system performance.
1.2. Generative AI Approach
In the Generative AI approach, the transcription task is divided into two stages: (i) querying a large language model (LLM) to generate the International Phonetic Alphabet [11] (IPA) transcription of a given personal name, and (ii) converting the resulting IPA representation into Lithuanian orthography using a specialized IPA-to-grapheme converter. The main advantages of this approach include the wide availability of pre-trained LLMs and the relative ease of developing an IPA-to-grapheme converter, which can be based on accurate and deterministic phoneme-to-letter conversion rules.
Table 1 presents several examples of original names and their corresponding IPA transcriptions, as generated by GPT-4o [12], in response to queries such as “What is the IPA pronunciation of [person’s name]?” and “Please transcribe the following name in IPA symbols: [person’s name].”
Table 1.
Examples of personal names and IPA transcriptions provided by GPT-4o.
As seen in rows 1–8, the LLM generally provides accurate transcriptions for unmodified personal names, regardless of the language of origin. However, for mixed-language tokens that have already undergone some degree of morphological adaptation—common in Lithuanian texts—such as in rows 9 and 10, the model fails to produce correct IPA forms (the acceptable transcriptions of Laurent’as Fabiusas and Ralphas Fiennesas are /lʲɔˈra:nɐs fɐˈbʲʊsɐs/ and /ˈreɪfɐs ˈfaɪnzɐs/, respectively). This limitation indicates that pre-trained LLMs alone may not be sufficient for the transcription task, particularly when handling non-standard forms. Fine-tuning [13] or instruction-tuning LLMs [14,15] could potentially improve their performance on the task in question.
1.3. End-to-End Machine Learning Approach
In the end-to-end machine learning approach, the transcription task is performed in a single step, following a preparatory (offline) phase in which a model is trained on a dataset of input–output pairs—original names and their corresponding transcriptions. The trained model effectively encodes a deterministic rule set that implicitly integrates language identification, text rewriting, and word stress placement.
The primary advantage of this approach is that, once trained, the model can be directly applied to previously unseen inputs with minimal additional processing. However, there are several disadvantages: (i) substantial effort may be required to compile a high-quality, representative, and sufficiently large training dataset; (ii) the learned rules are typically encoded in model parameters (e.g., neural network weights), which are not easily interpretable or modifiable; and (iii) any errors or biases in the training data may be propagated throughout the model’s predictions, with limited opportunities for targeted correction.
While acknowledging the validity of the first two approaches, described in Section 1.1 and Section 1.2, we adopt the third approach and formulate transcription as a supervised end-to-end machine learning problem. This choice is based on the assumption that it is feasible to collect a sufficiently large and representative dataset of personal names—originating from various source languages—and their corresponding Lithuanian transcriptions from online sources. Our objective is to investigate methods for cleaning and structuring this data, as well as to train and evaluate different models capable of learning the mapping from original to transcribed forms.
This research seeks to address the following key questions:
- Is it possible to develop a fully automated data processing pipeline that converts raw web-crawled data into a dataset of sufficient quality for training practical transcription models?
- What level of transcription accuracy can be achieved using such automatically generated data, and how does this accuracy compare to that of human transcribers?
- To what extent do models trained on word pairs outperform those trained on single-word inputs?
- How effective are different data augmentation strategies in enhancing the performance of transcription models?
To the best of our knowledge, the task of training end-to-end models for the transcription of foreign words into Lithuanian has not been addressed in previous research.
1.4. Related Work
Early approaches to translation and transcription relied on handcrafted, context-dependent phonological rules informed by linguistic expertise [16,17,18]. While these methods offered transparency and control, they were labor-intensive and lacked robustness across language domains [19]. The shift toward data-driven techniques began with statistical machine translation models such as n-gram and maximum entropy models [20], which leveraged aligned parallel translation corpora. However, these approaches struggled in low-resource settings due to data sparsity.
Modern translation methods primarily utilize sequence-to-sequence (seq2seq) models, which revolutionized the handling of variable-length input and output sequences. The introduction of encoder–decoder architectures based on recurrent neural networks (RNNs) by Sutskever et al. [21] enabled end-to-end learning for machine translation tasks. The subsequent addition of attention mechanisms by Bahdanau et al. [22] significantly improved performance by allowing models to dynamically focus on relevant parts of the input during decoding.
Due to the limitations of RNNs in modeling long-range dependencies, alternative architectures emerged, including convolutional neural networks [23] and self-attention-based models. The Transformer model introduced by Vaswani et al. [24] replaced recurrence with multi-head self-attention, enabling parallelization and achieving state-of-the-art results across various translation tasks. These models also support transfer learning across scripts and languages, with multilingual pretrained encoders such as mBERT and XLM-R proving especially effective [25,26].
Character-level translation, such as the transcription task described in this paper, is still largely dominated by attention-based LSTM seq2seq models [27,28], largely because the task typically involves monotonic alignments between input and output sequences that share significant character-level similarity. Unlike tasks requiring semantic understanding or long-range dependencies, transcription benefits less from the Transformer’s capacity to model long-range dependencies. Nevertheless, Ref. [29] demonstrated that transformer performance on character-level tasks is highly sensitive to batch size, and that, given sufficiently large batches, transformers can outperform RNN-based models even in these settings.
In this research, we train seq2seq models—including LSTM-based encoder–decoder architectures [28] and transformers [29], both of which have shown strong results in multilingual grapheme-to-phoneme (G2P) tasks [30]—to solve the transcription problem. We also investigate the effects of data augmentation, which has been shown to improve model accuracy in related tasks. Differently from Ref. [30], our task presents additional challenges: we begin with a much larger and noisier dataset, must handle conflicting transcription labels, and incorporate stress marker prediction, adding complexity to both data preprocessing and model training.
The main contributions of this paper are as follows:
- We propose a novel semi-automatic data processing pipeline that transforms raw web-crawled data into a training set for practical transcription tasks. This pipeline is demonstrated on Lithuanian—a morphologically rich and challenging language—where it effectively processes, filters, and normalizes data containing inflected and/or mixed-language tokens. The pipeline is potentially fully automatable for languages that do not require modeling of word stress type and location.
- We show that an end-to-end practical transcription model can be trained using this dataset, despite residual noise in the processed data. Although exact word error rate (WER) estimation is hindered by noisy reference labels in the test set, we report an upper-bound WER of approximately 19%, with the actual performance likely being significantly better—potentially nearly half that value.
2. Materials and Methods
2.1. Data
We hypothesized that data for the transcription task could be collected from online sources. It has been observed that Lithuanian news portals frequently present foreign person names in their original form followed by a transcription (intended pronunciation) in parentheses. This practice helps readers unfamiliar with foreign orthography approximate the correct pronunciation.
To leverage this pattern, we performed web scraping across major Lithuanian news portals, collecting 133,254 text segments (68,167 unique patterns) from public articles spanning the past ten years. The data was extracted using a Perl-style regular expression designed to capture two title-cased words followed by another pair of title-cased words enclosed in parentheses (the raw material is made openly available and can be downloaded from Clarin LT repository at https://clarin.vdu.lt/xmlui/handle/20.500.11821/68, accessed on 7 May 2025).
[[:upper:]][-\x27[:lower:]]+\s+[[:upper:]][-\x27[:lower:]]+\s+\([[:upper:]][-\x27[:lower:]]+\s+[[:upper:]][-\x27[:lower:]]+\s*\)
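For readers who prefer a runnable form, a rough Python `re` equivalent of the POSIX pattern above is sketched below. The uppercase/lowercase ranges are an assumed approximation of the `[[:upper:]]`/`[[:lower:]]` classes for Lithuanian text, and the sample sentence is purely illustrative.

```python
import re

# Rough Python equivalent of the crawling pattern (illustrative only;
# the letter ranges below are an assumed approximation of the POSIX
# [[:upper:]]/[[:lower:]] classes for Lithuanian text).
UPPER = "A-ZĄČĘĖĮŠŲŪŽ"
LOWER = "a-ząčęėįšųūž"
WORD = rf"[{UPPER}][-'{LOWER}]+"
PATTERN = re.compile(rf"{WORD}\s+{WORD}\s+\(\s*{WORD}\s+{WORD}\s*\)")

text = "Premjeras susitiko su Alexis Tsipras (Aleksis Cipras) vakar."
match = PATTERN.search(text)
print(match.group(0))  # → "Alexis Tsipras (Aleksis Cipras)"
```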
This raw dataset [31], however, proved to be noisy and ambiguous, requiring extensive preprocessing before it could be used for model training. The main sources of noise included:
- Unintended Matches: The regular expression sometimes captured pairs of proper nouns that were not person names and their transcriptions, but rather unrelated entities, such as person–location, person–actor, person–sports team, or location–location combinations (see Table 2, rows 1–5).
Table 2.
Sample patterns extracted from web-scraped Lithuanian news texts.
- Word Order Inversion: The word order in one of the name pairs was sometimes reversed relative to the other (see Table 2, row 8).
- Inflection Mismatch: The original name with concatenated or fused inflectional suffixes could be in a different grammatical case than the transcribed name. For instance, the original name Alexis Tsipras is in the nominative, while the transcription Aleksiui Ciprui is in the dative case (Table 2, row 15). These mismatches were particularly challenging when intertwined with role inversion. For example, the name pair Alberto Alonso (Table 2, row 16) could be interpreted as a non-inflected original form, an adapted original form (e.g., the genitive of Albert Alons), or a transcription in the genitive case.
- Multiple Inflections: A single non-inflected original name could correspond to multiple transcriptions, each in a different grammatical case depending on the context (Table 2, rows 11–14).
- Inconsistent Labeling and Human Errors: Different transcriptions of the same original name were observed due to the varying linguistic knowledge of human editors (Table 2, rows 17–19). Additionally, spelling errors appeared in both original names (Table 2, row 20) and transcriptions (Table 2, row 21).
2.2. Method
Our transcription system follows a structured data processing pipeline, as illustrated in Figure 1. It begins with web-scraped textual data and ends with computational models capable of automatically transcribing personal names of foreign origin into Lithuanian.
Figure 1.
Overview of the data processing pipeline from raw web data to trained transcription models.
The methodology consists of several sequential steps:
- Preprocessing and cleaning of web-scraped data, including filtering, normalization, and reordering of raw data patterns.
- Adding word stress location and type information to the cleaned data.
- Generation of multiple training sets, based on different configurations and augmentation strategies.
- Training of machine learning models using a range of architectures and training setups, and evaluation of their accuracy on held-out test data.
2.2.1. Data Preprocessing
Preprocessing was essential to address the various noise sources described in Section 2.1 and to prepare a clean, structured dataset for model training.
The first preprocessing step involved alignment of original and transcribed strings, which served as a filtering mechanism for the raw patterns. This alignment assumed that valid transcriptions share structural and phonetic similarities—specifically, that the two strings should be composed of corresponding characters or character groups.
For instance, Table 2 (rows 1–5) illustrates clearly invalid alignments, where the initial capital letters of the first and second words in each pair do not match the capitals in the parenthetical pair. However, this matching should not be interpreted as strict character identity, since valid transcriptions often include predictable orthographic adaptations (see rows 11, 15, and 17); for instance, matching the substrings Ch, Ts, C, and G to Č, C, K, and Dž, respectively, should be permitted.
To accommodate this, we implemented a dynamic programming alignment procedure called Match(). It identifies correspondences between the original and transcribed forms based on approximately 600 permitted edit operations, including both context-free and context-sensitive substitutions. These operations were non-symmetric—for example, replacing “ault” with “o” is allowed (as in Renault → Reno), but the reverse is not permitted, since such transformations are not linguistically plausible in the Lithuanian context.
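The core idea of Match() can be sketched as a dynamic program over permitted rewrite operations. The operation set below is a tiny hypothetical subset of the roughly 600 operations used in the real system, shown only to illustrate the non-symmetric alignment logic.

```python
from functools import lru_cache

# Minimal sketch of the Match() alignment: a DP over permitted,
# non-symmetric rewrite operations. OPS is a tiny hypothetical subset
# of the ~600 operations used in the real system.
OPS = {("ault", "o"), ("ch", "č"), ("x", "ks"), ("c", "k"), ("w", "v")}

def align_match(src: str, dst: str) -> bool:
    """True if src can be rewritten into dst using identity copies
    or the permitted substitutions in OPS."""
    src, dst = src.lower(), dst.lower()

    @lru_cache(maxsize=None)
    def align(i: int, j: int) -> bool:
        if i == len(src) and j == len(dst):
            return True
        # identity copy of one character
        if i < len(src) and j < len(dst) and src[i] == dst[j] and align(i + 1, j + 1):
            return True
        # permitted multi-character, non-symmetric substitutions
        return any(
            src.startswith(a, i) and dst.startswith(b, j) and align(i + len(a), j + len(b))
            for a, b in OPS
        )

    return align(0, 0)

print(align_match("Renault", "Reno"))   # → True  (via "ault" → "o")
print(align_match("Trump", "Berlin"))   # → False
```

Because the operations are directional, the sketch accepts Renault → Reno but would reject the reverse rewriting, mirroring the constraint described above.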
Alignment was further complicated by morphological variation: both the original and transcribed names could appear in various inflected forms, sometimes with differing grammatical cases and suffixes (e.g., Table 2, rows 11 and 15). To mitigate this, morphological stemming was applied before alignment, reducing both strings to their stemmed forms. Due to possible ambiguities in stemming, a search over the space of candidate stem pairs was necessary.
The simplified pseudo-code below (see Figure 2) illustrates the logic of the NormalizeWordOrder() function. This function takes four word stems: s1, s2, s3, and s4, and infers the roles (original or transcription) and the correct word order based on successful alignments.
Figure 2.
Simplified pseudo-code of the word role and order normalization function, which outputs a normalized transcription pattern from four word stems.
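The role and order inference can be sketched in Python as follows; `align_ok` is a hypothetical stand-in for the Match() alignment routine, and the toy matcher used in the example checks only first letters.

```python
# Sketch of NormalizeWordOrder(): try the plausible role/order assignments
# of the four stems and keep the first one whose two alignments both
# succeed. align_ok is a hypothetical stand-in for the Match() routine.
def normalize_word_order(s1, s2, s3, s4, align_ok):
    candidates = [
        ((s1, s2), (s3, s4)),  # s1 s2 original, s3 s4 transcription
        ((s3, s4), (s1, s2)),  # roles inverted
        ((s1, s2), (s4, s3)),  # word order inverted in transcription
        ((s3, s4), (s2, s1)),  # both roles and word order inverted
    ]
    for (o1, o2), (t1, t2) in candidates:
        if align_ok(o1, t1) and align_ok(o2, t2):
            return o1, o2, t1, t2
    return None  # no consistent reading: reject the pattern

# usage with a toy first-letter matcher (t may map to c, as in Tsipras → Cipras):
toy = lambda a, b: (a[0].lower() == b[0].lower()
                    or (a[0].lower(), b[0].lower()) in {("t", "c")})
result = normalize_word_order("Alexis", "Tsipras", "Cipras", "Aleksis", toy)
print(result)  # → ('Alexis', 'Tsipras', 'Aleksis', 'Cipras')
```

Here the transcription pair arrives in reversed order and is restored to the canonical O1 O2 → T1 T2 layout.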
Pattern filtering based on the alignment between the original text and its transcription helped eliminate incorrect transcription patterns and normalize instances of role and word order inversion. It also contributed to filtering out some patterns containing spelling errors (e.g., rows 20–21 in Table 2). However, many inconsistent transcriptions remained (see Table 3). This was largely due to the aligner having too much flexibility, allowing all types of editing operations to be applied simultaneously—even when those operations were more appropriate for different languages.
Table 3.
Complete list of raw patterns and their normalized versions related to the original name Alicia Keys.
The process of morphological analysis, inflectional paradigm inference, and case normalization significantly reduced both inflectional mismatches and the occurrence of multiple transcriptions derived from a single original name—common in Lithuanian due to the presence of inflectional suffixes. This analysis was performed using purpose-built tools that relied on inflectional endings to make decisions.
As a result of this step, the cases listed in Table 2 (rows 11–14) were consolidated into a single normalized form: Charles Darwin (Čarlz Darvin), appearing 23 times in total.
To further minimize the issue of one-to-many transcription mappings, statistical pattern filtering was applied. A simple rule was introduced: transcriptions occurring less frequently than a specified threshold were discarded (see Table 4). This threshold was empirically set to 5% of the total number of occurrences of a given original name. The goal was to strike a balance between removing as much human-induced noise as possible while preserving legitimate transcription variants found across different languages.
Table 4.
The effects of statistical filtering on several frequently occurring names. Frequencies are shown in parentheses.
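The 5% thresholding rule can be sketched as follows; the name counts in the example are illustrative, not taken from the actual dataset.

```python
from collections import Counter

# Sketch of the statistical filter: a transcription variant is kept only
# if it accounts for at least 5% of its original name's total occurrences.
THRESHOLD = 0.05

def filter_variants(pairs):
    pairs = list(pairs)  # (original, transcription) instances, with repetition
    totals = Counter(orig for orig, _ in pairs)
    counts = Counter(pairs)
    return {
        (orig, tr): n for (orig, tr), n in counts.items()
        if n >= THRESHOLD * totals[orig]
    }

data = ([("Charles Darwin", "Čarlz Darvin")] * 23
        + [("Charles Darwin", "Šarl Darvin")] * 1)   # 1/24 ≈ 4.2% → dropped
kept = filter_variants(data)
print(kept)  # → {('Charles Darwin', 'Čarlz Darvin'): 23}
```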
2.2.2. Gathering Stress Data
In Lithuanian, stress placement and type are determined by the morphological properties of a word form. While there are algorithms that can infer stress location and type from Lithuanian orthography [32,33], and practical implementations are available [34], these techniques are dictionary-based and require that word forms be annotated with their morphological properties and accentuation paradigms. However, both the original and the transcribed word forms in our study fall outside the scope of standard Lithuanian dictionaries.
As a result, stress annotation had to be performed manually, making it the only step that prevents our data processing pipeline from being fully automatic. Linguists were instructed to place stress marks on the transcribed forms to best reflect the pronunciation of the original language. To speed up the process, stress was annotated only on isolated, single-word transcriptions in the nominative case. These annotations were then algorithmically propagated to corresponding word-pair patterns and other inflected forms.
This manual annotation process inevitably introduced some noise into the data. The linguists were not proficient in the pronunciation and accentuation of all source languages encountered in the dataset, and in many cases had to choose between multiple plausible options. For example, the name Marielė, which may originate from English (Mariel) or French (Marielle), would require different stress placements—Mãrielė for the English version and Marièlė for the French one. Without context, such distinctions were impossible to make.
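The algorithmic propagation of manual annotations can be illustrated with a simple positional heuristic. This is a sketch under a simplifying assumption (copy the stress mark at the same character offset); the actual propagation follows the inflectional paradigm.

```python
import unicodedata

# Grave, acute, and tilde: the three Lithuanian stress diacritics.
STRESS = {"\u0300", "\u0301", "\u0303"}

def propagate_stress(stressed_nom: str, inflected: str) -> str:
    """Copy the stress mark of a stressed nominative form onto an
    unstressed inflected form at the same character offset (a simplifying
    assumption; real propagation is paradigm-aware)."""
    nfd = unicodedata.normalize("NFD", stressed_nom)
    pos = next(i for i, c in enumerate(nfd) if c in STRESS)
    mark = nfd[pos]
    out = unicodedata.normalize("NFD", inflected)
    return unicodedata.normalize("NFC", out[:pos] + mark + out[pos:])

print(propagate_stress("Luĩsas", "Luiso"))  # → "Luĩso"
```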
2.2.3. Generating Training and Test Sets
The filtered, normalized, and stressed data comprised 118,149 total patterns (51,429 unique ones), each of the form O1 O2 → T1 T2, where O1 and O2 are original personal names—written either with or without primary morphological adaptation—and T1 and T2 are their respective transcriptions.
Two base datasets, W1 and W2, were derived from this normalized data, as shown in Table 5. To evaluate the effectiveness of data augmentation in improving transcription accuracy, two additional augmented datasets—W2+R2 and W2+R2+W1—were created.
Table 5.
Datasets for the sequence-to-sequence transcription task.
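Assuming the R2 augmentation reverses the word order on both the input and the output side of each word-pair instance (our reading of the word-order-reversal strategy), its construction can be sketched as:

```python
# Sketch of the R2 augmentation: every word-pair instance contributes an
# additional instance with the word order reversed on both sides
# (an assumed reading of the word-order-reversal strategy).
def reverse_augment(pairs):
    return list(pairs) + [((o2, o1), (t2, t1)) for (o1, o2), (t1, t2) in pairs]

w2 = [(("Charles", "Darwin"), ("Čarlz", "Darvin"))]
w2_r2 = reverse_augment(w2)
print(w2_r2[1])  # → (('Darwin', 'Charles'), ('Darvin', 'Čarlz'))
```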
We chose to retain conflicting transcription labels, as automatically removing them would have required excluding instances from less frequently used languages. Importantly, the frequency with which a pattern appears in the dataset is likely correlated with the correctness of its transcription. Therefore, we opted not to reduce the datasets to unique patterns. Instead, instances were preserved with their original frequencies, allowing repetition.
Because of these conflicting labels, the maximum achievable accuracy is less than 100%. To estimate an upper bound, we calculated the Oracle accuracy, which assumes perfect prediction for non-conflicting cases and selects the most frequent label in cases of conflict (see Table 5).
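The Oracle accuracy described above amounts to crediting each original name with its majority label; the instance counts in the example below are illustrative only.

```python
from collections import Counter

# Oracle accuracy: perfect prediction for non-conflicting originals; for
# conflicting originals, at best the most frequent label can be predicted.
def oracle_accuracy(instances):
    by_orig = {}
    for orig, tr in instances:
        by_orig.setdefault(orig, Counter())[tr] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_orig.values())
    return correct / len(instances)

# toy instance counts (illustrative, not the paper's data)
data = ([("Charles", "Čarlzas")] * 8 + [("Charles", "Šarlis")] * 2
        + [("Darwin", "Darvinas")] * 5)
print(oracle_accuracy(data))  # → 0.8666... (13 of 15 instances)
```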
All datasets were randomly split into training (90%) and testing (10%) subsets. To maintain consistency, the split ensured that not only identical patterns but also all inflectional variants of a given pattern remained within the same partition. For instance, the accusative and genitive forms

Louisą Zamperini → Luĩsą Zamperìni
Louiso Zamperini → Luĩso Zamperìnio

were placed in the same partition of the W2 dataset. However, this constraint was not enforced at the single-word level; thus, data instances like O1 O2 → T1 T2 and O1 O3 → T1 T3 could appear in the same split even if they shared the same sub-component O1 → T1.
The W2 dataset was partitioned prior to augmentation, ensuring that W2, W2+R2, and W2+R2+W1 shared the same patterns in training and test subsets. This allowed for fair comparisons between non-augmented and augmented models, all evaluated using W2’s test set.
In all experiments, we encoded Lithuanian stress markers as unique symbols appended after the stressed character. Before training, each instance was split into individual characters. The input symbol set contained 64 unique characters, while the output symbol set contained 38.
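The stress-marker encoding can be sketched via Unicode decomposition. This is a sketch under the assumption that acute, grave, and tilde are the only detachable marks; other diacritics that belong to the orthography itself (e.g., the ogonek in ą) stay attached to their base character.

```python
import unicodedata

# Stress diacritics become standalone symbols after the stressed
# character; orthographic diacritics (e.g., ogonek) stay attached.
STRESS = {"\u0300", "\u0301", "\u0303"}  # grave, acute, tilde

def encode(word: str) -> list[str]:
    out = []
    for ch in unicodedata.normalize("NFD", word):
        if ch in STRESS:
            out.append(ch)  # emit stress mark as a separate symbol
        elif out and unicodedata.combining(ch):
            # reattach non-stress combining marks (e.g., ogonek in ą)
            out[-1] = unicodedata.normalize("NFC", out[-1] + ch)
        else:
            out.append(ch)
    return out

print(encode("Luĩsą"))  # → ['L', 'u', 'i', '\u0303', 's', 'ą']
```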
2.2.4. Training Transcription Models
We trained three different transcription models, each representing a distinct approach to sequence modeling: a symbolic model based on weighted finite-state transducers (WFSTs), a recurrent neural network (RNN)-based sequence-to-sequence (seq2seq) model, and a Transformer-based seq2seq model. These models were selected to explore the trade-offs between model interpretability, training efficiency, and transcription accuracy.
Our first model was a weighted finite-state transducer (WFST), a symbolic approach rooted in probabilistic automata. The model architecture follows the pair n-gram framework [35] and builds on earlier work such as the hidden Markov model-based G2P framework [36]. As in Lee et al. [37], transcriptions are generated by composing the input string with a trained transducer, which yields a lattice of possible output sequences annotated with their associated probabilities. The best hypothesis is then selected using Viterbi decoding [38].
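The lattice decoding step can be illustrated with a pure-Python toy: candidate output symbols per position are scored under an assumed bigram model, and the best-scoring path is selected. All probabilities below are hypothetical; in the real system they are arc weights of the trained pair n-gram transducer.

```python
import math

# Toy Viterbi decode over a transcription lattice (scores hypothetical;
# a real pair n-gram WFST encodes these as arc weights).
lattice = [["č", "k"], ["a"], ["r", "l"]]   # candidate symbols per position
bigram = {                                   # assumed P(next | previous)
    ("<s>", "č"): 0.7, ("<s>", "k"): 0.3,
    ("č", "a"): 0.9, ("k", "a"): 0.5,
    ("a", "r"): 0.8, ("a", "l"): 0.2,
}

def viterbi(lattice, bigram):
    # prev maps last symbol -> (best path ending in it, its log-score)
    prev = {"<s>": ([], 0.0)}
    for candidates in lattice:
        cur = {}
        for c in candidates:
            cur[c] = max(
                ((hist + [c], s + math.log(bigram.get((last, c), 1e-9)))
                 for last, (hist, s) in prev.items()),
                key=lambda x: x[1],
            )
        prev = cur
    best_path, _ = max(prev.values(), key=lambda x: x[1])
    return "".join(best_path)

print(viterbi(lattice, bigram))  # → "čar"
```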
We used the OpenGrm [39] toolkit and the OpenFst-based [40] Pynini library [41] to train the transducer. Different n-gram orders (n = 2 to 10) were evaluated to balance context sensitivity with model complexity.
The second model we explored is a neural seq2seq architecture based on long short-term memory (LSTM) units. The encoder consisted of a single bidirectional LSTM layer, while the decoder was a single unidirectional LSTM. An attention mechanism [27] bridges the encoder and decoder, enabling the model to dynamically focus on relevant parts of the input sequence during decoding.
We used the Fairseq [42] implementation provided by Ref. [30], with default hyperparameter settings for key training parameters such as learning rate, weight decay, gradient clipping, and label smoothing [43]. For training, we used an Adam optimizer with inverse square root learning rate scheduling and applied early stopping based on validation loss.
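The inverse square root schedule can be sketched as below; the warmup length and peak learning rate are illustrative placeholders, not the Fairseq defaults we actually used.

```python
# Sketch of the inverse square root learning-rate schedule: linear warmup
# to peak_lr, then decay proportional to 1/sqrt(step).
# (warmup_steps and peak_lr here are illustrative, not the paper's values)
def inverse_sqrt_lr(step: int, peak_lr: float = 1e-3, warmup_steps: int = 4000) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    return peak_lr * (warmup_steps / step) ** 0.5     # inverse sqrt decay

print(inverse_sqrt_lr(4000))   # → 0.001
print(inverse_sqrt_lr(16000))  # → 0.0005
```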
Our third model was a Transformer-based neural seq2seq model, optimized for character-level transduction and described by Wu et al. [29]. This architecture replaces recurrence with multi-head self-attention mechanisms and position-wise feedforward layers, enabling efficient parallel training and improved modeling of long-range dependencies.
We adopted a four-layer Transformer architecture for both the encoder and the decoder, each using pre-layer normalization to improve training stability. Again, we used the Fairseq implementation provided by Ref. [30]. Dropout and label smoothing were applied to mitigate overfitting.
For both neural models and for each data set, we conducted hyperparameter tuning on a held-out development set. The following parameters were adjusted: the dimensionality of the encoder embedding layer (EEL), encoder hidden layer (EHL), decoder embedding layer (DEL), and decoder hidden layer (DHL), along with the dropout rate (DOUT) and batch size (BSIZE). Grid search was used to identify optimal configurations. The beam search decoding algorithm was employed with beam sizes ranging from 3 to 10, depending on the model size and sequence complexity. Hyperparameter settings were kept consistent across models as much as possible to ensure fair comparison. Detailed results of the ablation experiments for the Transformer model are provided in Appendix A.
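The grid search over the tuned parameters can be sketched as follows; the candidate values are illustrative placeholders, not the actual search space.

```python
import itertools

# Illustrative grid over the tuned hyperparameters; in practice each
# configuration is trained and scored on the held-out development set.
grid = {
    "EEL": [128, 256], "EHL": [512, 1024],
    "DEL": [128, 256], "DHL": [512, 1024],
    "DOUT": [0.1, 0.3], "BSIZE": [256, 1024],
}
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))  # → 64
```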
2.3. Evaluation Metrics
To evaluate and compare the performance of the transcription models, we employed word error rate (WER) as the primary metric. WER was defined as the proportion of word instances in which the predicted transcription differed from the reference transcription, with a lower WER indicating superior model performance. In the calculation of WER, each word-pair instance was treated as comprising two discrete word forms.
To further disentangle the contribution of stress placement errors from other symbol-level errors within the WER, we introduced an auxiliary metric termed the stress-compensated word error rate (WER-s). The WER-s is computed analogously to the WER, but after all stress markers have been removed from both the predicted and the reference transcriptions. The difference between the WER and the WER-s thus reflects the proportion of total errors attributable exclusively to incorrect stress placement (i.e., position and/or type).
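The two metrics can be sketched as follows, taking the stress marks to be the acute, grave, and tilde combining diacritics:

```python
import unicodedata

STRESS = {"\u0300", "\u0301", "\u0303"}  # grave, acute, tilde stress marks

def strip_stress(word: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFD", word) if c not in STRESS)

def wer(pred: list[str], ref: list[str], ignore_stress: bool = False) -> float:
    """Proportion of word forms whose prediction differs from the reference."""
    if ignore_stress:  # WER-s: compare after removing stress markers
        pred = [strip_stress(w) for w in pred]
        ref = [strip_stress(w) for w in ref]
    return sum(p != r for p, r in zip(pred, ref)) / len(ref)

ref = ["Luĩsa", "Zamperìni"]
hyp = ["Luisa", "Zamperìni"]              # stress missing on the first word
print(wer(hyp, ref))                       # → 0.5  (WER)
print(wer(hyp, ref, ignore_stress=True))   # → 0.0  (WER-s)
```

In this toy case the entire WER is attributable to stress placement, so WER-s drops to zero.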
3. Results
3.1. Model Accuracy
Weighted finite-state transducer, neural encoder–decoder, and transformer models were successfully trained. On the training sets, the word error rates (WERs) achieved by these models were close to the oracle WER values reported in Table 5. However, their performance on the test sets showed significantly higher WERs, indicating a drop in generalization. Table 6 presents the best results obtained for each model type and dataset, based on a search over the respective hyperparameter spaces.
Table 6.
Comparison of model accuracy on base and augmented transcription tasks. Estimated word error rate (WER) and stress-adjusted WER (WER-s), along with their 95% confidence intervals, are reported. The bold number indicates a statistically significant difference between the best and the second-best result.
The WFST model demonstrated the lowest performance among the evaluated approaches. The considerable gap between the WER and the WER-s (see Figure 3) indicates that it struggled particularly with accurately predicting stress location and type compared to the neural models. Detailed analysis of its output revealed frequent issues, such as missing or multiple stress markers, even though each training instance contained exactly one stress marker. Additionally, data augmentation negatively affected the WFST model, likely due to increased distributional mismatch between the training and test data.
Figure 3.
Word error rate (WER) and stress-adjusted word error rate (WER-s) estimated on the test partitions for WFST-based models trained on single-word (a) and word-pair (b) datasets.
In contrast, both neural models—encoder–decoder and transformer—performed significantly better. Across all training sets, the best or near-best results were achieved using a smaller encoder (EEL = 128, EHL = 512) and a larger decoder (DEL = 256, DHL = 1024).
Overall, the transformer model outperformed the encoder–decoder model, particularly in the single-word transcription task (W1), where the differences in WER and WER-s are statistically significant. For augmented word-pair tasks, the transformer also achieved a significantly lower WER. However, the differences in WER-s were not statistically significant, suggesting that both models are comparable in basic character-to-character transcription, while the transformer demonstrates superior handling of stress regularities.
Interestingly, both models performed worse on the W2+R2+W1 dataset compared to W2+R2, suggesting that not all types of data augmentation are beneficial. We attribute this decline to the mismatch between training and test data distributions—specifically, the inclusion of single-word training instances in W2+R2+W1, while the test set contained only word-pair instances.
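For concreteness, the word-order-reversal augmentation behind the R2 set can be sketched as follows. The exact construction is not spelled out above, so this is an assumed reading in which each word pair also appears with its two words swapped on both the source and target sides:

```python
# Sketch of word-order-reversal augmentation for word-pair data
# (our assumed reading of how the R2 set is built): every
# (source, target) word pair also appears with its words swapped.
def reverse_pairs(dataset):
    """dataset: list of (source, target) two-word strings."""
    augmented = []
    for src, tgt in dataset:
        augmented.append((src, tgt))
        s1, s2 = src.split(" ", 1)
        t1, t2 = tgt.split(" ", 1)
        augmented.append((f"{s2} {s1}", f"{t2} {t1}"))
    return augmented

pairs = reverse_pairs([("Lech Walesa", "Lèch Valènsa")])
```

Under this reading, appending the reversed pairs to W2 yields W2+R2 and doubles the number of word-pair training instances.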
3.2. Detailed Error Analysis
Since the training datasets were generated via a semi-automated data processing pipeline, we conducted a more in-depth investigation into the nature of transcription errors. The term “error” can be misleading in this context, as discrepancies between the reference transcription (in the test set) and the model’s prediction are not necessarily genuine mistakes—particularly when the reference itself is incorrect.
To explore this further, we selected the 100 most frequent unique transcription mismatches (representing 9.4% of the total error mass) made by the best-performing model (a Transformer trained on the W2+R2 dataset) and attempted to systematically categorize them.
First, we manually reviewed and corrected the reference transcriptions where necessary. These 100 adjusted entries form what we refer to as the golden standard. Next, we categorized the observed mismatches—both between the original references and the golden standard, and between the model predictions and the golden standard—into four levels of severity, as shown in Table 7.
Table 7.
Categories of transcription errors.
The distribution of these mismatches is captured in Table 8, which cross-tabulates the reference vs. predicted errors by category:
Table 8.
Cross-tabulation of reference and prediction mismatches with respect to the golden standard. False errors (green, 13%) occur when the model’s prediction matches the golden standard. In cases where both the prediction and the reference differ from the golden standard, the prediction may be better (light green, 4%), comparable (27.3%), or worse (light red, 9.5%) than the reference, according to mismatch severity. True errors (red, 44%) occur when the reference matches the golden standard while the prediction does not.
Assuming this sample is representative of the full dataset, the training and test data still contain a considerable amount of noise: in 53.9% of test instances, the reference transcription deviates from the golden standard, even though only 11.5% of instances show perceptually significant mismatches. This suggests that the pattern-filtering techniques used in the preprocessing pipeline (described in Section 2.2.1) are effective at reducing severe errors but allow smaller inconsistencies to persist.
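A cross-tabulation like Table 8 can be reproduced from per-instance severity labels. Weighting each cell by occurrence frequency is our assumption, based on the shares of “error mass” reported above:

```python
from collections import Counter

# Sketch of the cross-tabulation behind Table 8. Each inspected
# instance carries a reference-vs-gold and a prediction-vs-gold
# severity label (0/s/m/l, as in Table 7 and Appendix B).
def cross_tabulate(instances):
    """instances: (occurrences, ref_category, pred_category) triples.
    Returns each (ref, pred) cell's share of the total error mass."""
    total = sum(occ for occ, _, _ in instances)
    cells = Counter()
    for occ, ref_cat, pred_cat in instances:
        cells[(ref_cat, pred_cat)] += occ
    return {cell: occ / total for cell, occ in cells.items()}

# First three rows of Table A2 used as a toy input
shares = cross_tabulate([(9, "0", "m"), (5, "s", "s"), (60, "0", "s")])
```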
Consequently, the WER and WER-s metrics reported earlier should be interpreted with caution. They likely represent pessimistic estimates of actual model performance.
Analysis of the largest error category—perceptually prominent mismatches—reveals several recurring error types. The most common are “language misidentification” errors, in which the model appears to apply the transcription rules of a more frequent language (typically English) to names from less represented languages. (The term language identification may be imprecise in this context, as the inputs often consist of mixed-language tokens rather than text in a single, clearly defined language.) Examples include:
- Tusk (Polish surname) transcribed as Tãsk, as if it were English.
- Mujica (Uruguayan) rendered as Mužìka instead of Muchìka, influenced by French.
- Sergio (Italian) misrendered as Ser̃chijo instead of Ser̃džijo, as if it were Spanish.
Some other errors result from missing diacritics in the original names. For instance, Walesa becomes Valèsa instead of the correct Valènsa or Valeñsa. The correct source should have been Wałęsa, which carries diacritic marks crucial for accurate transcription.
The full list of the top 100 transcription errors, including their golden standard corrections and assigned categories, is provided in Appendix B.
4. Conclusions
This study represents our initial investigation into the transcription of proper names in Lithuanian. We developed a data processing pipeline to transform raw web-crawled data into a training set suitable for this practical transcription task. For Lithuanian, the pipeline is semi-automatic due to the requirement for manual stress annotation by human labelers. However, for other target languages with more regular stress patterns (e.g., French or Latvian), this step could be omitted, potentially enabling a fully automatic pipeline.
The filtering and normalization techniques applied to the crawled data were effective, reducing noise by 10.7% (based on occurrence frequency) and decreasing the number of pattern types by 24.6%. Residual noise was assessed indirectly through detailed inspection of the 100 most frequent instances in the test set. Approximately 54% of the inspected transcriptions could be improved, though only around 11% required perceptually significant corrections. These findings suggest that the training data quality is sufficient to enable convergence of sequence-to-sequence models, although it remains below the standard typically achieved by human-curated datasets.
Among the models evaluated, the best-performing was a sequence-to-sequence Transformer, which achieved a word error rate (WER) of 19.04%, or 15.66% when accounting for stress compensation, on a word-pair dataset augmented with word order reversals. While this performance still lags behind human transcription accuracy (with an oracle WER of 5.43%), it represents a substantial improvement for text-to-speech (TTS) systems that currently lack any effective loanword adaptation component.
Furthermore, models trained on word pairs outperformed those trained on isolated words. The best-performing single-word model yielded a WER of 25.49% (19.81% stress-compensated), supporting previous findings that sequence-to-sequence models benefit from extended input context. However, to draw more definitive conclusions about the relative effectiveness of model architectures or hyperparameter configurations, further reduction in residual noise in the test data is needed.
We also examined data augmentation strategies, such as word order reversal and combining single-word and word-pair datasets. The results show that word order reversal led to a 5% relative improvement in model performance, while the combination of single-word and word-pair datasets proved less effective.
Several important research and practical questions remain open to future investigation:
- Alternative approaches. Exploring alternative approaches to the practical transcription problem, as introduced in Section 1.1 and Section 1.2, remains a promising direction for future work. The Generative AI approach may yield acceptable performance in handling mixed-language tokens, particularly when pre-trained large language models (LLMs) are fine-tuned or instruction-tuned for this specific task. Within the end-to-end machine learning framework, LLMs could also assist in cleaning and normalizing raw data patterns. Furthermore, LLMs may be employed to extract supplementary features—such as speaker nationality or linguistic background—which could enhance the predictive performance of transcription models.
- Stress modeling. This study incorporated several arbitrary choices regarding stress modeling. Stress placement was integrated into the end-to-end system, with the stress mark encoded as an additional ASCII symbol following the stressed character. An alternative approach would be to treat stress placement as an independent machine learning task. The stress mark could be embedded within an accented character, forming a single output symbol. This strategy would expand the output symbol set rather than increase the length of the output sequence. Additionally, weak supervision techniques [44] or semi-supervised [45] stress modeling approaches could be explored to partially automate this task.
- Experimenting with the dataset. It is important to continue manual cleaning of the test set to establish a fully curated, gold-standard benchmark, thereby increasing confidence in the accuracy estimates. The dataset could also be enriched with longer name sequences (e.g., three or more words) to better reflect the diversity of naming conventions. Additionally, since morphological analysis is already a component of the data processing pipeline, the dataset could be augmented with inflected forms of names not originally present. For example, from the nominative form Charlesas Darwinas → Čarlzas Darvinas, one could derive genitive forms such as Charles’o Darwin’o → Čarlzo Darvino or dative forms such as Charlesui Darwinui → Čarlzui Darvinui.
- Improving data filtering. The current alignment procedure appears overly permissive. Although stricter alignment criteria may require additional linguistic resources, a promising improvement would be to assign language labels to each permitted substitution. This would restrict substitutions to those within the same language set, potentially enhancing data consistency, filtering precision, and overall transcription accuracy.
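The two stress encodings contrasted in the stress-modeling item above can be illustrated with a small conversion sketch. The ASCII stress symbol "`" and the restriction to the grave accent are simplifying assumptions; the actual pipeline and Lithuanian orthography use a richer marker set (acute, grave, tilde):

```python
import unicodedata

# Converting from the "stress symbol after the character" encoding
# (used in this study) to the alternative single-symbol encoding,
# which enlarges the output alphabet instead of the sequence length.
STRESS = "`"  # hypothetical ASCII stress symbol

def to_composed(word):
    """Fold 'e`'-style marking into one precomposed accented character."""
    out = []
    for ch in word:
        if ch == STRESS and out:
            # U+0300 is the combining grave accent
            out[-1] = unicodedata.normalize("NFC", out[-1] + "\u0300")
        else:
            out.append(ch)
    return "".join(out)

print(to_composed("Le`ch"))  # Lèch — same word, one symbol shorter
```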
We plan to explore these avenues in subsequent studies, with the goal of building a more robust and linguistically informed transcription framework.
Author Contributions
Conceptualization, G.R. and T.K.; methodology, D.V.-A.; software, T.K. and D.K.; validation, A.M., D.A., D.K. and G.R.; resources, G.R. and D.K.; writing—original draft preparation, G.R.; writing—review and editing, D.V.-A. and A.M.; supervision, G.R. and T.K.; project administration, T.K. and A.Č. All authors have read and agreed to the published version of the manuscript.
Funding
This project was co-funded by the European Union under the Horizon Europe programme, grant agreement No. 101059903, and by the European Union funds for the 2021–2027 period together with the state budget of the Republic of Lithuania under financial agreement No. 10-042-P-0001.
Data Availability Statement
The original data presented in this study are openly available in the CLARIN-LT repository at https://clarin.vdu.lt/xmlui/handle/20.500.11821/68.
Acknowledgments
The authors would like to express their sincere gratitude to Regina Sabonytė for her valuable linguistic expertise regarding the placement of stress markers in the transcribed personal names.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| WFST | Weighted Finite State Transducer |
| LSTM | Long Short-Term Memory |
| WER | Word Error Rate |
| WER-s | Stress-compensated Word Error Rate |
| TTS | Text-to-Speech |
| G2P | Grapheme-to-Phoneme |
| LangID | Language Identification |
| LLM | Large Language Model |
| IPA | International Phonetic Alphabet |
| GPT | Generative Pre-trained Transformer |
| RNN | Recurrent Neural Network |
| ML | Machine Learning |
| ASCII | American Standard Code for Information Interchange |
Appendix A
Table A1.
Results of ablation experiments using the Transformer model. EEL, EHL, DEL, and DHL denote the dimensionality of the encoder’s (E) and decoder’s (D) embedding (E) and hidden (H) layers, respectively. The table reports the dependency of the estimated word error rate (WER) and stress-adjusted word error rate (WER-s) on different hyperparameter values.
| Task | EEL | EHL | DEL | DHL | Batch Size | Dropout Rate | WER, % | WER-s, % |
|---|---|---|---|---|---|---|---|---|
| | 128 | 512 | 128 | 512 | 256 | 0.1 | 29.13 ± 0.85 | 22.99 ± 0.76 |
| | 128 | 512 | 128 | 512 | 256 | 0.3 | 26.95 ± 0.29 | 21.22 ± 0.18 |
| | 128 | 512 | 128 | 512 | 1024 | 0.1 | 27.88 ± 0.43 | 21.51 ± 0.48 |
| | 128 | 512 | 128 | 512 | 1024 | 0.3 | 28.44 ± 0.29 | 22.33 ± 0.19 |
| | 128 | 512 | 256 | 1024 | 256 | 0.1 | 27.90 ± 0.19 | 22.42 ± 0.19 |
| | 128 | 512 | 256 | 1024 | 256 | 0.3 | 27.50 ± 0.39 | 21.96 ± 0.35 |
| | 128 | 512 | 256 | 1024 | 1024 | 0.1 | 28.65 ± 0.14 | 22.46 ± 0.20 |
| W1 | 128 | 512 | 256 | 1024 | 1024 | 0.3 | 25.49 ± 0.25 | 19.81 ± 0.22 |
| | 256 | 1024 | 128 | 512 | 256 | 0.1 | 27.98 ± 0.59 | 21.95 ± 0.45 |
| | 256 | 1024 | 128 | 512 | 256 | 0.3 | 27.85 ± 0.21 | 21.71 ± 0.19 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.1 | 29.35 ± 0.52 | 23.20 ± 0.54 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.3 | 27.73 ± 0.61 | 21.32 ± 0.54 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.1 | 28.26 ± 0.31 | 21.69 ± 0.28 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.3 | 26.83 ± 0.43 | 21.16 ± 0.44 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.1 | 28.84 ± 0.26 | 22.17 ± 0.26 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.3 | 27.22 ± 0.55 | 21.27 ± 0.59 |
| | 128 | 512 | 128 | 512 | 256 | 0.1 | 21.58 ± 0.36 | 17.31 ± 0.29 |
| | 128 | 512 | 128 | 512 | 256 | 0.3 | 21.31 ± 0.14 | 16.39 ± 0.09 |
| | 128 | 512 | 128 | 512 | 1024 | 0.1 | 21.70 ± 0.39 | 16.89 ± 0.32 |
| | 128 | 512 | 128 | 512 | 1024 | 0.3 | 21.88 ± 0.32 | 17.10 ± 0.17 |
| | 128 | 512 | 256 | 1024 | 256 | 0.1 | 20.59 ± 0.24 | 16.77 ± 0.18 |
| | 128 | 512 | 256 | 1024 | 256 | 0.3 | 19.98 ± 0.19 | 16.19 ± 0.14 |
| | 128 | 512 | 256 | 1024 | 1024 | 0.1 | 21.12 ± 0.25 | 17.10 ± 0.22 |
| W2 | 128 | 512 | 256 | 1024 | 1024 | 0.3 | 20.81 ± 0.30 | 16.47 ± 0.22 |
| | 256 | 1024 | 128 | 512 | 256 | 0.1 | 21.60 ± 0.18 | 17.56 ± 0.13 |
| | 256 | 1024 | 128 | 512 | 256 | 0.3 | 20.65 ± 0.40 | 16.14 ± 0.25 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.1 | 21.53 ± 0.25 | 17.49 ± 0.27 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.3 | 21.46 ± 0.07 | 16.90 ± 0.09 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.1 | 20.83 ± 0.16 | 17.11 ± 0.17 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.3 | 20.47 ± 0.27 | 16.90 ± 0.24 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.1 | 21.48 ± 0.29 | 17.34 ± 0.24 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.3 | 20.60 ± 0.20 | 16.41 ± 0.09 |
| | 128 | 512 | 128 | 512 | 256 | 0.1 | 20.01 ± 0.16 | 16.22 ± 0.14 |
| | 128 | 512 | 128 | 512 | 256 | 0.3 | 20.78 ± 0.14 | 16.44 ± 0.09 |
| | 128 | 512 | 128 | 512 | 1024 | 0.1 | 20.26 ± 0.13 | 16.52 ± 0.22 |
| | 128 | 512 | 128 | 512 | 1024 | 0.3 | 20.78 ± 0.09 | 16.39 ± 0.06 |
| | 128 | 512 | 256 | 1024 | 256 | 0.1 | 20.06 ± 0.14 | 16.48 ± 0.13 |
| | 128 | 512 | 256 | 1024 | 256 | 0.3 | 19.04 ± 0.18 | 15.66 ± 0.12 |
| | 128 | 512 | 256 | 1024 | 1024 | 0.1 | 20.19 ± 0.27 | 16.49 ± 0.24 |
| W2+R2 | 128 | 512 | 256 | 1024 | 1024 | 0.3 | 19.54 ± 0.20 | 16.15 ± 0.08 |
| | 256 | 1024 | 128 | 512 | 256 | 0.1 | 20.20 ± 0.24 | 16.69 ± 0.21 |
| | 256 | 1024 | 128 | 512 | 256 | 0.3 | 20.02 ± 0.13 | 16.12 ± 0.13 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.1 | 20.49 ± 0.32 | 16.68 ± 0.32 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.3 | 20.35 ± 0.18 | 16.25 ± 0.14 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.1 | 19.73 ± 0.19 | 16.24 ± 0.16 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.3 | 19.57 ± 0.22 | 16.40 ± 0.19 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.1 | 20.28 ± 0.30 | 16.55 ± 0.25 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.3 | 19.96 ± 0.08 | 16.43 ± 0.10 |
| | 128 | 512 | 128 | 512 | 256 | 0.1 | 19.83 ± 0.21 | 16.25 ± 0.23 |
| | 128 | 512 | 128 | 512 | 256 | 0.3 | 20.92 ± 0.06 | 16.53 ± 0.02 |
| | 128 | 512 | 128 | 512 | 1024 | 0.1 | 20.07 ± 0.11 | 16.42 ± 0.08 |
| | 128 | 512 | 128 | 512 | 1024 | 0.3 | 20.91 ± 0.07 | 16.70 ± 0.06 |
| | 128 | 512 | 256 | 1024 | 256 | 0.1 | 19.68 ± 0.19 | 16.70 ± 0.17 |
| | 128 | 512 | 256 | 1024 | 256 | 0.3 | 19.58 ± 0.05 | 16.02 ± 0.08 |
| | 128 | 512 | 256 | 1024 | 1024 | 0.1 | 19.30 ± 0.17 | 16.16 ± 0.17 |
| W2+R2+W1 | 128 | 512 | 256 | 1024 | 1024 | 0.3 | 19.86 ± 0.12 | 16.26 ± 0.09 |
| | 256 | 1024 | 128 | 512 | 256 | 0.1 | 19.95 ± 0.23 | 16.28 ± 0.08 |
| | 256 | 1024 | 128 | 512 | 256 | 0.3 | 20.12 ± 0.16 | 16.25 ± 0.06 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.1 | 20.14 ± 0.22 | 16.38 ± 0.25 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.3 | 20.55 ± 0.12 | 16.58 ± 0.13 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.1 | 20.16 ± 0.11 | 16.77 ± 0.07 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.3 | 19.51 ± 0.15 | 16.30 ± 0.12 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.1 | 19.96 ± 0.11 | 16.62 ± 0.08 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.3 | 19.51 ± 0.17 | 16.26 ± 0.15 |
Appendix B
Table A2.
The list of the 100 most frequent mismatches between the transcription reference and prediction, along with their categorization based on the gold standard. The symbols 0, s, m, and l denote a match, small mismatch, medium mismatch, and large mismatch, respectively.
| Occ. | Original | Reference | Prediction | Golden Standard | Ref. Cat | Pred. Cat. |
|---|---|---|---|---|---|---|
| 9 | Alexį Tsiprą | Alèksį Cìprą | Ãleksį Cìprą | | 0 | m |
| 5 | Alexio Tsipro | Alèksio Tsìpro | Alèksijo Cìpro | Alèksio Cìpro | s | s |
| 60 | Alexio Tsipro | Alèksio Cìpro | Alèksijo Cìpro | | 0 | s |
| 19 | Alexis Tsipras | Alèksis Tsìpras | Alèksis Cìpras | | s | 0 |
| 5 | Ali Zeidanas | Ãli Zeidãnas | Alì Zeidãnas | Ãli Zeĩdenas | m | m |
| 6 | Amy Poehler | Eĩmi Poũler | Eĩmi Pèler | | 0 | l |
| 6 | Andrea Bocelli | Andrèa Bočèli | Andrė̃ja Bočèli | | 0 | s |
| 14 | Anna Wintour | Ãna Viñtur | Ãna Vintùr | | 0 | s |
| 11 | Bambangas Soelistyo | Bambángas Sulìstjo | Bambángas Soelìsto | Bembéngas Sulìstjo | m | l |
| 10 | Blake Lively | Bleĩk Láivli | Bleĩk Lìvli | | 0 | l |
| 6 | Brendan Gilligan | Bréndan Gìligan | Bréndan Džìligan | | 0 | l |
| 9 | Buzz Aldrin | Bãz Òldrin | Bùz Áldrin | Bàz Òldrin | s | l |
| 5 | Buzzas Aldrinas | Bãzas Òldrinas | Bùzas Al̃drinas | Bàzas Òldrinas | s | l |
| 14 | Carlos Delfino | Kárlos Delfìno | Kárlos Del̃fino | | 0 | m |
| 9 | Caroline Kennedy | Kèrolain Kènedi | Karolìn Kènedi | | 0 | l |
| 6 | Carrie Prejean | Kèri Preidžán | Kèri Prežán | | m | 0 |
| 6 | Chris Cassidy | Krìs Kãsidi | Krìs Kẽsidi | | m | 0 |
| 10 | Cindy Crawford | Siñdi Kráuford | Siñdi Kroũford | Siñdi Krõford | s | s |
| 13 | Cindy Crawford | Siñdi Kròford | Siñdi Kroũford | Siñdi Krõford | s | s |
| 8 | David Lynch | Deĩvid Liñč | Deĩvid Lỹnč | | 0 | s |
| 10 | Dilmai Rousseff | Dìlmai Rùsef | Dil̃mai Rùsef | Dil̃mai Rusèf | s | s |
| 101 | Donald Tusk | Dònald Tùsk | Dònald Tãsk | | 0 | l |
| 5 | Donaldas Tuskas | Dònaldas Tùskas | Dònaldas Tãskas | | 0 | l |
| 5 | Ene Ergma | Èn Èrgma | Èn Er̃gma | Ène Èrgma | l | l |
| 5 | Geir Lundestad | Geĩr Luñdestad | Geĩr Liùndestad | | 0 | s |
| 6 | Yingluck Shinawatra | Jiñglak Činavãta | Iñglak Šinavãtra | Jiñglak Šinavãtra | l | s |
| 15 | Yingluck Shinawatra | Jinglùk Šinavãtra | Iñglak Šinavãtra | Jiñglak Šinavãtra | m | s |
| 5 | Yingluck Shinawatros | Jiñglak Činavãtos | Iñglak Šinavãtos | Jiñglak Šinavãtros | l | l |
| 5 | Yukio Edano | Jùkio Edãno | Jùkijo Edãno | | 0 | 0 |
| 8 | Jean Sibelius | Ján Sibèlijus | Žán Sibèlijus | | l | 0 |
| 5 | Jeanui Monnet | Žãnui Monè | Žãnui Monė̃ | | s | 0 |
| 8 | Jennifer Hudson | Džènifer Hãdson | Džènifer Hàdson | | s | 0 |
| 6 | Jerry Rubin | Džèri Rùbin | Džèri Rãbin | | 0 | l |
| 5 | Jiroemonas Kimura | Džiroemònas Kimùra | Žirumònas Kimùra | | 0 | l |
| 8 | Joakim Noah | Žoakìm Nòa | Joakìm Nòa | Džoũakim Noũa | l | l |
| 8 | Johnas Kirby | Džònas Kir̃bi | Džònas Ker̃bi | | s | 0 |
| 10 | Johno Kirby | Džòno Ker̃bi | Džòno Kir̃bi | | 0 | s |
| 14 | Jose Mujica | Chosė̃ Muchìka | Chosė̃ Mužìka | | 0 | l |
| 6 | Josephas Muscatas | Džòzefas Muskãtas | Džòzefas Maskãtas | Džoũzefas Maskãtas | m | s |
| 5 | Juan Carlos | Chuán Kárl | Chuán Kárlos | | l | 0 |
| 6 | Kei Nishikori | Kèi Nišikòri | Keĩ Nišikòri | | 0 | 0 |
| 8 | Kemalis Kilicdaroglu | Kemãlis Kiličdaròhlu | Kemãlis Kilikdaròhlu | | 0 | l |
| 5 | Kenneth Campbell | Kènet Kémbel | Kènet Kémpbel | | 0 | s |
| 6 | Kerem Gonlum | Kerèm Geñlium | Kerèm Gònlam | | 0 | l |
| 17 | Kianoushas Jahanpouras | Ki-ãnušas Džahanpū̃ras | Ki-ãnušas Džahanpùras | Ki-anùšas Džahanpū̃ras | s | m |
| 10 | Konrad Adenauer | Kònrad Ãdenauer | Kònrad Adenáuer | | 0 | s |
| 5 | Kurt Vonnegut | Kùrt Vònegut | Kùrt Fònegut | Kùrt Vãnegat | m | l |
| 12 | Lance Stephenson | Leñs Stèfenson | Leñs Stìvenson | Leñs Stỹvenson | l | s |
| 14 | Laurent Gbagbo | Lòren Gbãgbo | Lorán Gbãgbo | Lorán Gbagbò | s | s |
| 9 | Laurent’as Fabiusas | Lorãnas Fãbijusas | Lorãnas Fabiùsas | | s | 0 |
| 21 | Lech Walesa | Lèch Valènsa | Lèch Valèsa | | 0 | l |
| 5 | Lechą Walesą | Lèchą Valènsą | Lèchą Valèsą | | 0 | l |
| 30 | Lechas Walesa | Lèchas Valènsa | Lèchas Valèsa | | 0 | l |
| 10 | Lecho Walesos | Lècho Valènsos | Lècho Valèsos | | 0 | l |
| 6 | Mariah Carey | Marãja Kèri | Marìja Kèri | Merãja Kèri | s | l |
| 18 | Mariah Carey | Merãja Kèri | Marìja Kèri | | 0 | l |
| 12 | Marie Trintignant | Marì Trentinján | Marỹ Trintinján | Marỹ Trentinján | s | s |
| 6 | Marisol Touraine | Marizòl Tureñ | Marisòl Turèn | | m | 0 |
| 6 | Marissa Mayer | Marìsa Mèjer | Marìsa Mãjer | | s | 0 |
| 10 | Marlon Brando | Mar̃lon Brándo | Mar̃lon Breñdo | Márlon Bréndou | s | s |
| 11 | Michel Hazanavicius | Mišèl Hazanãvičius | Mišèl Azanãvičius | | s | 0 |
| 6 | Milton Friedman | Mil̃ton Friẽdman | Mil̃ton Frìdman | Mil̃ton Frỹdmen | l | m |
| 6 | Milton Friedman | Mil̃ton Frỹdman | Mil̃ton Frìdman | Mil̃ton Frỹdmen | s | m |
| 16 | Monta Ellis | Mònta Èlis | Mònta Ẽlis | Mòntei Èlis | s | m |
| 6 | Navi Pillay | Nãvi Piláj | Nãvi Pilái | Nãvi Pìlei | m | m |
| 23 | Nene Hilario | Nèn Ilãrijo | Nèn Hìlario | Nenè Ilãriju | l | l |
| 10 | Nicki Minaj | Nìki Minãdž | Nìki Minãj | Nìki Minãž | l | l |
| 65 | Oprah Winfrey | Òpra Vìnfri | Òpra Viñfri | Òupra Viñfri | m | m |
| 7 | Peter Hess | Pẽter Hès | Pìter Hès | | m | 0 |
| 6 | Peteris Altmaieris | Pė̃teris Áltmajeris | Pė̃teris Altmãjeris | | 0 | s |
| 6 | Peteris Lerneris | Pìteris Ler̃neris | Pė̃teris Ler̃neris | | 0 | l |
| 7 | Peteris Szijjarto | Pė̃teris Sìjarto | Pė̃teris Šidžárto | | 0 | l |
| 6 | Pol Pot | Pòl Pòt | Põl Pòt | | 0 | s |
| 7 | Prabowo Subianto | Prãbovo Subi-ánto | Prabòvo Subi-ánto | | s | 0 |
| 5 | Prosper Merimee | Pròsper Merìm | Pròsper Merìmi | Prospèr Merimė̃ | l | l |
| 10 | Raffaele Sollecito | Rafaèl Solečìto | Rafaèl Solesìto | Rafaèle Solèčito | m | l |
| 6 | Ralphas Fiennesas | Rálfas Fáinsas | Rálfas Fi-ènesas | Reĩfas Fáinzas | l | l |
| 10 | Ryan Toolson | Raján Tū̃lson | Raján Tùlson | Rãjan Tū̃lson | m | m |
| 18 | Sabine Kehm | Sabìn Kė̃m | Sabìn Kèm | Zabỹne Kė̃m | l | l |
| 6 | Sakellie Daniels | Sakelì Dẽni-els | Sakelì Dãni-els | Sakèli Dẽni-els | s | s |
| 5 | Salma Hayek | Sèlma Hãjek | Sálma Hãjek | | 0 | m |
| 6 | Salvador Allende | Salvadòr Aljènd | Salvadòr Alènd | Salvadòr Aljènde | l | l |
| 4 | Salvadoras Allende | Salvadòras Aljènd | Salvadòras Alènd | Salvadòras Aljènde | l | l |
| 5 | Samantha Murray | Samánta Miurė̃j | Samánta Mùrėj | | 0 | m |
| 19 | Sergio Mattarella | Ser̃džijo Matarèla | Ser̃chijo Matarèla | | 0 | l |
| 4 | Shuji Nakamura | Šiùdži Nakamùra | Šùdži Nakamùra | | s | 0 |
| 6 | Syd Barrett | Sìd Bãret | Sáid Bãret | Sìd Bẽret | s | l |
| 9 | Silvio Berlusconi | Sìlvijo Berluskòni | Sil̃vijo Berluskòni | | 0 | 0 |
| 5 | Simon Frekley | Sáimon Frìkli | Sáimon Frèkli | | s | 0 |
| 4 | Stephenas Mullas | Stìvenas Mãlas | Stìvenas Màlas | Stỹvenas Màlas | s | s |
| 9 | Steve Wozniak | Stìv Vòzniak | Stỹv Vòzniak | | s | 0 |
| 7 | Steven Theede | Stỹven Tỹd | Stìven Tìd | | 0 | s |
| 4 | Taneras Yildizas | Tãneras Jildìzas | Tanèras Jildìzas | | 0 | s |
| 14 | Thomas Hobbes | Tòmas Hòbs | Tòmas Hòbes | Tòmas Hòbz | s | l |
| 19 | Timothy Geithner | Tìmoti Gáitner | Tìmoti Geĩtner | | 0 | l |
| 13 | Valerie Trierweiler | Valerì Trìjerveler | Valerì Trỹrvailer | Valerỹ Trijervailèr | m | m |
| 28 | Valerie Trierweiler | Valerì Trirveĩler | Valerì Trỹrvailer | Valerỹ Trijervailèr | m | m |
| 4 | Vitaly Kamluk | Vitãli Kamliùk | Vitãli Kamlùk | | 0 | 0 |
| 9 | Woodrow Wilson | Vùdrov Vìlson | Vùdrou Vìlson | | s | 0 |
| 5 | Woodrow Wilsono | Vùdrau Vìlsono | Vùdrou Vìlsono | | s | 0 |
References
- McArthur, T. Transliteration. In Concise Oxford Companion to the English Language; Oxford University Press: Oxford, UK, 2018; Available online: https://www.encyclopedia.com/humanities/encyclopedias-almanacs-transcripts-and-maps/transliteration (accessed on 7 May 2025).
- Superanskaja, A.V. Teoreticheskie Osnovy Prakticheskoj Transkripcii [Theoretical Foundations of Practical Transcription], 2nd ed.; LENAND: Moscow, Russia, 2018; pp. 10–40. Available online: https://archive.org/details/raw-..-2018/page/1/mode/2up (accessed on 7 May 2025). (In Russian)
- Lui, M.; Baldwin, T. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Republic of Korea, 9–11 July 2012; pp. 25–30. [Google Scholar]
- Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv 2017, arXiv:1607.01759. Available online: https://arxiv.org/abs/1607.01759 (accessed on 7 May 2025).
- Google. Compact Language Detector v3 (CLD3). Available online: https://github.com/google/cld3 (accessed on 7 May 2025).
- Papariello, L. XLM-Roberta-Base Language Detection. Hugging Face. 2021. Available online: https://huggingface.co/papluca/xlm-roberta-base-language-detection (accessed on 7 May 2025).
- Apple. Language Identification from Very Short Strings. Apple Machine Learning Research. 2019. Available online: https://machinelearning.apple.com/research/language-identification-from-very-short-strings (accessed on 7 May 2025).
- Toftrup, M.; Sørensen, S.A.; Ciosici, M.R.; Assent, I. A reproduction of Apple’s bi-directional LSTM models for language identification. arXiv 2021, arXiv:2102.06282. Available online: https://arxiv.org/abs/2102.06282 (accessed on 24 April 2025).
- Moillic, J.; Ismail Fawaz, H. Language Identification for Very Short Texts: A review. Medium. 2022. Available online: https://medium.com/besedo-engineering/language-identification-for-very-short-texts-a-review-c9f2756773ad (accessed on 7 May 2025).
- Kostelac, M. Comparison of Language Identification Models. ModelPredict. 2021. Available online: https://modelpredict.com/language-identification-survey (accessed on 7 May 2025).
- International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet; Cambridge University Press: Cambridge, UK, 1999. [Google Scholar] [CrossRef]
- OpenAI. GPT-4 Technical Report. 2023. Available online: https://arxiv.org/pdf/2303.08774 (accessed on 24 April 2025).
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. Available online: https://arxiv.org/abs/2109.01652 (accessed on 7 May 2025).
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. Available online: https://arxiv.org/abs/2203.02155 (accessed on 7 May 2025).
- Zhang, S.; Liang, Y.; Shin, R.; Chen, M.; Du, Y.; Li, X.; Ram, A.; Zhang, Y.; Ma, T.; Finn, C. Instruction tuning for large language models: A survey. arXiv 2023, arXiv:2308.10792. Available online: https://arxiv.org/abs/2308.10792 (accessed on 7 May 2025).
- Ainsworth, W. A system for converting English text into speech. IEEE Trans. Audio Electroacoust. 1973, 21, 288–290. [Google Scholar] [CrossRef]
- Elovitz, H.; Johnson, R.; McHugh, A.; Shore, J. Letter-to-sound rules for automatic translation of English text to phonetics. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 446–459. [Google Scholar] [CrossRef]
- Divay, M.; Vitale, A.J. Algorithms for grapheme-phoneme translation for English and French: Applications for database searches and speech synthesis. Comput. Linguist. 1997, 23, 495–523. [Google Scholar]
- Damper, R.I.; Eastmond, J.F. A comparison of letter-to-sound conversion techniques for English text-to-speech synthesis. Comput. Speech Lang. 1997, 11, 33–73.
- Finch, A.; Sumita, E. Phrase-based machine transliteration. In Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation, Hyderabad, India, 11 January 2008.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the NeurIPS 2014, Montreal, Canada, 8–11 December 2014.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
- Gehring, J.; Auli, M.; Grangier, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the ICML 2017, Sydney, Australia, 6–11 August 2017.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NeurIPS 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
- Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351.
- Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; Jégou, H. Unsupervised cross-lingual representation learning at scale. In Proceedings of the ACL 2020, Online, 5–10 July 2020; pp. 8440–8451. Available online: https://aclanthology.org/2020.acl-main.747/ (accessed on 7 May 2025).
- Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421.
- Cotterell, R.; Kirov, C.; Sylak-Glassman, J.; Walther, G.; Vylomova, E.; McCarthy, A.D.; Kann, K.; Mielke, S.J.; Nicolai, G.; Silfverberg, M.; et al. The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task, Brussels, Belgium, 31 October–1 November 2018; pp. 1–27.
- Wu, S.; Cotterell, R.; Hulden, M. Applying the transformer to character-level transduction. arXiv 2020, arXiv:2005.10213.
- Gorman, K.; Ashby, L.F.E.; Goyzueta, A.; McCarthy, A.; Wu, S.; You, D. The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Online, 10 July 2020; pp. 40–50.
- Raškinis, G. Transliteration List of Foreign Person Names into Lithuanian v.1; CLARIN-LT: Kaunas, Lithuania, 2025; Available online: http://hdl.handle.net/20.500.11821/68 (accessed on 7 May 2025).
- Norkevičius, G.; Raškinis, G.; Kazlauskienė, A. Knowledge-based grapheme-to-phoneme conversion of Lithuanian words. In Proceedings of the SPECOM 2005, 10th International Conference Speech and Computer, Patras, Greece, 17–19 October 2005; pp. 235–238.
- Kazlauskienė, A.; Raškinis, G.; Vaičiūnas, A. Automatinis Lietuvių Kalbos žodžių Skiemenavimas, Kirčiavimas, Transkribavimas [Automatic Syllabification, Stress Assignment and Phonetic Transcription of Lithuanian Words]; Vytautas Magnus University: Kaunas, Lithuania, 2010; Available online: https://hdl.handle.net/20.500.12259/254 (accessed on 15 April 2025). (In Lithuanian)
- Kirčiuoklis—A Tool for Placing Stress Marks on Lithuanian Words. Available online: https://kalbu.vdu.lt/mokymosi-priemones/kirciuoklis/ (accessed on 24 April 2025).
- Novak, J.R.; Minematsu, N.; Hirose, K. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Nat. Lang. Eng. 2016, 22, 907–938.
- Taylor, P. Hidden Markov models for grapheme to phoneme conversion. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 1973–1976.
- Lee, J.L.; Ashby, L.F.E.; Garza, M.E.; Lee-Sikka, Y.; Miller, S.; Wong, A.; McCarthy, A.D.; Gorman, K. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4216–4221.
- Viterbi, A.J. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269.
- Roark, B.; Sproat, R.; Allauzen, C.; Riley, M.; Sorensen, J.; Tai, T. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Republic of Korea, 9–11 July 2012; pp. 61–66.
- Allauzen, C.; Riley, M.; Schalkwyk, J.; Skut, W.; Mohri, M. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the CIAA, Prague, Czech Republic, 16–18 July 2007; pp. 11–23.
- Gorman, K. Pynini: A Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, Berlin, Germany, 12 August 2016; pp. 75–80.
- Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. Fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, MN, USA, 2–7 June 2019; pp. 48–53.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
- Song, H.; Kim, M.; Park, D.; Shin, J. Learning from noisy labels with deep neural networks: A survey. arXiv 2020, arXiv:2007.08199.
- Sohn, K.; Berthelot, D.; Li, C.; Zhang, Z.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Zhang, H.; Raffel, C. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv 2020, arXiv:2001.07685.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).