Article

A Semi-Automatic Framework for Practical Transcription of Foreign Person Names in Lithuanian

by Gailius Raškinis 1,2, Darius Amilevičius 1,2, Danguolė Kalinauskaitė 1, Artūras Mickus 1, Daiva Vitkutė-Adžgauskienė 1, Antanas Čenys 3 and Tomas Krilavičius 1,*
1 Faculty of Informatics, Vytautas Magnus University, Universiteto str. 10–202, Kaunas District, 53361 Akademija, Lithuania
2 Institute of Digital Resources and Interdisciplinary Research, Vytautas Magnus University, S. Daukanto str. 27, 44249 Kaunas, Lithuania
3 Department of Information Systems, Vilnius Gediminas Technical University, 10223 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(13), 2107; https://doi.org/10.3390/math13132107
Submission received: 8 May 2025 / Revised: 16 June 2025 / Accepted: 23 June 2025 / Published: 27 June 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

We present a semi-automatic framework for transcribing foreign personal names into Lithuanian, aimed at reducing pronunciation errors in text-to-speech systems. Focusing on noisy, web-crawled data, the pipeline combines rule-based filtering, morphological normalization, and manual stress annotation—the only non-automated step—to generate training data for character-level transcription models. We evaluate three approaches: a weighted finite-state transducer (WFST), an LSTM-based sequence-to-sequence model with attention, and a Transformer model optimized for character transduction. Results show that word-pair models outperform single-word models, with the Transformer achieving the best performance (19.04% WER) on a cleaned and augmented dataset. Data augmentation via word order reversal proved effective, while combining single-word and word-pair training offered limited gains. Despite filtering, residual noise persists, with 54% of outputs showing some error, though only 11% were perceptually significant.

1. Introduction

Monolingual text-to-speech (TTS) systems often face difficulties in accurately pronouncing text segments originating from languages other than the system’s target language. Personal names, geographical names, organizations, and other foreign-named entities are major sources of pronunciation errors in TTS. Since foreign words follow different grapheme-to-phoneme (G2P) conversion rules than those of the target language, they must first be identified within the text. Their orthography should then be modified in a way that enables the embedded G2P module to approximate their pronunciation using the phonemic inventory of the target language. This process is referred to as practical transcription [1,2].
This paper focuses on the process of building a data processing pipeline that takes data found on the web as an input and results in the computational models performing automatic transcription of personal names of foreign origin into Lithuanian, which serves as the target language. Several challenges complicate this task: (1) the variability in how initial morphological adaptation of foreign names is already performed in texts, (2) the need to infer both the position and the type of word stress, and (3) the requirement to consider broader linguistic context—extending beyond individual words—during decision-making.
Foreign personal names often undergo primary morphological adaptation to conform to Lithuanian grammatical structures. This is typically achieved by modifying the name and/or appending Lithuanian inflectional suffixes. The State Commission of the Lithuanian Language (the body that issues normative recommendations for the usage of Lithuanian in public; see https://vlkk.lt/vlkk-nutarimai/protokoliniai-nutarimai/rekomendacija-del-autentisku-asmenvardziu-gramatinimo, accessed on 25 November 2024) prescribes different adaptation strategies depending on factors such as the source language, the gender of the person, and the morphological structure of the name. Consequently, texts processed by TTS systems may contain unaltered foreign names (e.g., Silverstone), names appended with Lithuanian suffixes (e.g., Alastairis, Bruce’as), or names transformed by fusion with Lithuanian morphological endings (e.g., Alicios, the genitive form of Alicia). Names of the latter two types are no longer tokens in the original language but mixed-language tokens. Despite these modifications, we refer to all such names as “original names,” as they appear in their raw form from the perspective of the TTS system.
Word stress is a critical perceptual feature in spoken Lithuanian, as it enhances the prominence of the stressed syllable. Lithuanian distinguishes between acute and circumflex accents, marked with specific diacritics over the stressed sound. However, such stress markings are generally absent from standard orthography and are only used in specialized texts where correct pronunciation is essential. This presents an additional challenge for transcription models, which must not only rewrite the text to align with Lithuanian G2P rules but also predict the correct stress position and accent type.
A further complication arises from the fact that transcription is often ambiguous or exhibits one-to-many mappings, particularly when the origin language of a name is unknown. For example, the name Charles may be transcribed as Čárlzas if the individual is English, or Šárlis if French. Similarly, Michael may become Máiklas (English) or Michaèlis (German). One approach to resolving such ambiguities is to incorporate a broader linguistic context—potentially involving multiple adjacent foreign words—into the transcription process.
Several approaches can be considered to address the multilingual proper name transcription problem, which we categorize as follows: (1) the multi-stage approach, (2) the Generative AI approach, and (3) the end-to-end machine learning approach.

1.1. Multi-Stage Approach

In the multi-stage approach, the transcription task is divided into several sequential steps: (i) identification of the source language of loanwords, (ii) text rewriting based on the identified source language, and (iii) stress placement. This approach has certain advantages. First, the initial step can be framed as a variant of the well-established language identification (LangID) task, for which numerous methods exist, including Naïve Bayes trained on n-grams [3], using sub-word features and vector quantization [4], neural networks [5], transformers [6], and LSTM-based models [7,8]. Second, the rewriting step can potentially achieve high accuracy if it leverages a comprehensive set of deterministic rules for adapting foreign names in the target language.
However, this approach also has significant limitations:
  • Proper names are very short text segments, and LangID performance is known to degrade on short text segments [9,10].
  • The TTS system needs to identify and to transcribe mixed-language tokens. This poses a challenge, as no off-the-shelf LangID tools are designed to handle such inputs. Addressing this requires developing a specialized LangID system, for which neither pre-defined rules nor training data currently exist.
  • For many languages, including Lithuanian, no complete and explicit rule set exists for adapting foreign names—only broad linguistic guidelines are available. Constructing such a rule set would require substantial linguistic expertise and manual effort.
  • Error propagation is inherent to multi-stage approaches: mistakes in earlier stages (e.g., misidentified language) are likely to impact subsequent steps, thereby degrading the overall system performance.

1.2. Generative AI Approach

In the Generative AI approach, the transcription task is divided into two stages: (i) querying a large language model (LLM) to generate the International Phonetic Alphabet [11] (IPA) transcription of a given personal name, and (ii) converting the resulting IPA representation into Lithuanian orthography using a specialized IPA-to-grapheme converter. The main advantages of this approach include the wide availability of pre-trained LLMs and the relative ease of developing an IPA-to-grapheme converter, which can be based on accurate and deterministic phoneme-to-letter conversion rules.
Table 1 presents several examples of original names and their corresponding IPA transcriptions, as generated by GPT-4o [12], in response to queries such as “What is the IPA pronunciation of [person’s name]?” and “Please transcribe the following name in IPA symbols: [person’s name].”
As seen in rows 1–8, the LLM generally provides accurate transcriptions for unmodified personal names, regardless of the language of origin. However, for mixed-language tokens that have already undergone some degree of morphological adaptation, which are common in Lithuanian texts (rows 9 and 10), the model fails to produce correct IPA forms (the acceptable transcriptions of Laurent’as Fabiusas and Ralphas Fiennesas are /lʲɔˈra:nɐs fɐˈbʲʊsɐs/ and /ˈreɪfɐs ˈfaɪnzɐs/, respectively). This limitation indicates that pre-trained LLMs alone may not be sufficient for the transcription task, particularly when handling non-standard forms. Fine-tuning [13] or instruction-tuning LLMs [14,15] could potentially improve their performance on the task in question.

1.3. End-to-End Machine Learning Approach

In the end-to-end machine learning approach, the transcription task is performed in a single step, following a preparatory (offline) phase in which a model is trained on a dataset of input–output pairs—original names and their corresponding transcriptions. The trained model effectively encodes a deterministic rule set that implicitly integrates language identification, text rewriting, and word stress placement.
The primary advantage of this approach is that, once trained, the model can be directly applied to previously unseen inputs with minimal additional processing. However, there are several disadvantages: (i) substantial effort may be required to compile a high-quality, representative, and sufficiently large training dataset; (ii) the learned rules are typically encoded in model parameters (e.g., neural network weights), which are not easily interpretable or modifiable; and (iii) any errors or biases in the training data may be propagated throughout the model’s predictions, with limited opportunities for targeted correction.
While acknowledging the validity of the first two approaches described in Section 1.1 and Section 1.2, we adopt the third approach and formulate transcription as a supervised end-to-end machine learning problem. This choice is based on the assumption that it is feasible to collect, from online sources, a sufficiently large and representative dataset of personal names originating from various source languages, together with their corresponding Lithuanian transcriptions. Our objective is to investigate methods for cleaning and structuring this data, as well as to train and evaluate different models capable of learning the mapping from original to transcribed forms.
This research seeks to address the following key questions:
  • Is it possible to develop a fully automated data processing pipeline that converts raw web-crawled data into a dataset of sufficient quality for training practical transcription models?
  • What level of transcription accuracy can be achieved using such automatically generated data, and how does this accuracy compare to that of human transcribers?
  • To what extent do models trained on word pairs outperform those trained on single-word inputs?
  • How effective are different data augmentation strategies in enhancing the performance of transcription models?
To the best of our knowledge, the task of training end-to-end models for the transcription of foreign words into Lithuanian has not been addressed in previous research.

1.4. Related Work

Early approaches to translation and transcription relied on handcrafted, context-dependent phonological rules informed by linguistic expertise [16,17,18]. While these methods offered transparency and control, they were labor-intensive and lacked robustness across language domains [19]. The shift toward data-driven techniques began with statistical machine translation models such as n-gram and maximum entropy models [20], which leveraged aligned parallel translation corpora. However, these approaches struggled in low-resource settings due to data sparsity.
Modern translation methods primarily utilize sequence-to-sequence (seq2seq) models, which revolutionized the handling of variable-length input and output sequences. The introduction of encoder–decoder architectures based on recurrent neural networks (RNNs) by Sutskever et al. [21] enabled end-to-end learning for machine translation tasks. The subsequent addition of attention mechanisms by Bahdanau et al. [22] significantly improved performance by allowing models to dynamically focus on relevant parts of the input during decoding.
Due to the limitations of RNNs in modeling long-range dependencies, alternative architectures emerged, including convolutional neural networks [23] and self-attention-based models. The Transformer model introduced by Vaswani et al. [24] replaced recurrence with multi-head self-attention, enabling parallelization and achieving state-of-the-art results across various translation tasks. These models also support transfer learning across scripts and languages, with multilingual pretrained encoders such as mBERT and XLM-R proving especially effective [25,26].
Character-level translation, such as the transcription task described in this paper, is still largely dominated by attention-based LSTM seq2seq models [27,28], primarily because the task typically involves monotonic alignments between input and output sequences that share significant character-level similarity. Unlike tasks requiring semantic understanding or long-range dependencies, transcription benefits less from the Transformer’s memory capacity. Nevertheless, Ref. [29] demonstrated that Transformer performance on character-level tasks is highly sensitive to batch size, and that, given sufficiently large batches, Transformers can outperform RNN-based models even in these settings.
In this research, we train seq2seq models—including LSTM-based encoder–decoder architectures [28] and transformers [29], both of which have shown strong results in multilingual grapheme-to-phoneme (G2P) tasks [30]—to solve the transcription problem. We also investigate the effects of data augmentation, which has been shown to improve model accuracy in related tasks. Differently from Ref. [30], our task presents additional challenges: we begin with a much larger and noisier dataset, must handle conflicting transcription labels, and incorporate stress marker prediction, adding complexity to both data preprocessing and model training.
The main contributions of this paper are as follows:
  • We propose a novel semi-automatic data processing pipeline that transforms raw web-crawled data into a training set for practical transcription tasks. This pipeline is demonstrated on Lithuanian—a morphologically rich and challenging language—where it effectively processes, filters, and normalizes data containing inflected and/or mixed-language tokens. The pipeline is potentially fully automatable for languages that do not require modeling of word stress type and location.
  • We show that an end-to-end practical transcription model can be trained using this dataset, despite residual noise in the processed data. Although exact word error rate (WER) estimation is hindered by noisy reference labels in the test set, we report an upper-bound WER of approximately 19%, with the actual performance likely being significantly better—potentially nearly half that value.

2. Materials and Methods

2.1. Data

We hypothesized that data for the transcription task could be collected from online sources. It has been observed that Lithuanian news portals frequently present foreign person names in their original form followed by a transcription (intended pronunciation) in parentheses. This practice helps readers unfamiliar with foreign orthography approximate the correct pronunciation.
To leverage this pattern, we performed web scraping across major Lithuanian news portals, collecting 133,254 text segments (68,167 unique patterns) from public articles spanning the past ten years. The data was extracted using a Perl-style regular expression designed to capture two title-cased words followed by another pair of title-cased words enclosed in parentheses (the raw material is made openly available and can be downloaded from Clarin LT repository at https://clarin.vdu.lt/xmlui/handle/20.500.11821/68, accessed on 7 May 2025).
          [[:upper:]][-\x27[:lower:]]+\s+[[:upper:]][-\x27[:lower:]]+\s+\([[:upper:]][-\x27[:lower:]]+\s+[[:upper:]][-\x27[:lower:]]+\s*\)
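As a rough illustration of how such a pattern could be applied outside Perl, the sketch below approximates the expression in Python. Python's built-in `re` module does not support POSIX bracket classes such as `[[:upper:]]`, so the sketch matches generic letter sequences first and checks title-casing afterwards; this two-step approximation, and the names `WORD`, `is_title_cased`, and `extract_patterns`, are assumptions made for illustration and are not part of the original pipeline.

```python
import re

# A "word" here is a letter followed by letters, hyphens, or apostrophes.
# [^\W\d_] matches any Unicode letter in Python's re module.
WORD = r"[^\W\d_](?:[^\W\d_]|[-'])+"
PATTERN = re.compile(rf"({WORD})\s+({WORD})\s+\(({WORD})\s+({WORD})\s*\)")


def is_title_cased(word):
    """Upper-case first letter, then only lower-case letters, '-' or "'"."""
    head, tail = word[0], word[1:]
    return head.isupper() and all(c.islower() or c in "-'" for c in tail)


def extract_patterns(text):
    """Return (w1, w2, w3, w4) tuples for 'W1 W2 (W3 W4)' candidates."""
    found = []
    for m in PATTERN.finditer(text):
        words = m.groups()
        if all(is_title_cased(w) for w in words):
            found.append(words)
    return found


sample = "Vakar Charles Darwin (Čarlzas Darvinas) rašė laišką."
print(extract_patterns(sample))
```

Because the letter check is looser than the POSIX classes, the sketch can differ from the original expression on edge cases (for example, candidates that start in the middle of a token).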
        
This raw dataset [31], however, proved to be noisy and ambiguous, requiring extensive preprocessing before it could be used for model training. The main sources of noise included:
  • Unintended Matches: The regular expression sometimes captured pairs of proper nouns that were not person names and their transcriptions, but rather unrelated entities, such as person–location, person–actor, person–sports team, or location–location combinations (see Table 2, rows 1–5).
  • Role Inversion: In some cases, the first name pair represented the original name (Table 2, rows 6, 9, 10, 16), while in others, it was the transcription (Table 2, rows 7, 8, 11–15).
  • Word Order Inversion: The word order in one of the name pairs was sometimes reversed relative to the other (see Table 2, row 8).
  • Inflection Mismatch: The original name with concatenated or fused inflectional suffixes could be in a different grammatical case than the transcribed name. For instance, the original name Alexis Tsipras is in the nominative, while the transcription Aleksiui Ciprui is in the dative case (Table 2, row 15). These mismatches were particularly challenging when intertwined with role inversion. For example, the name pair Alberto Alonso (Table 2, row 16) could be interpreted as a non-inflected original form, an adapted original form (e.g., genitive of Albert Alons), or a transcription in the genitive case.
  • Multiple Inflections: A single non-inflected original name could correspond to multiple transcriptions, each in a different grammatical case depending on the context (Table 2, rows 11–14).
  • Inconsistent Labeling and Human Errors: Different transcriptions of the same original name were observed due to the varying linguistic knowledge of human editors (Table 2, rows 17–19). Additionally, spelling errors appeared in both original names (Table 2, row 20) and transcriptions (Table 2, row 21).

2.2. Method

Our transcription system follows a structured data processing pipeline, as illustrated in Figure 1. It begins with web-scraped textual data and ends in computational models capable of automatically transcribing personal names of foreign origin into Lithuanian.
The methodology consists of several sequential steps:
  • Preprocessing and cleaning of web-scraped data, including filtering, normalization, and reordering of raw data patterns.
  • Adding word stress location and type information to the cleaned data.
  • Generation of multiple training sets, based on different configurations and augmentation strategies.
  • Training of machine learning models using a range of architectures and training setups, and evaluation of their accuracy on held-out test data.

2.2.1. Data Preprocessing

Preprocessing was essential to address the various noise sources described in Section 2.1 and to prepare a clean, structured dataset for model training.
The first preprocessing step involved alignment of original and transcribed strings, which served as a filtering mechanism for the raw patterns. This alignment assumed that valid transcriptions share structural and phonetic similarities—specifically, that the two strings should be composed of corresponding characters or character groups.
For instance, Table 2 (rows 1–5) illustrates clearly invalid alignments, where the initial capital letters of the first and second words in each pair do not match the capitals in the parenthetical pair. However, this matching should not be interpreted as strict character identity, since valid transcriptions often include predictable orthographic adaptations (see rows 11, 15, and 17); for instance, matching the substrings Ch, Ts, C, and G to Č, C, K, and Dž, respectively, should be permitted.
To accommodate this, we implemented a dynamic programming alignment procedure called Match(). It identifies correspondences between the original and transcribed forms based on approximately 600 permitted edit operations, including both context-free and context-sensitive substitutions. These operations were non-symmetric: for example, replacing “ault” with “o” is allowed (as in Renault → Reno), but the reverse is not permitted, since such transformations are not linguistically plausible in the Lithuanian context.
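The idea of a dynamic programming alignment over a set of directed substitutions can be sketched as follows. This is a minimal illustration and not the authors' Match() procedure: the function name `match`, the tiny rule table, and the restriction to context-free rules are all assumptions, whereas the real procedure uses roughly 600 operations, some of them context-sensitive.

```python
def match(original, transcribed, rules):
    """Return True if `transcribed` can be derived from `original` by
    copying characters unchanged or applying directed rules (src -> dst)."""
    n, m = len(original), len(transcribed)
    # reachable[i][j]: the first i source characters can be rewritten
    # into the first j target characters.
    reachable = [[False] * (m + 1) for _ in range(n + 1)]
    reachable[0][0] = True
    for i in range(n + 1):
        for j in range(m + 1):
            if not reachable[i][j]:
                continue
            # Identity step: both strings share the next character.
            if i < n and j < m and original[i] == transcribed[j]:
                reachable[i + 1][j + 1] = True
            # Directed substitutions: only src -> dst, never dst -> src.
            for src, dst in rules:
                if original.startswith(src, i) and transcribed.startswith(dst, j):
                    reachable[i + len(src)][j + len(dst)] = True
    return reachable[n][m]


# Hypothetical, lower-cased rule table for illustration only.
RULES = [("ault", "o"), ("ch", "č"), ("w", "v"), ("c", "k")]

print(match("renault", "reno", RULES))  # allowed direction
print(match("reno", "renault", RULES))  # reverse direction is rejected
```

Keeping the rules as ordered (source, target) pairs is what makes the alignment non-symmetric: swapping the two arguments does not swap the rules.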
Alignment was further complicated by morphological variation: both the original and transcribed names could appear in various inflected forms, sometimes with differing grammatical cases and suffixes (e.g., Table 2, rows 11 and 15). To mitigate this, morphological stemming was applied before alignment, reducing both strings to their stemmed forms. Due to possible ambiguities in stemming, a search over the space of candidate stem pairs was necessary.
The simplified pseudo-code below (see Figure 2) illustrates the logic of the NormalizeWordOrder() function. This function takes four word stems: s1, s2, s3, and s4, and infers the roles (original or transcription) and the correct word order based on successful alignments.
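One possible reading of the role and word order inference described above is sketched below. It is an assumption-laden illustration rather than the pseudo-code of Figure 2: the function name `normalize_word_order`, the order in which hypotheses are tried, and the `match` callback (any predicate that accepts an original stem and a transcribed stem) are all hypothetical.

```python
def normalize_word_order(s1, s2, s3, s4, match):
    """Infer which pair is the original and which the transcription.

    Returns ((o1, o2), (t1, t2)) such that match(o1, t1) and match(o2, t2)
    both hold, or None if no hypothesis aligns (the pattern is rejected).
    """
    hypotheses = [
        ((s1, s2), (s3, s4)),  # original first, same word order
        ((s1, s2), (s4, s3)),  # original first, transcription reversed
        ((s3, s4), (s1, s2)),  # transcription first (role inversion)
        ((s3, s4), (s2, s1)),  # role inversion plus reversed word order
    ]
    for (o1, o2), (t1, t2) in hypotheses:
        if match(o1, t1) and match(o2, t2):
            return (o1, o2), (t1, t2)
    return None


# Toy matcher for demonstration: a fixed set of accepted stem pairs.
ACCEPTED = {("charles", "čarlz"), ("darwin", "darvin")}


def toy_match(original, transcribed):
    return (original, transcribed) in ACCEPTED


print(normalize_word_order("darvin", "čarlz", "charles", "darwin", toy_match))
```

Returning the first successful hypothesis is a simplification; when several hypotheses align, additional evidence would be needed to choose between them.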
Pattern filtering based on the alignment between the original text and its transcription helped eliminate incorrect transcription patterns and normalize instances of role and word order inversion. It also contributed to filtering out some patterns containing spelling errors (e.g., rows 20–21 in Table 2). However, many inconsistent transcriptions remained (see Table 3). This was largely due to the aligner having too much flexibility, allowing all types of editing operations to be applied simultaneously—even when those operations were more appropriate for different languages.
The process of morphological analysis, inflectional paradigm inference, and case normalization significantly reduced both inflectional mismatches and the occurrence of multiple transcriptions derived from a single original name—common in Lithuanian due to the presence of inflectional suffixes. This analysis was performed using purpose-built tools that relied on inflectional endings to make decisions.
As a result of this step, the cases listed in Table 2 (rows 11–14) were consolidated into a single normalized form: Charles Darwin (Čarlz Darvin), appearing 23 times in total.
To further minimize the issue of one-to-many transcription mappings, statistical pattern filtering was applied. A simple rule was introduced: transcriptions occurring less frequently than a specified threshold were discarded (see Table 4). This threshold was empirically set to 5% of the total number of occurrences of a given original name. The goal was to strike a balance between removing as much human-induced noise as possible while preserving legitimate transcription variants found across different languages.
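The frequency rule can be illustrated with a short sketch. The function name `filter_rare_transcriptions` and the data layout (a list of (original, transcription) occurrences) are assumptions; only the 5% relative threshold comes from the text.

```python
from collections import Counter, defaultdict


def filter_rare_transcriptions(pairs, threshold=0.05):
    """Drop occurrences whose transcription accounts for less than
    `threshold` of all occurrences of the same original name."""
    counts = defaultdict(Counter)
    for original, transcription in pairs:
        counts[original][transcription] += 1
    kept = []
    for original, transcription in pairs:
        total = sum(counts[original].values())
        if counts[original][transcription] >= threshold * total:
            kept.append((original, transcription))
    return kept


# 96 occurrences of one variant and 4 of another: the rare variant (4%) is dropped.
data = [("Charles", "Čarlz")] * 96 + [("Charles", "Šarl")] * 4
print(len(filter_rare_transcriptions(data)))
```

A relative threshold, unlike an absolute count, keeps a minority variant of a frequent name only when it is used often enough to plausibly reflect a different source language rather than an isolated editing error.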

2.2.2. Gathering Stress Data

In Lithuanian, stress placement and type are determined by the morphological properties of a word form. Although algorithms exist that infer stress location and type from Lithuanian orthography [32,33], and practical implementations are available [34], these techniques are dictionary-based and require that word forms be annotated with their morphological properties and accentuation paradigms. However, both the original and the transcribed word forms in our study fall outside the scope of standard Lithuanian dictionaries.
As a result, stress annotation had to be performed manually, making it the only step in our data processing pipeline that prevented it from being fully automatic. Linguists were instructed to place stress marks on the transcribed forms to best reflect the pronunciation of the original language. To speed up the process, stress was annotated only on isolated, single-word transcriptions in the nominative case. These annotations were then algorithmically propagated to corresponding word-pair patterns and other inflected forms.
This manual annotation process inevitably introduced some noise into the data. The linguists were not proficient in the pronunciation and accentuation of all source languages encountered in the dataset, and in many cases had to choose between multiple plausible options. For example, the name Marielė, which may originate from English (Mariel) or French (Marielle), would require different stress placements—Mãrielė for the English version and Marièlė for the French one. Without context, such distinctions were impossible to make.

2.2.3. Generating Training and Test Sets

The filtered, normalized, and stressed data comprised 118,149 total patterns (51,429 unique ones), each of the form O1 O2 → T1 T2, where O1 and O2 are original personal names—written either with or without primary morphological adaptation—and T1 and T2 are their respective transcriptions.
Two base datasets, W1 and W2, were derived from this normalized data, as shown in Table 5. To evaluate the effectiveness of data augmentation in improving transcription accuracy, two additional augmented datasets—W2+R2 and W2+R2+W1—were created.
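The sketch below shows one plausible way such augmented sets could be assembled. It assumes, based on the abstract's mention of word order reversal and of combining single-word and word-pair training, that R2 contains word pairs with reversed word order and that the W1 component consists of single-word instances; the exact construction is defined in Table 5, and the function names and data layout here are hypothetical.

```python
def reverse_pairs(w2):
    """R2 (assumed): swap the word order on both sides of each pattern."""
    return [((o2, o1), (t2, t1)) for (o1, o2), (t1, t2) in w2]


def single_words(w2):
    """W1-style instances (assumed): split each pattern into two words."""
    singles = []
    for (o1, o2), (t1, t2) in w2:
        singles.append(((o1,), (t1,)))
        singles.append(((o2,), (t2,)))
    return singles


w2 = [(("Charles", "Darwin"), ("Čárlzas", "Dárvinas"))]
w2_r2 = w2 + reverse_pairs(w2)
w2_r2_w1 = w2_r2 + single_words(w2)
print(len(w2), len(w2_r2), len(w2_r2_w1))
```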
We chose to retain conflicting transcription labels, as automatically removing them would have required excluding instances from less frequently used languages. Importantly, the frequency with which a pattern appears in the dataset is likely correlated with the correctness of its transcription. Therefore, we opted not to reduce the datasets to unique patterns. Instead, instances were preserved with their original frequencies, allowing repetition.
Because of these conflicting labels, the maximum achievable accuracy is less than 100%. To estimate an upper bound, we calculated the Oracle accuracy, which assumes perfect prediction for non-conflicting cases and selects the most frequent label in cases of conflict (see Table 5).
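The Oracle accuracy described above can be computed with a few lines of code. The sketch assumes the dataset is a list of (input, label) occurrences with repetitions preserved; the function name `oracle_accuracy` is hypothetical.

```python
from collections import Counter, defaultdict


def oracle_accuracy(pairs):
    """Accuracy of a predictor that always outputs the most frequent
    label observed for each input."""
    counts = defaultdict(Counter)
    for source, label in pairs:
        counts[source][label] += 1
    correct = sum(max(labels.values()) for labels in counts.values())
    return correct / len(pairs)


data = (
    [("Charles", "Čárlzas")] * 3   # majority label for "Charles"
    + [("Charles", "Šárlis")] * 1  # conflicting minority label
    + [("Darwin", "Dárvinas")] * 2
)
print(oracle_accuracy(data))  # (3 + 2) / 6
```

Inputs with a single label contribute all of their occurrences, while conflicting inputs contribute only the occurrences of their majority label, which is why the bound falls below 100% whenever conflicts exist.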
All datasets were randomly split into training (90%) and testing (10%) subsets. To maintain consistency, the split ensured that not only identical patterns but also all inflectional variants of a given pattern remained within the same partition. For instance, the accusative and genitive forms:
Louisą Zamperini → Luĩsą Zamperìni
Louiso Zamperini → Luĩso Zamperìnio
were placed in the same partition of the W2 dataset. However, this constraint was not enforced at the single-word level; thus, data instances like O1 O2 → T1 T2 and O1 O3 → T1 T3 could appear in the same split even if they shared the same sub-component O1 → T1.
The W2 dataset was partitioned prior to augmentation, ensuring that W2, W2+R2, and W2+R2+W1 shared the same patterns in training and test subsets. This allowed for fair comparisons between non-augmented and augmented models, all evaluated using W2’s test set.
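A group-aware split of this kind can be sketched as follows. The sketch assumes that each instance can be mapped to a key shared by all of its inflectional variants (for example, a case-normalized form); the `group_key` callback, the function name `grouped_split`, and the greedy assignment of groups are illustrative assumptions rather than the procedure actually used.

```python
import random
from collections import defaultdict


def grouped_split(instances, group_key, test_fraction=0.1, seed=0):
    """Split instances so that all members of a group share a partition."""
    groups = defaultdict(list)
    for inst in instances:
        groups[group_key(inst)].append(inst)
    keys = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    target = test_fraction * len(instances)
    train, test = [], []
    for key in keys:
        # Fill the test partition group by group until it reaches the target.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test


# Each instance: (pattern, normalized key); the key is hypothetical.
data = [
    ("Louisą Zamperini", "louis zamperini"),
    ("Louiso Zamperini", "louis zamperini"),
    ("Charlesą Darwiną", "charles darwin"),
    ("Charleso Darwino", "charles darwin"),
]
train, test = grouped_split(data, group_key=lambda x: x[1], test_fraction=0.5)
print(len(train), len(test))
```

Performing this split once on W2 and deriving the augmented training sets from the training partition only is what keeps the test set identical across W2, W2+R2, and W2+R2+W1.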
In all experiments, we encoded Lithuanian stress markers as unique symbols appended after the stressed character. Before training, each instance was split into individual characters. The input symbol set contained 64 unique characters, while the output symbol set contained 38.
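The encoding step can be illustrated with Unicode decomposition. In the sketch below, the three combining marks used for Lithuanian stress (grave, acute, tilde) are pulled out of each character and emitted as a separate symbol after it, while other diacritics (caron, ogonek, dot above, macron) stay attached to their letters. The token names `<grave>`, `<acute>`, and `<tilde>` and the function name `encode_stress` are hypothetical; the actual symbols used in the experiments are not specified in the text. The function is intended for the transcription side only, since the same marks can be ordinary orthography in source languages.

```python
import unicodedata

# Combining grave, acute, and tilde mapped to hypothetical stress tokens.
STRESS_MARKS = {"\u0300": "<grave>", "\u0301": "<acute>", "\u0303": "<tilde>"}


def encode_stress(word):
    """Split a stressed word into character tokens, with the stress
    marker emitted as its own token right after the stressed character."""
    clusters = []  # each entry: [base letter plus non-stress marks, stress token]
    for ch in unicodedata.normalize("NFD", word):
        if ch in STRESS_MARKS and clusters:
            clusters[-1][1] = STRESS_MARKS[ch]
        elif unicodedata.combining(ch) and clusters:
            clusters[-1][0] += ch          # keep caron, ogonek, etc. on the letter
        else:
            clusters.append([ch, None])
    tokens = []
    for base, stress in clusters:
        tokens.append(unicodedata.normalize("NFC", base))
        if stress:
            tokens.append(stress)
    return tokens


print(encode_stress("Čárlzas"))
print(encode_stress("Luĩsą"))
```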

2.2.4. Training Transcription Models

We trained three different transcription models, each representing a distinct approach to sequence modeling: a symbolic model based on weighted finite-state transducers (WFSTs), a recurrent neural network (RNN)-based sequence-to-sequence (seq2seq) model, and a Transformer-based seq2seq model. These models were selected to explore the trade-offs between model interpretability, training efficiency, and transcription accuracy.
Our first model was a weighted finite-state transducer (WFST), a symbolic approach rooted in probabilistic automata. The model architecture follows the pair n-gram framework [35] and builds on earlier work such as the hidden Markov model-based G2P framework [36]. As in Lee et al. [37], transcriptions are generated by composing the input string with a trained transducer, which yields a lattice of possible output sequences annotated with their associated probabilities. The best hypothesis is then selected using Viterbi decoding [38].
We used the OpenGrm [39] toolkit and the OpenFst-based [40] Pynini library [41] to train the transducer. Different n-gram orders (n = 2 to 10) were evaluated to balance context sensitivity with model complexity.
The second model we explored is a neural seq2seq architecture based on long short-term memory (LSTM) units. The encoder consisted of a single bidirectional LSTM layer, while the decoder was a single unidirectional LSTM. An attention mechanism [27] bridges the encoder and decoder, enabling the model to dynamically focus on relevant parts of the input sequence during decoding.
We used the Fairseq [42] implementation provided by Ref. [30], with default hyperparameter settings for key training parameters such as learning rate, weight decay, gradient clipping, and label smoothing [43]. For training, we used an Adam optimizer with inverse square root learning rate scheduling and applied early stopping based on validation loss.
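For readers unfamiliar with inverse square root scheduling, the sketch below shows its usual shape: a linear warm-up to a peak learning rate, followed by decay proportional to the inverse square root of the update number. The warm-up phase and all numeric values (`peak_lr`, `warmup_steps`, `init_lr`) are assumptions chosen for illustration; the paper does not report the values used.

```python
import math


def inverse_sqrt_lr(step, peak_lr=1e-3, warmup_steps=4000, init_lr=1e-7):
    """Learning rate at a given update under inverse-sqrt scheduling."""
    if step < warmup_steps:
        # Linear warm-up from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # After warm-up: decay proportionally to 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_steps / step)


for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```

With these settings the rate peaks at the end of warm-up and halves each time the update count quadruples.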
Our third model was a Transformer-based neural seq2seq model, optimized for character-level transduction and described by Wu et al. [29]. This architecture replaces recurrence with multi-head self-attention mechanisms and position-wise feedforward layers, enabling efficient parallel training and improved modeling of long-range dependencies.
We adopted a four-layer Transformer architecture for both the encoder and the decoder, each using pre-layer normalization to improve training stability. Again, we used the Fairseq implementation provided by Ref. [30]. Dropout and label smoothing were applied to mitigate overfitting.
For both neural models and for each data set, we conducted hyperparameter tuning on a held-out development set. The following parameters were adjusted: the dimensionality of the encoder embedding layer (EEL), encoder hidden layer (EHL), decoder embedding layer (DEL), and decoder hidden layer (DHL), along with the dropout rate (DOUT) and batch size (BSIZE). Grid search was used to identify optimal configurations. The beam search decoding algorithm was employed with beam sizes ranging from 3 to 10, depending on the model size and sequence complexity. Hyperparameter settings were kept consistent across models as much as possible to ensure fair comparison. Detailed results of the ablation experiments for the Transformer model are provided in Appendix A.

2.3. Evaluation Metrics

To evaluate and compare the performance of the transcription models, we employed word error rate (WER) as the primary metric. WER was defined as the proportion of word instances in which the predicted transcription differed from the reference transcription, with a lower WER indicating superior model performance. In the calculation of WER, each word-pair instance was treated as comprising two discrete word forms.
To further disentangle the contribution of stress placement errors from other symbol-level errors within the WER, we introduced an auxiliary metric termed the stress-compensated word error rate (WER-s). The WER-s is computed analogously to the WER, but after all stress markers have been removed from both the predicted and the reference transcriptions. The difference between the WER and the WER-s thus reflects the proportion of total errors attributable exclusively to incorrect stress placement (i.e., position and/or type).
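Both metrics can be expressed compactly in code. The sketch below assumes that predictions and references are whitespace-separated strings with stress written as combining grave, acute, or tilde marks, and that words are compared position by position; the function names and the handling of predictions with missing words are assumptions.

```python
import unicodedata

STRESS = {"\u0300", "\u0301", "\u0303"}  # combining grave, acute, tilde


def strip_stress(text):
    """Remove stress marks while keeping other diacritics (caron, ogonek, ...)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if c not in STRESS)
    return unicodedata.normalize("NFC", kept)


def wer(predictions, references, ignore_stress=False):
    """Share of reference words whose predicted form differs.
    With ignore_stress=True this yields the stress-compensated WER-s."""
    errors = total = 0
    for pred, ref in zip(predictions, references):
        if ignore_stress:
            pred, ref = strip_stress(pred), strip_stress(ref)
        pred_words = unicodedata.normalize("NFC", pred).split()
        ref_words = unicodedata.normalize("NFC", ref).split()
        total += len(ref_words)
        for i, ref_word in enumerate(ref_words):
            hyp = pred_words[i] if i < len(pred_words) else ""
            errors += hyp != ref_word
    return errors / total if total else 0.0


preds = ["Čárlzas Dárvinas", "Šárlis Mišèlis"]
refs = ["Čárlzas Dárvinas", "Šarlìs Mišèlis"]
print(wer(preds, refs))                      # one of four words differs
print(wer(preds, refs, ignore_stress=True))  # the only error was a stress error
```

In this toy example the gap between the two values is entirely due to stress placement, which is the quantity the WER minus WER-s difference is meant to isolate.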

3. Results

3.1. Model Accuracy

Weighted finite-state transducer, neural encoder–decoder, and transformer models were successfully trained. On the training sets, the word error rates (WERs) achieved by these models were close to the oracle WER values reported in Table 5. However, their performance on the test sets showed significantly higher WERs, indicating a drop in generalization. Table 6 presents the best results obtained for each model type and dataset, based on a search over the respective hyperparameter spaces.
The WFST model demonstrated the lowest performance among the evaluated approaches. The considerable gap between the WER and the WER-s (see Figure 3) indicates that it struggled particularly with accurately predicting stress location and type compared to the neural models. Detailed analysis of its output revealed frequent issues, such as missing or multiple stress markers, even though each training instance contained exactly one stress marker. Additionally, data augmentation negatively affected the WFST model, likely due to increased distributional mismatch between the training and test data.
In contrast, both neural models—encoder–decoder and transformer—performed significantly better. Across all training sets, the best or near-best results were achieved using a smaller encoder (EEL = 128, EHL = 512) and a larger decoder (DEL = 256, DHL = 1024).
Overall, the transformer model outperformed the encoder–decoder model, particularly in the single-word transcription task (W1), where the differences in WER and WER-s are statistically significant. For augmented word-pair tasks, the transformer also achieved a significantly lower WER. However, the differences in WER-s were not statistically significant, suggesting that both models are comparable in basic character-to-character transcription, while the transformer demonstrates superior handling of stress regularities.
Interestingly, both models performed worse on the W2+R2+W1 dataset compared to W2+R2, suggesting that not all types of data augmentation are beneficial. We attribute this decline to the mismatch between training and test data distributions—specifically, the inclusion of single-word training instances in W2+R2+W1, while the test set contained only word-pair instances.

3.2. Detailed Error Analysis

Since the training datasets were generated via a semi-automated data processing pipeline, we conducted a more in-depth investigation into the nature of transcription errors. The term “error” can be misleading in this context, as discrepancies between the reference transcription (in the test set) and the model’s prediction are not necessarily genuine mistakes—particularly when the reference itself is incorrect.
To explore this further, we selected the 100 most frequent unique transcription mismatches (representing 9.4% of the total error mass) made by the best-performing model (a Transformer trained on the W2+R2 dataset) and attempted to systematically categorize them.
First, we manually reviewed and corrected the reference transcriptions where necessary. These 100 adjusted entries form what we refer to as the golden standard. Next, we categorized the observed mismatches—both between the original references and the golden standard, and between the model predictions and the golden standard—into four levels of severity, as shown in Table 7.
The distribution of these mismatches is captured in Table 8, which cross-tabulates the reference vs. predicted errors by category:
Assuming this sample is representative of the full dataset, we can presume that the training and test data still contain a significant amount of noise. In 53.9% of test instances, the reference transcription deviates from the golden standard even though only 11.5% of instances show perceptually significant mismatches. This suggests that the pattern-filtering techniques used in the preprocessing pipeline (as described in Section 2.2.1) are effective at reducing severe errors but allow smaller inconsistencies to persist.
Consequently, the WER and WER-s metrics reported earlier should be interpreted with caution. They likely represent pessimistic estimates of actual model performance.
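A cross-tabulation of this kind can be sketched in a few lines. The four rows below are copied from Table A2; weighting each row by its number of occurrences is an assumption about how the shares are computed, and the helper names are hypothetical.

```python
from collections import Counter

# (occurrences, reference category, prediction category); four rows copied
# from Table A2. Categories: 0 = match, s/m/l = small/medium/large mismatch.
ROWS = [
    (101, "0", "l"),  # Donald Tusk
    (65, "m", "m"),   # Oprah Winfrey
    (60, "0", "s"),   # Alexio Tsipro
    (19, "s", "0"),   # Alexis Tsipras
]


def cross_tabulate(rows):
    """Sum occurrences for every (reference, prediction) category pair."""
    table = Counter()
    for occ, ref_cat, pred_cat in rows:
        table[(ref_cat, pred_cat)] += occ
    return table


def share(rows, side, categories):
    """Occurrence-weighted share of rows whose `side` category is in `categories`."""
    index = {"ref": 1, "pred": 2}[side]
    total = sum(row[0] for row in rows)
    hits = sum(row[0] for row in rows if row[index] in categories)
    return hits / total


print(cross_tabulate(ROWS))
print(share(ROWS, "ref", {"s", "m", "l"}))  # share of imperfect references
```

Applied to all 100 rows of Table A2, the same two functions would give the full cross-tabulation and the share of references that deviate from the gold standard; the four-row subset here only illustrates the mechanics.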
Analysis of the largest error category, perceptually prominent mismatches, reveals several recurring error types. The most common are “language misidentification” errors, in which the model appears to apply the transcription rules of a more frequent language (typically English) to names from less represented languages. The term is used loosely here, since the inputs often consist of mixed-language tokens rather than text in a single, clearly defined language. Examples include:
  • Tusk (Polish surname) transcribed as Tãsk, as if it were English.
  • Mujica (Uruguayan) rendered as Mužìka instead of Muchìka, as if it were French.
  • Sergio (Italian) misrendered as Ser̃chijo instead of Ser̃džijo, as if it were Spanish.
Some other errors result from missing diacritics in the original names. For instance, Walesa becomes Valèsa instead of the correct Valènsa or Valeñsa. The correct source should have been Wałęsa, which carries diacritic marks crucial for accurate transcription.
The full list of the top 100 transcription errors, including their golden standard corrections and assigned categories, is provided in Appendix B.

4. Conclusions

This study represents our initial investigation into the transcription of proper names in Lithuanian. We developed a data processing pipeline to transform raw web-crawled data into a training set suitable for this practical transcription task. For Lithuanian, the pipeline is semi-automatic due to the requirement for manual stress annotation by human labelers. However, for other target languages with more regular stress patterns (e.g., French or Latvian), this step could be omitted, potentially enabling a fully automatic pipeline.
The filtering and normalization techniques applied to the crawled data were effective, reducing noise by 10.7% (based on occurrence frequency) and decreasing the number of pattern types by 24.6%. Residual noise was assessed indirectly through detailed inspection of the 100 most frequent instances in the test set. Approximately 54% of the inspected transcriptions could be improved, though only around 11% required perceptually significant corrections. These findings suggest that the training data quality is sufficient to enable convergence of sequence-to-sequence models, although it remains below the standard typically achieved by human-curated datasets.
Among the models evaluated, the best-performing was a sequence-to-sequence Transformer, which achieved a word error rate (WER) of 19.04%, or 15.66% when accounting for stress compensation, on a word-pair dataset augmented with word order reversals. While this performance still lags behind human transcription accuracy (with an oracle WER of 5.43%), it represents a substantial improvement for text-to-speech (TTS) systems that currently lack any effective loanword adaptation component.
Furthermore, models trained on word pairs outperformed those trained on isolated words. The best-performing single-word model yielded a WER of 25.49% (19.81% stress-compensated), supporting previous findings that sequence-to-sequence models benefit from extended input context. However, to draw more definitive conclusions about the relative effectiveness of model architectures or hyperparameter configurations, further reduction in residual noise in the test data is needed.
We also examined data augmentation strategies, such as word order reversal and combining single-word and word-pair datasets. The results show that word order reversal led to a 5% relative improvement in model performance, while the combination of single-word and word-pair datasets proved less effective.
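Word order reversal and the relative improvement can be sketched as follows. This is an illustration under stated assumptions: it assumes that reversal swaps the two words on both the source and the transcription side, and that the 5% figure compares the best W2 result with the best W2+R2 result in Table A1 (19.98% and 19.04% WER). The function names are hypothetical.

```python
def reverse_pair(source, target):
    """Swap the two words of a word-pair instance on both sides."""
    src_words, tgt_words = source.split(), target.split()
    if len(src_words) != 2 or len(tgt_words) != 2:
        raise ValueError("expected exactly two words on each side")
    return " ".join(reversed(src_words)), " ".join(reversed(tgt_words))


def augment_with_reversals(pairs):
    """Return the original word pairs followed by their reversed copies."""
    return list(pairs) + [reverse_pair(s, t) for s, t in pairs]


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


print(augment_with_reversals([("Donald Tusk", "Dònald Tùsk")]))
print(round(relative_improvement(19.98, 19.04), 3))  # 0.047
```

With these two WER values the relative reduction is about 4.7%, which is consistent with the rounded 5% stated above.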
Several important research and practical questions remain open to future investigation:
  • Alternative approaches. Exploring alternative approaches to the practical transcription problem, as introduced in Section 1.1 and Section 1.2, remains a promising direction for future work. The Generative AI approach may yield acceptable performance in handling mixed-language tokens, particularly when pre-trained large language models (LLMs) are fine-tuned or instruction-tuned for this specific task. Within the end-to-end machine learning framework, LLMs could also assist in cleaning and normalizing raw data patterns. Furthermore, LLMs may be employed to extract supplementary features—such as speaker nationality or linguistic background—which could enhance the predictive performance of transcription models.
  • Stress modeling. This study incorporated several arbitrary choices regarding stress modeling. Stress placement was integrated into the end-to-end system, with the stress mark encoded as an additional ASCII symbol following the stressed character. An alternative approach would be to treat stress placement as an independent machine learning task. The stress mark could be embedded within an accented character, forming a single output symbol. This strategy would expand the output symbol set rather than increase the length of the output sequence. Additionally, weak supervision techniques [44] or semi-supervised [45] stress modeling approaches could be explored to partially automate this task.
  • Experimenting with the dataset. It is important to continue manual cleaning of the test set to establish a fully curated, gold-standard benchmark, thereby increasing confidence in the accuracy estimates. The dataset could also be enriched with longer name sequences (e.g., three or more words) to better reflect the diversity of naming conventions. Additionally, since morphological analysis is already a component of the data processing pipeline, the dataset could be augmented with inflected forms of names not originally present. For example, from the nominative form Charlesas Darwinas → Čarlzas Darvinas, one could derive genitive forms such as Charles’o Darwin’o → Čarlzo Darvino or dative forms such as Charlesui Darwinui → Čarlzui Darvinui.
  • Improving data filtering. The current alignment procedure appears overly permissive. Although stricter alignment criteria may require additional linguistic resources, a promising improvement would be to assign language labels to each permitted substitution. This would restrict substitutions to those within the same language set, potentially enhancing data consistency, filtering precision, and overall transcription accuracy.
We plan to explore these avenues in subsequent studies, with the goal of building a more robust and linguistically informed transcription framework.
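The two stress encodings discussed under "Stress modeling" can be contrasted with a small sketch. The ASCII stand-ins for the stress marks are hypothetical (the actual symbols used in the study are not given here), and the function is only meant to show the trade-off between output length and symbol-set size.

```python
import unicodedata

# Assumption: hypothetical ASCII stand-ins for the three stress marks.
STRESS_TO_ASCII = {"\u0300": "`", "\u0301": "'", "\u0303": "~"}


def encode(word, fused=False):
    """Split a stressed word into output symbols.

    fused=False: the stress mark is a separate symbol after the stressed character.
    fused=True: the stress mark stays attached, giving one accented symbol.
    """
    symbols = []
    base = -1  # index of the most recent base character in `symbols`
    for ch in unicodedata.normalize("NFD", word):
        if ch in STRESS_TO_ASCII:
            if fused and base >= 0:
                symbols[base] += ch
            else:
                symbols.append(STRESS_TO_ASCII[ch])
        elif unicodedata.combining(ch) and base >= 0:
            symbols[base] += ch  # ordinary diacritic, e.g. the caron of č
        else:
            symbols.append(ch)
            base = len(symbols) - 1
    return [unicodedata.normalize("NFC", s) for s in symbols]


print(encode("Tãsk"))              # ['T', 'a', '~', 's', 'k']
print(encode("Tãsk", fused=True))  # ['T', 'ã', 's', 'k']
```

The separate-marker form adds one symbol to the output sequence per stressed word, while the fused form keeps the sequence length unchanged at the cost of a larger output alphabet.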

Author Contributions

Conceptualization, G.R. and T.K.; methodology, D.V.-A.; software, T.K. and D.K.; validation, A.M., D.A., D.K. and G.R.; resources, G.R. and D.K.; writing—original draft preparation, G.R.; writing—review and editing, D.V.-A. and A.M.; supervision, G.R. and T.K.; project administration, T.K. and A.Č. All authors have read and agreed to the published version of the manuscript.

Funding

This research was co-funded by the European Union under Horizon Europe programme grant agreement No. 101059903, and by the European Union funds for the period 2021–2027 and the state budget of the Republic of Lithuania under financial agreement No. 10-042-P-0001.

Data Availability Statement

The original data presented in the study are openly available in Clarin LT repository at https://clarin.vdu.lt/xmlui/handle/20.500.11821/68.

Acknowledgments

The authors would like to express their sincere gratitude to Regina Sabonytė for her valuable linguistic expertise regarding the placement of stress markers in the transcribed personal names.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
WFST: Weighted Finite State Transducer
LSTM: Long Short-Term Memory
WER: Word Error Rate
WER-s: Stress-compensated Word Error Rate
TTS: Text-to-Speech
G2P: Grapheme-to-Phoneme
LangID: Language identification task
LLM: Large Language Model
IPA: International Phonetic Alphabet
GPT: Generative Pre-trained Transformer
RNN: Recurrent Neural Network
ML: Machine Learning
ASCII: American character encoding standard

Appendix A

Table A1. Results of ablation experiments using the Transformer model. EEL, EHL, DEL, and DHL denote the dimensionality of the encoder’s (E) and decoder’s (D) embedding (E) and hidden (H) layers, respectively. The table reports the dependency of the estimated word error rate (WER) and stress-adjusted word error rate (WER-s) on different hyperparameter values.
TaskEELEHLDELDHLBatch
Size
Dropout
Rate
WER, %WER-s, %
1285121285122560.129.13 ± 0.8522.99 ± 0.76
1285121285122560.326.95 ± 0.2921.22 ± 0.18
12851212851210240.127.88 ± 0.4321.51 ± 0.48
12851212851210240.328.44 ± 0.2922.33 ± 0.19
12851225610242560.127.90 ± 0.1922.42 ± 0.19
12851225610242560.327.50 ± 0.3921.96 ± 0.35
128512256102410240.128.65 ± 0.1422.46 ± 0.20
W1128512256102410240.325.49 ± 0.2519.81 ± 0.22
25610241285122560.127.98 ± 0.5921.95 ± 0.45
25610241285122560.327.85 ± 0.2121.71 ± 0.19
256102412851210240.129.35 ± 0.5223.20 ± 0.54
256102412851210240.327.73 ± 0.6121.32 ± 0.54
256102425610242560.128.26 ± 0.3121.69 ± 0.28
256102425610242560.326.83 ± 0.4321.16 ± 0.44
2561024256102410240.128.84 ± 0.2622.17 ± 0.26
2561024256102410240.327.22 ± 0.5521.27 ± 0.59
1285121285122560.121.58 ± 0.3617.31 ± 0.29
1285121285122560.321.31 ± 0.1416.39 ± 0.09
12851212851210240.121.70 ± 0.3916.89 ± 0.32
12851212851210240.321.88 ± 0.3217.10 ± 0.17
12851225610242560.120.59 ± 0.2416.77 ± 0.18
12851225610242560.319.98 ± 0.1916.19 ± 0.14
128512256102410240.121.12 ± 0.2517.10 ± 0.22
W2128512256102410240.320.81 ± 0.3016.47 ± 0.22
25610241285122560.121.60 ± 0.1817.56 ± 0.13
25610241285122560.320.65 ± 0.4016.14 ± 0.25
256102412851210240.121.53 ± 0.2517.49 ± 0.27
256102412851210240.321.46 ± 0.0716.90 ± 0.09
256102425610242560.120.83 ± 0.1617.11 ± 0.17
256102425610242560.320.47 ± 0.2716.90 ± 0.24
2561024256102410240.121.48 ± 0.2917.34 ± 0.24
2561024256102410240.320.60 ± 0.2016.41 ± 0.09
1285121285122560.120.01 ± 0.1616.22 ± 0.14
1285121285122560.320.78 ± 0.1416.44 ± 0.09
12851212851210240.120.26 ± 0.1316.52 ± 0.22
12851212851210240.320.78 ± 0.0916.39 ± 0.06
12851225610242560.120.06 ± 0.1416.48 ± 0.13
12851225610242560.319.04 ± 0.1815.66 ± 0.12
128512256102410240.120.19 ± 0.2716.49 ± 0.24
W2+R2128512256102410240.319.54 ± 0.2016.15 ± 0.08
25610241285122560.120.20 ± 0.2416.69 ± 0.21
25610241285122560.320.02 ± 0.1316.12 ± 0.13
256102412851210240.120.49 ± 0.3216.68 ± 0.32
256102412851210240.320.35 ± 0.1816.25 ± 0.14
256102425610242560.119.73 ± 0.1916.24 ± 0.16
256102425610242560.319.57 ± 0.2216.40 ± 0.19
2561024256102410240.120.28 ± 0.3016.55 ± 0.25
2561024256102410240.319.96 ± 0.0816.43 ± 0.10
1285121285122560.119.83 ± 0.2116.25 ± 0.23
1285121285122560.320.92 ± 0.0616.53 ± 0.02
12851212851210240.120.07 ± 0.1116.42 ± 0.08
12851212851210240.320.91 ± 0.0716.70 ± 0.06
12851225610242560.119.68 ± 0.1916.70 ± 0.17
12851225610242560.319.58 ± 0.0516.02 ± 0.08
128512256102410240.119.30 ± 0.1716.16 ± 0.17
W2+R2+R1128512256102410240.319.86 ± 0.1216.26 ± 0.09
25610241285122560.119.95 ± 0.2316.28 ± 0.08
25610241285122560.320.12 ± 0.1616.25 ± 0.06
256102412851210240.120.14 ± 0.2216.38 ± 0.25
256102412851210240.320.55 ± 0.1216.58 ± 0.13
256102425610242560.120.16 ± 0.1116.77 ± 0.07
256102425610242560.319.51 ± 0.1516.30 ± 0.12
2561024256102410240.119.96 ± 0.1116.62 ± 0.08
2561024256102410240.319.51 ± 0.1716.26 ± 0.15

Appendix B

Table A2. The list of the 100 most frequent mismatches between the transcription reference and prediction, along with their categorization based on the gold standard. The symbols 0, s, m, and l denote a match, small mismatch, medium mismatch, and large mismatch, respectively.
| Occ. | Original | Reference | Prediction | Golden Standard | Ref. Cat. | Pred. Cat. |
|---|---|---|---|---|---|---|
| 9 | Alexį Tsiprą | Alèksį Cìprą | Ãleksį Cìprą | | 0 | m |
| 5 | Alexio Tsipro | Alèksio Tsìpro | Alèksijo Cìpro | Alèksio Cìpro | s | s |
| 60 | Alexio Tsipro | Alèksio Cìpro | Alèksijo Cìpro | | 0 | s |
| 19 | Alexis Tsipras | Alèksis Tsìpras | Alèksis Cìpras | | s | 0 |
| 5 | Ali Zeidanas | Ãli Zeidãnas | Alì Zeidãnas | Ãli Zeĩdenas | m | m |
| 6 | Amy Poehler | Eĩmi Poũler | Eĩmi Pèler | | 0 | l |
| 6 | Andrea Bocelli | Andrèa Bočèli | Andrė̃ja Bočèli | | 0 | s |
| 14 | Anna Wintour | Ãna Viñtur | Ãna Vintùr | | 0 | s |
| 11 | Bambangas Soelistyo | Bambángas Sulìstjo | Bambángas Soelìsto | Bembéngas Sulìstjo | m | l |
| 10 | Blake Lively | Bleĩk Láivli | Bleĩk Lìvli | | 0 | l |
| 6 | Brendan Gilligan | Bréndan Gìligan | Bréndan Džìligan | | 0 | l |
| 9 | Buzz Aldrin | Bãz Òldrin | Bùz Áldrin | Bàz Òldrin | s | l |
| 5 | Buzzas Aldrinas | Bãzas Òldrinas | Bùzas Al̃drinas | Bàzas Òldrinas | s | l |
| 14 | Carlos Delfino | Kárlos Delfìno | Kárlos Del̃fino | | 0 | m |
| 9 | Caroline Kennedy | Kèrolain Kènedi | Karolìn Kènedi | | 0 | l |
| 6 | Carrie Prejean | Kèri Preidžán | Kèri Prežán | | m | 0 |
| 6 | Chris Cassidy | Krìs Kãsidi | Krìs Kẽsidi | | m | 0 |
| 10 | Cindy Crawford | Siñdi Kráuford | Siñdi Kroũford | Siñdi Krõford | s | s |
| 13 | Cindy Crawford | Siñdi Kròford | Siñdi Kroũford | Siñdi Krõford | s | s |
| 8 | David Lynch | Deĩvid Liñč | Deĩvid Lỹnč | | 0 | s |
| 10 | Dilmai Rousseff | Dìlmai Rùsef | Dil̃mai Rùsef | Dil̃mai Rusèf | s | s |
| 101 | Donald Tusk | Dònald Tùsk | Dònald Tãsk | | 0 | l |
| 5 | Donaldas Tuskas | Dònaldas Tùskas | Dònaldas Tãskas | | 0 | l |
| 5 | Ene Ergma | Èn Èrgma | Èn Er̃gma | Ène Èrgma | l | l |
| 5 | Geir Lundestad | Geĩr Luñdestad | Geĩr Liùndestad | | 0 | s |
| 6 | Yingluck Shinawatra | Jiñglak Činavãta | Iñglak Šinavãtra | Jiñglak Šinavãtra | l | s |
| 15 | Yingluck Shinawatra | Jinglùk Šinavãtra | Iñglak Šinavãtra | Jiñglak Šinavãtra | m | s |
| 5 | Yingluck Shinawatros | Jiñglak Činavãtos | Iñglak Šinavãtos | Jiñglak Šinavãtros | l | l |
| 5 | Yukio Edano | Jùkio Edãno | Jùkijo Edãno | | 0 | 0 |
| 8 | Jean Sibelius | Ján Sibèlijus | Žán Sibèlijus | | l | 0 |
| 5 | Jeanui Monnet | Žãnui Monè | Žãnui Monė̃ | | s | 0 |
| 8 | Jennifer Hudson | Džènifer Hãdson | Džènifer Hàdson | | s | 0 |
| 6 | Jerry Rubin | Džèri Rùbin | Džèri Rãbin | | 0 | l |
| 5 | Jiroemonas Kimura | Džiroemònas Kimùra | Žirumònas Kimùra | | 0 | l |
| 8 | Joakim Noah | Žoakìm Nòa | Joakìm Nòa | Džoũakim Noũa | l | l |
| 8 | Johnas Kirby | Džònas Kir̃bi | Džònas Ker̃bi | | s | 0 |
| 10 | Johno Kirby | Džòno Ker̃bi | Džòno Kir̃bi | | 0 | s |
| 14 | Jose Mujica | Chosė̃ Muchìka | Chosė̃ Mužìka | | 0 | l |
| 6 | Josephas Muscatas | Džòzefas Muskãtas | Džòzefas Maskãtas | Džoũzefas Maskãtas | m | s |
| 5 | Juan Carlos | Chuán Kárl | Chuán Kárlos | | l | 0 |
| 6 | Kei Nishikori | Kèi Nišikòri | Keĩ Nišikòri | | 0 | 0 |
| 8 | Kemalis Kilicdaroglu | Kemãlis Kiličdaròhlu | Kemãlis Kilikdaròhlu | | 0 | l |
| 5 | Kenneth Campbell | Kènet Kémbel | Kènet Kémpbel | | 0 | s |
| 6 | Kerem Gonlum | Kerèm Geñlium | Kerèm Gònlam | | 0 | l |
| 17 | Kianoushas Jahanpouras | Ki-ãnušas Džahanpū̃ras | Ki-ãnušas Džahanpùras | Ki-anùšas Džahanpū̃ras | s | m |
| 10 | Konrad Adenauer | Kònrad Ãdenauer | Kònrad Adenáuer | | 0 | s |
| 5 | Kurt Vonnegut | Kùrt Vònegut | Kùrt Fònegut | Kùrt Vãnegat | m | l |
| 12 | Lance Stephenson | Leñs Stèfenson | Leñs Stìvenson | Leñs Stỹvenson | l | s |
| 14 | Laurent Gbagbo | Lòren Gbãgbo | Lorán Gbãgbo | Lorán Gbagbò | s | s |
| 9 | Laurent’as Fabiusas | Lorãnas Fãbijusas | Lorãnas Fabiùsas | | s | 0 |
| 21 | Lech Walesa | Lèch Valènsa | Lèch Valèsa | | 0 | l |
| 5 | Lechą Walesą | Lèchą Valènsą | Lèchą Valèsą | | 0 | l |
| 30 | Lechas Walesa | Lèchas Valènsa | Lèchas Valèsa | | 0 | l |
| 10 | Lecho Walesos | Lècho Valènsos | Lècho Valèsos | | 0 | l |
| 6 | Mariah Carey | Marãja Kèri | Marìja Kèri | Merãja Kèri | s | l |
| 18 | Mariah Carey | Merãja Kèri | Marìja Kèri | | 0 | l |
| 12 | Marie Trintignant | Marì Trentinján | Marỹ Trintinján | Marỹ Trentinján | s | s |
| 6 | Marisol Touraine | Marizòl Tureñ | Marisòl Turèn | | m | 0 |
| 6 | Marissa Mayer | Marìsa Mèjer | Marìsa Mãjer | | s | 0 |
| 10 | Marlon Brando | Mar̃lon Brándo | Mar̃lon Breñdo | Márlon Bréndou | s | s |
| 11 | Michel Hazanavicius | Mišèl Hazanãvičius | Mišèl Azanãvičius | | s | 0 |
| 6 | Milton Friedman | Mil̃ton Friẽdman | Mil̃ton Frìdman | Mil̃ton Frỹdmen | l | m |
| 6 | Milton Friedman | Mil̃ton Frỹdman | Mil̃ton Frìdman | Mil̃ton Frỹdmen | s | m |
| 16 | Monta Ellis | Mònta Èlis | Mònta Ẽlis | Mòntei Èlis | s | m |
| 6 | Navi Pillay | Nãvi Piláj | Nãvi Pilái | Nãvi Pìlei | m | m |
| 23 | Nene Hilario | Nèn Ilãrijo | Nèn Hìlario | Nenè Ilãriju | l | l |
| 10 | Nicki Minaj | Nìki Minãdž | Nìki Minãj | Nìki Minãž | l | l |
| 65 | Oprah Winfrey | Òpra Vìnfri | Òpra Viñfri | Òupra Viñfri | m | m |
| 7 | Peter Hess | Pẽter Hès | Pìter Hès | | m | 0 |
| 6 | Peteris Altmaieris | Pė̃teris Áltmajeris | Pė̃teris Altmãjeris | | 0 | s |
| 6 | Peteris Lerneris | Pìteris Ler̃neris | Pė̃teris Ler̃neris | | 0 | l |
| 7 | Peteris Szijjarto | Pė̃teris Sìjarto | Pė̃teris Šidžárto | | 0 | l |
| 6 | Pol Pot | Pòl Pòt | Põl Pòt | | 0 | s |
| 7 | Prabowo Subianto | Prãbovo Subi-ánto | Prabòvo Subi-ánto | | s | 0 |
| 5 | Prosper Merimee | Pròsper Merìm | Pròsper Merìmi | Prospèr Merimė̃ | l | l |
| 10 | Raffaele Sollecito | Rafaèl Solečìto | Rafaèl Solesìto | Rafaèle Solèčito | m | l |
| 6 | Ralphas Fiennesas | Rálfas Fáinsas | Rálfas Fi-ènesas | Reĩfas Fáinzas | l | l |
| 10 | Ryan Toolson | Raján Tū̃lson | Raján Tùlson | Rãjan Tū̃lson | m | m |
| 18 | Sabine Kehm | Sabìn Kė̃m | Sabìn Kèm | Zabỹne Kė̃m | l | l |
| 6 | Sakellie Daniels | Sakelì Dẽni-els | Sakelì Dãni-els | Sakèli Dẽni-els | s | s |
| 5 | Salma Hayek | Sèlma Hãjek | Sálma Hãjek | | 0 | m |
| 6 | Salvador Allende | Salvadòr Aljènd | Salvadòr Alènd | Salvadòr Aljènde | l | l |
| 4 | Salvadoras Allende | Salvadòras Aljènd | Salvadòras Alènd | Salvadòras Aljènde | l | l |
| 5 | Samantha Murray | Samánta Miurė̃j | Samánta Mùrėj | | 0 | m |
| 19 | Sergio Mattarella | Ser̃džijo Matarèla | Ser̃chijo Matarèla | | 0 | l |
| 4 | Shuji Nakamura | Šiùdži Nakamùra | Šùdži Nakamùra | | s | 0 |
| 6 | Syd Barrett | Sìd Bãret | Sáid Bãret | Sìd Bẽret | s | l |
| 9 | Silvio Berlusconi | Sìlvijo Berluskòni | Sil̃vijo Berluskòni | | 0 | 0 |
| 5 | Simon Frekley | Sáimon Frìkli | Sáimon Frèkli | | s | 0 |
| 4 | Stephenas Mullas | Stìvenas Mãlas | Stìvenas Màlas | Stỹvenas Màlas | s | s |
| 9 | Steve Wozniak | Stìv Vòzniak | Stỹv Vòzniak | | s | 0 |
| 7 | Steven Theede | Stỹven Tỹd | Stìven Tìd | | 0 | s |
| 4 | Taneras Yildizas | Tãneras Jildìzas | Tanèras Jildìzas | | 0 | s |
| 14 | Thomas Hobbes | Tòmas Hòbs | Tòmas Hòbes | Tòmas Hòbz | s | l |
| 19 | Timothy Geithner | Tìmoti Gáitner | Tìmoti Geĩtner | | 0 | l |
| 13 | Valerie Trierweiler | Valerì Trìjerveler | Valerì Trỹrvailer | Valerỹ Trijervailèr | m | m |
| 28 | Valerie Trierweiler | Valerì Trirveĩler | Valerì Trỹrvailer | Valerỹ Trijervailèr | m | m |
| 4 | Vitaly Kamluk | Vitãli Kamliùk | Vitãli Kamlùk | | 0 | 0 |
| 9 | Woodrow Wilson | Vùdrov Vìlson | Vùdrou Vìlson | | s | 0 |
| 5 | Woodrow Wilsono | Vùdrau Vìlsono | Vùdrou Vìlsono | | s | 0 |

References

  1. McArthur, T. Transliteration. In Concise Oxford Companion to the English Language; Oxford University Press: Oxford, UK, 2018; Available online: https://www.encyclopedia.com/humanities/encyclopedias-almanacs-transcripts-and-maps/transliteration (accessed on 7 May 2025).
  2. Superanskaja, A.V. Teoreticheskie Osnovy Prakticheskoj Transkripcii [Theoretical Foundations of Practical Transcription], 2nd ed.; LENAND: Moscow, Russia, 2018; pp. 10–40. Available online: https://archive.org/details/raw-..-2018/page/1/mode/2up (accessed on 7 May 2025). (In Russian)
  3. Lui, M.; Baldwin, T. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Republic of Korea, 9–11 July 2012; pp. 25–30.
  4. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv 2017, arXiv:1607.01759. Available online: https://arxiv.org/abs/1607.01759 (accessed on 7 May 2025).
  5. Google. Compact Language Detector v3 (CLD3). Available online: https://github.com/google/cld3 (accessed on 7 May 2025).
  6. Papariello, L. XLM-Roberta-Base Language Detection. Hugging Face. 2021. Available online: https://huggingface.co/papluca/xlm-roberta-base-language-detection (accessed on 7 May 2025).
  7. Apple. Language Identification from Very Short Strings. Apple Machine Learning Research. 2019. Available online: https://machinelearning.apple.com/research/language-identification-from-very-short-strings (accessed on 7 May 2025).
  8. Toftrup, M.; Sørensen, S.A.; Ciosici, M.R.; Assent, I. A reproduction of Apple’s bi-directional LSTM models for language identification. arXiv 2021, arXiv:2102.06282. Available online: https://arxiv.org/abs/2102.06282 (accessed on 24 April 2025).
  9. Moillic, J.; Ismail Fawaz, H. Language Identification for Very Short Texts: A review. Medium. 2022. Available online: https://medium.com/besedo-engineering/language-identification-for-very-short-texts-a-review-c9f2756773ad (accessed on 7 May 2025).
  10. Kostelac, M. Comparison of Language Identification Models. ModelPredict. 2021. Available online: https://modelpredict.com/language-identification-survey (accessed on 7 May 2025).
  11. International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet; Cambridge University Press: Cambridge, UK, 1999.
  12. OpenAI. GPT-4 Technical Report. 2023. Available online: https://arxiv.org/pdf/2303.08774 (accessed on 24 April 2025).
  13. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. Available online: https://arxiv.org/abs/2109.01652 (accessed on 7 May 2025).
  14. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. Available online: https://arxiv.org/abs/2203.02155 (accessed on 7 May 2025).
  15. Zhang, S.; Liang, Y.; Shin, R.; Chen, M.; Du, Y.; Li, X.; Ram, A.; Zhang, Y.; Ma, T.; Finn, C. Instruction tuning for large language models: A survey. arXiv 2023, arXiv:2308.10792. Available online: https://arxiv.org/abs/2308.10792 (accessed on 7 May 2025).
  16. Ainsworth, W. A system for converting English text into speech. IEEE Trans. Audio Electroacoust. 1973, 21, 288–290.
  17. Elovitz, H.; Johnson, R.; McHugh, A.; Shore, J. Letter-to-sound rules for automatic translation of English text to phonetics. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 446–459.
  18. Divay, M.; Vitale, A.J. Algorithms for grapheme-phoneme translation for English and French: Applications for database searches and speech synthesis. Comput. Linguist. 1997, 23, 495–523.
  19. Damper, R.I.; Eastmond, J.F. A comparison of letter-to-sound conversion techniques for English text-to-speech synthesis. Comput. Speech Lang. 1997, 11, 33–73.
  20. Finch, A.; Sumita, E. Phrase-based machine transliteration. In Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation, Hyderabad, India, 11 January 2008.
  21. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the NeurIPS 2014, Montreal, Canada, 8–11 December 2014.
  22. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
  23. Gehring, J.; Auli, M.; Grangier, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the ICML 2017, Sydney, Australia, 6–11 August 2017.
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NeurIPS 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
  25. Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351.
  26. Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; Jégou, H. Unsupervised cross-lingual representation learning at scale. In Proceedings of the ACL 2020, Online, 5–10 July 2020; pp. 8440–8451. Available online: https://aclanthology.org/2020.acl-main.747/ (accessed on 7 May 2025).
  27. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421.
  28. Cotterell, R.; Kirov, C.; Sylak-Glassman, J.; Walther, G.; Vylomova, E.; McCarthy, A.D.; Kann, K.; Mielke, S.J.; Nicolai, G.; Silfverberg, M.; et al. The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task, Brussels, Belgium, 31 October–1 November 2018; pp. 1–27.
  29. Wu, S.; Cotterell, R.; Hulden, M. Applying the transformer to character-level transduction. arXiv 2020, arXiv:2005.10213.
  30. Gorman, K.; Ashby, L.F.E.; Goyzueta, A.; McCarthy, A.; Wu, S.; You, D. The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Online, 10 July 2020; pp. 40–50.
  31. Raškinis, G. Transliteration List of Foreign Person Names into Lithuanian v.1; CLARIN-LT: Kaunas, Lithuania, 2025; Available online: http://hdl.handle.net/20.500.11821/68 (accessed on 7 May 2025).
  32. Norkevičius, G.; Raškinis, G.; Kazlauskienė, A. Knowledge-based grapheme-to-phoneme conversion of Lithuanian words. In Proceedings of the SPECOM 2005, 10th International Conference Speech and Computer, Patras, Greece, 17–19 October 2005; pp. 235–238.
  33. Kazlauskienė, A.; Raškinis, G.; Vaičiūnas, A. Automatinis Lietuvių Kalbos žodžių Skiemenavimas, Kirčiavimas, Transkribavimas [Automatic Syllabification, Stress Assignment and Phonetic Transcription of Lithuanian Words]; Vytautas Magnus University: Kaunas, Lithuania, 2010; Available online: https://hdl.handle.net/20.500.12259/254 (accessed on 15 April 2025). (In Lithuanian)
  34. Kirčiuoklis—A Tool for Placing Stress Marks on Lithuanian Words. Available online: https://kalbu.vdu.lt/mokymosi-priemones/kirciuoklis/ (accessed on 24 April 2025).
  35. Novak, J.R.; Minematsu, N.; Hirose, K. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Nat. Lang. Eng. 2016, 22, 907–938.
  36. Taylor, P. Hidden Markov models for grapheme to phoneme conversion. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 1973–1976.
  37. Lee, J.L.; Ashby, L.F.E.; Garza, M.E.; Lee-Sikka, Y.; Miller, S.; Wong, A.; McCarthy, A.D.; Gorman, K. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4216–4221.
  38. Viterbi, A.J. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269.
  39. Roark, B.; Sproat, R.; Allauzen, C.; Riley, M.; Sorensen, J.; Tai, T. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Republic of Korea, 9–11 July 2012; pp. 61–66.
  40. Allauzen, C.; Riley, M.; Schalkwyk, J.; Skut, W.; Mohri, M. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the CIAA, Prague, Czech Republic, 16–18 July 2007; pp. 11–23.
  41. Gorman, K. Pynini: A Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, Berlin, Germany, 12 August 2016; pp. 75–80.
  42. Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. Fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, MN, USA, 2–7 June 2019; pp. 48–53.
  43. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
  44. Song, H.; Kim, M.; Park, D.; Shin, J. Learning from noisy labels with deep neural networks: A survey. arXiv 2020, arXiv:2007.08199. Available online: https://arxiv.org/abs/2007.08199 (accessed on 7 May 2025).
  45. Sohn, K.; Berthelot, D.; Li, C.; Zhang, Z.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Zhang, H.; Raffel, C. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv 2020, arXiv:2001.07685. Available online: https://arxiv.org/abs/2001.07685 (accessed on 7 May 2025).
Figure 1. Overview of the data processing pipeline from raw web data to trained transcription models.
Figure 2. Simplified pseudo-code of the word role and order normalization function, which outputs a normalized transcription pattern from four word stems.
Figure 3. Word error rate (WER) and stress-adjusted word error rate (WER-s) estimated on the test partitions for WFST-based models trained on single-word (a) and word-pair (b) datasets.
Table 1. Examples of personal names and IPA transcriptions provided by GPT-4o.

| No. | Original Person Name | LLM Answer (IPA Transcription) |
|---|---|---|
| 1 | Bill Michael | /bɪl ˈmaɪkəl/ |
| 2 | Michael Krüger | /ˈmɪçaˌeːl ˈkʁyːɡɐ/ |
| 3 | Florence Ballard | /ˈflɔːrəns ˈbælərd/ |
| 4 | Brendan Gilligan | /ˈbrɛn.dən ˈɡɪ.lɪ.ɡən/ |
| 5 | Mariah Carey | /məˈraɪ.ə ˈkɛɹ.i/ |
| 6 | Valerie Trierweiler | /va.le.ʁi tʁi.ɛʁ.vaj.lɛʁ/ |
| 7 | Sergio Mattarella | /ˈsɛrdʒo mattaˈrɛlla/ |
| 8 | Donald Tusk | /ˈdɔnalt tʊsk/ |
| 9 | Laurent’as Fabiusas | /lɐu̯ˈrʲɛn.tɐs ˈfɐ.bʲu.sɐs/ |
| 10 | Ralphas Fiennesas | /ˈral.fɐs ˈfʲɛn.nɛ.sɐs/ |
Table 2. Sample patterns extracted from web-scraped Lithuanian news texts.

| No. | Pattern, Occurrences | Remarks |
|---|---|---|
| 1 | Andrew Osagie (Didžioji Britanija), 1 | Unintended match: person (location) |
| 2 | Žygimantas Augustas (Vytautas Rumšas), 1 | Unintended match: person (actor) |
| 3 | Žygimantas Šeštokas (Utenos Juventus), 1 | Unintended match: person (sports team) |
| 4 | Andrius Bernotas (Tikėjimo Žodis), 1 | Unintended match: person (church) |
| 5 | Butano Karalystėje (Pietų Azija), 1 | Unintended match: location (location) |
| 6 | Baracko Obamos (Barako Obamos), 358 | Original first |
| 7 | Barakas Obama (Barack Obama), 590 | Role inversion, transcription first |
| 8 | Bušas Džordžas (George Bush), 1 | Word order inversion |
| 9 | Alicios Silverstone (Ališijos Silverstoun), 1 | Adaptation of original by fusion |
| 10 | Alastairis Bruce’as (Alisteris Briusas), 2 | Adaptation by concatenation |
| 11 | Čarlzą Darviną (Charles Darwin), 3 | Transcription in accusative |
| 12 | Čarlzas Darvinas (Charles Darwin), 9 | Transcription in nominative |
| 13 | Čarlzo Darvino (Charles Darwin), 10 | Transcription in genitive |
| 14 | Čarlzui Darvinui (Charles Darwin), 1 | Transcription in dative |
| 15 | Aleksiui Ciprui (Alexis Tsipras), 13 | Original in nominative, transcription in dative |
| 16 | Alberto Alonso (Albertas Alonsas), 1 | Non-inflected original, transcription in nominative |
| 17 | Caitlin Cahow (Keitlin Kahou), 2 | Transcription variation |
| 18 | Caitlin Cahow (Keitlin Kehou), 1 | Transcription variation |
| 19 | Caitlin Cahow (Ketlin Kahau), 2 | Transcription variation |
| 20 | Aleksis Cipras (Alexis Tspiras), 1 | Spelling error in original |
| 21 | Alavanei Ouatarai (Alassane Ouattara), 1 | Spelling errors in transcription |
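
The "unintended match" remarks in Table 2 suggest that patterns were collected with a shape-based rule of the form "two capitalized words followed by two capitalized words in parentheses". The following is a minimal sketch of such a rule, not the authors' extraction code: the regular expression, the `NAME` and `PATTERN` constants, the `extract_patterns` function, and the sample sentence are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# One word made of letters, optionally followed by an apostrophe suffix (e.g. Bruce’as).
NAME = r"[^\W\d_]+(?:[’'][^\W\d_]+)?"
# Hypothetical shape: "Word Word (Word Word)".
PATTERN = re.compile(rf"\b({NAME}) ({NAME}) \(({NAME}) ({NAME})\)")

def extract_patterns(text: str) -> List[Tuple[str, str]]:
    """Return (outside, inside) two-word pairs in which every word is capitalized."""
    found = []
    for m in PATTERN.finditer(text):
        words = m.groups()
        if all(w[0].isupper() for w in words):
            found.append((f"{words[0]} {words[1]}", f"{words[2]} {words[3]}"))
    return found

# Invented sample sentence built from names that appear in Table 2.
sample = "Prezidentas Barakas Obama (Barack Obama) ir Andrew Osagie (Didžioji Britanija)."
matches = extract_patterns(sample)
print(matches)
```

A purely shape-based rule like this one accepts "Andrew Osagie (Didžioji Britanija)" as readily as a genuine name pair, which is why later filtering is needed.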
Table 3. Complete list of raw patterns and their normalized versions related to the original name Alicia Keys.

| Raw Pattern, Occurrences | Operations Performed | Normalized Pattern, Occurrences |
|---|---|---|
| Alicia Keyes (Alicija Kis), 2 | Alignment failure, deleted | |
| Alicia Keys (Ališa Kis), 2 | | Alicia Keys (Ališa Kis), 2 |
| Alicia Keys (Ališija Kis), 8 | | Alicia Keys (Ališija Kis), 8 |
| Alicia Keys (Ališija Kys), 5 | | Alicia Keys (Ališija Kys), 7 |
| Ališija Kys (Alicia Keys), 2 | Role inverted, merged with preceding pattern | |
| Alicijos Kys (Alicia Keys), 1 | Role inverted, genitive to nominative | Alicia Keys (Alicija Kys), 1 |
| Alisa Kis (Alicia Keys), 1 | Role inverted | Alicia Keys (Alisa Kis), 1 |
| Ališa Kys (Alicia Keys), 1 | Role inverted | Alicia Keys (Ališa Kys), 1 |
| Alisija Kis (Alicia Keys), 1 | Role inverted | Alicia Keys (Alisija Kis), 1 |
| Alisija Kys (Alicia Keys), 1 | Role inverted | Alicia Keys (Alisija Kys), 1 |
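
The role-inversion and merging operations listed in Table 3 can be sketched as follows. This is a minimal illustration rather than the authors' procedure: it assumes the role of each side is already known (the hypothetical `transcription_first` flag), and it omits alignment checks and case normalization such as genitive to nominative.

```python
from collections import Counter
from typing import Iterable, Tuple

# (left side, side in parentheses, occurrences, transcription_first)
RawPattern = Tuple[str, str, int, bool]

def normalize(patterns: Iterable[RawPattern]) -> Counter:
    """Put the original name first and sum occurrences of identical pairs."""
    merged: Counter = Counter()
    for left, right, count, transcription_first in patterns:
        if transcription_first:
            original, transcription = right, left  # undo role inversion
        else:
            original, transcription = left, right
        merged[(original, transcription)] += count
    return merged

raw = [
    ("Alicia Keys", "Ališija Kys", 5, False),
    ("Ališija Kys", "Alicia Keys", 2, True),  # role inverted
    ("Alisa Kis", "Alicia Keys", 1, True),    # role inverted
]
normalized = normalize(raw)
print(normalized[("Alicia Keys", "Ališija Kys")])  # 7, matching the merged row in Table 3
```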
Table 4. The effects of statistical filtering on several frequently occurring names. Frequencies are shown in parentheses.

| Original | Kept Transcriptions | Discarded Transcriptions |
|---|---|---|
| Alicia | Ališija (15), Ališa (3), Alisija (2), Alicija (1), Alisa (1) | |
| Andrew | Endrius (238), Endriu (122) | Andriu (4), Andru (2), Andrevas (1), Andrju (1), Andrėjus (1) |
| Charles | Čarlzas (217), Šarlis (69), Čarlis (11), Čarlz (5) | Čarlesas (2), Čarlsas (1), Carlzas (1) |
| Marie | Mari (99), Meri (22), Marija (15), Merė (2) | Mary (1) |
| Michael | Maiklas (647), Michaelis (129), Mišelis (10), Mikaelis (7), Michaelas (6) | Michaela (1), Michailas (1), Michalas (1), Michelis (1) |
Table 5. Datasets for the sequence-to-sequence transcription task.

| Data Set | Description | Training Instances | Unique Instances | Oracle WER (%) |
|---|---|---|---|---|
| Raw data | Word pairs | 133,254 | 68,167 | |
| W1 | Single-word dataset: individual mappings of O1 → T1, O2 → T2 | 239,143 | 52,167 | 7.09 |
| W2 | Word-pair dataset: O1 O2 → T1 T2 | 118,149 | 51,429 | 5.39 |
| W2+R2 | W2 augmented with reversed pairs: O2 O1 → T2 T1 | 236,216 | 102,568 | 5.43 |
| W2+R2+W1 | W2+R2 further augmented with W1 | 472,432 | 152,979 | 6.29 |
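
The dataset variants described in Table 5 can be illustrated with a short sketch. It assumes each instance is an (original, transcription) pair of two-word strings. The function names `to_w1` and `to_r2` and the two sample pairs are hypothetical, and the authors' filtering, deduplication, and instance counts are not reproduced.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (original, transcription)

def to_w1(pairs: List[Pair]) -> List[Pair]:
    """Split each two-word pair into single-word mappings O1 -> T1, O2 -> T2."""
    out: List[Pair] = []
    for original, transcription in pairs:
        o_words, t_words = original.split(), transcription.split()
        if len(o_words) != len(t_words):
            continue  # skip pairs that cannot be matched word by word
        out.extend(zip(o_words, t_words))
    return out

def to_r2(pairs: List[Pair]) -> List[Pair]:
    """Reverse the word order on both sides: O2 O1 -> T2 T1."""
    return [
        (" ".join(reversed(o.split())), " ".join(reversed(t.split())))
        for o, t in pairs
    ]

w2 = [("Charles Darwin", "Čarlzas Darvinas"), ("Barack Obama", "Barakas Obama")]
w1 = to_w1(w2)            # single-word dataset
w2_r2 = w2 + to_r2(w2)    # word pairs plus reversed pairs
w2_r2_w1 = w2_r2 + w1     # further augmented with single words
```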
Table 6. Comparison of model accuracy on base and augmented transcription tasks. Estimated word error rate (WER) and stress-adjusted WER (WER-s), along with their 95% confidence intervals, are reported. The bold number indicates a statistically significant difference between the best and the second-best result.

| Task | Model | Best Hyperparameters | WER, % | WER-s, % |
|---|---|---|---|---|
| W1 | WFST | n = 7 | 40.19 | 24.08 |
| W1 | Encoder–decoder | BS = 1024, DOUT = 0.3 | 26.87 ± 0.43 | 21.56 ± 0.35 |
| W1 | Transformer | BS = 1024, DOUT = 0.3 | 25.49 ± 0.25 | 19.81 ± 0.22 |
| W2 | WFST | n = 8 | 23.67 | 16.91 |
| W2 | Encoder–decoder | BS = 1024, DOUT = 0.3 | 20.22 ± 0.4 | 16.38 ± 0.24 |
| W2 | Transformer | BS = 256, DOUT = 0.3 | 19.98 ± 0.19 | 16.19 ± 0.14 |
| W2+R2 | WFST | n = 9 | 24.29 | 17.97 |
| W2+R2 | Encoder–decoder | BS = 1024, DOUT = 0.3 | 19.81 ± 0.25 | 15.67 ± 0.24 |
| W2+R2 | Transformer | BS = 256, DOUT = 0.3 | 19.04 ± 0.18 | 15.66 ± 0.12 |
| W2+R2+W1 | WFST | n = 9 | 24.55 | 18.13 |
| W2+R2+W1 | Encoder–decoder | BS = 1024, DOUT = 0.3 | 19.91 ± 0.27 | 16.27 ± 0.3 |
| W2+R2+W1 | Transformer | BS = 1024, DOUT = 0.1 | 19.30 ± 0.17 | 16.16 ± 0.17 |
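
For orientation, the WER column of Table 6 can be related to the conventional edit-distance definition of word error rate. The sketch below implements that conventional definition only. Whether the authors computed WER in exactly this way is an assumption, and neither the stress-adjusted variant (WER-s) nor the confidence-interval procedure is covered. The sample references and hypotheses are invented.

```python
from typing import List, Sequence

def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER in percent: word edits divided by reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return 100.0 * errors / total

refs = ["Čarlzas Darvinas", "Barakas Obama"]
hyps = ["Čarlzas Darvinas", "Barakas Obamas"]
print(wer(refs, hyps))  # 25.0: one wrong word out of four
```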
Table 7. Categories of transcription errors.

| Error Category | Description |
|---|---|
| No error (0) | No mismatch. |
| Small (s) | Minor, likely imperceptible differences, such as stress-type changes affecting vowel length (e.g., Bãzas vs. Bàzas for “Buzz”), or substitution of similar vowels (e.g., Frỹdman vs. Frỹdmen for “Friedman”). |
| Medium (m) | Noticeable differences, such as incorrect stress placement (e.g., Solečìto vs. Solèčito for “Sollecito”) or mild pronunciation shifts (e.g., Òpra vs. Òupra for “Oprah”). |
| Large (l) | Prominent mismatches involving consonants, accented vowels, or the insertion/deletion/substitution of key phonetic elements (e.g., Marìja vs. Merãja, Sáid vs. Sìd, Kárl vs. Kárlos, Minãj vs. Minãž for “Mariah,” “Syd,” “Carlos,” and “Minaj,” respectively). |
Table 8. Cross-tabulation of reference and prediction mismatches with respect to the golden standard. False errors (green, 13%) occur when the model’s prediction matches the golden standard. When both the prediction and the reference differ from the golden standard, the prediction is better (light green, 4%), comparable (27.3%), or worse (light red, 9.5%) than the reference according to mismatch severity. True errors (red, 44%) occur when the reference matches the golden standard while the prediction does not.
Prediction Mismatch
0sml
02414133311
References105843940
Mismatchm252112726
l1318690
Raškinis, G.; Amilevičius, D.; Kalinauskaitė, D.; Mickus, A.; Vitkutė-Adžgauskienė, D.; Čenys, A.; Krilavičius, T. A Semi-Automatic Framework for Practical Transcription of Foreign Person Names in Lithuanian. Mathematics 2025, 13, 2107. https://doi.org/10.3390/math13132107
