Abstract
We present a semi-automatic framework for transcribing foreign personal names into Lithuanian, aimed at reducing pronunciation errors in text-to-speech systems. Focusing on noisy, web-crawled data, the pipeline combines rule-based filtering, morphological normalization, and manual stress annotation—the only non-automated step—to generate training data for character-level transcription models. We evaluate three approaches: a weighted finite-state transducer (WFST), an LSTM-based sequence-to-sequence model with attention, and a Transformer model optimized for character transduction. Results show that word-pair models outperform single-word models, with the Transformer achieving the best performance (19.04% WER) on a cleaned and augmented dataset. Data augmentation via word order reversal proved effective, while combining single-word and word-pair training offered limited gains. Despite filtering, residual noise persists, with 54% of outputs showing some error, though only 11% were perceptually significant.
Keywords:
practical transcription; character-level transduction; sequence-to-sequence learning; web-crawled data; Lithuanian
MSC:
68T20
1. Introduction
Monolingual text-to-speech (TTS) systems often face difficulties in accurately pronouncing text segments originating from languages other than the system’s target language. Personal names, geographical names, organizations, and other foreign-named entities are major sources of pronunciation errors in TTS. Since foreign words follow different grapheme-to-phoneme (G2P) conversion rules than those of the target language, they must first be identified within the text. Their orthography should then be modified in a way that enables the embedded G2P module to approximate their pronunciation using the phonemic inventory of the target language. This process is referred to as practical transcription [1,2].
This paper focuses on the process of building a data processing pipeline that takes data found on the web as an input and results in the computational models performing automatic transcription of personal names of foreign origin into Lithuanian, which serves as the target language. Several challenges complicate this task: (1) the variability in how initial morphological adaptation of foreign names is already performed in texts, (2) the need to infer both the position and the type of word stress, and (3) the requirement to consider broader linguistic context—extending beyond individual words—during decision-making.
Foreign personal names often undergo primary morphological adaptation to conform to Lithuanian grammatical structures. This is typically achieved by modifying the name and/or appending Lithuanian inflectional suffixes. The State Commission of the Lithuanian Language (State Commission of the Lithuanian Language is the body that issues normative recommendations for the usage of Lithuanian in public, see https://vlkk.lt/vlkk-nutarimai/protokoliniai-nutarimai/rekomendacija-del-autentisku-asmenvardziu-gramatinimo, accessed on 25 November 2024) prescribes different adaptation strategies depending on factors such as the source language, the gender of the person, and the morphological structure of the name. Consequently, texts processed by TTS systems may contain unaltered foreign names (e.g., Silverstone), names appended with Lithuanian suffixes (e.g., Alastairis, Bruce’as), or names transformed by fusion with Lithuanian morphological endings (e.g., Alicios, the genitive form of Alicia). The latter types of names are no longer tokens in the original language but mixed-language tokens. Despite these modifications, we refer to all such names as “original names,” as they appear in their raw form from the perspective of the TTS system.
Word stress is a critical perceptual feature in spoken Lithuanian, as it enhances the prominence of the stressed syllable. Lithuanian distinguishes between acute and circumflex accents, marked with specific diacritics over the stressed sound. However, such stress markings are generally absent from standard orthography and are only used in specialized texts where correct pronunciation is essential. This presents an additional challenge for transcription models, which must not only rewrite the text to align with Lithuanian G2P rules but also predict the correct stress position and accent type.
A further complication arises from the fact that transcription is often ambiguous or exhibits one-to-many mappings, particularly when the origin language of a name is unknown. For example, the name Charles may be transcribed as Čárlzas if the individual is English, or Šárlis if French. Similarly, Michael may become Máiklas (English) or Michaèlis (German). One approach to resolving such ambiguities is to incorporate a broader linguistic context—potentially involving multiple adjacent foreign words—into the transcription process.
Several approaches can be considered to address the multilingual proper name transcription problem, which we categorize as follows: (1) the multi-stage approach, (2) the Generative AI approach, and (3) the end-to-end machine learning approach.
1.1. Multi-Stage Approach
In the multi-stage approach, the transcription task is divided into several sequential steps: (i) identification of the source language of loanwords, (ii) text rewriting based on the identified source language, and (iii) stress placement. This approach has certain advantages. First, the initial step can be framed as a variant of the well-established language identification (LangID) task, for which numerous methods exist, including Naïve Bayes trained on n-grams [3], using sub-word features and vector quantization [4], neural networks [5], transformers [6], and LSTM-based models [7,8]. Second, the rewriting step can potentially achieve high accuracy if it leverages a comprehensive set of deterministic rules for adapting foreign names in the target language.
However, this approach also has significant limitations:
- Proper names are very short text segments, and LangID performance is known to degrade on short text segments [9,10].
- The TTS system needs to identify and to transcribe mixed-language tokens. This poses a challenge, as no off-the-shelf LangID tools are designed to handle such inputs. Addressing this requires developing a specialized LangID system, for which neither pre-defined rules nor training data currently exist.
- For many languages, including Lithuanian, no complete and explicit rule set exists for adapting foreign names—only broad linguistic guidelines are available. Constructing such a rule set would require substantial linguistic expertise and manual effort.
- Error propagation is inherent to multi-stage approaches: mistakes in earlier stages (e.g., misidentified language) are likely to impact subsequent steps, thereby degrading the overall system performance.
1.2. Generative AI Approach
In the Generative AI approach, the transcription task is divided into two stages: (i) querying a large language model (LLM) to generate the International Phonetic Alphabet [11] (IPA) transcription of a given personal name, and (ii) converting the resulting IPA representation into Lithuanian orthography using a specialized IPA-to-grapheme converter. The main advantages of this approach include the wide availability of pre-trained LLMs and the relative ease of developing an IPA-to-grapheme converter, which can be based on accurate and deterministic phoneme-to-letter conversion rules.
Table 1 presents several examples of original names and their corresponding IPA transcriptions, as generated by GPT-4o [12], in response to queries such as “What is the IPA pronunciation of [person’s name]?” and “Please transcribe the following name in IPA symbols: [person’s name].”
Table 1.
Examples of personal names and IPA transcriptions provided by GPT-4o.
As seen in rows 1–8, the LLM generally provides accurate transcriptions for unmodified personal names, regardless of the language of origin. However, for mixed-language tokens that have already undergone some degree of morphological adaptation—common in Lithuanian texts—such as in rows 9 and 10, the model fails to produce correct IPA forms (the acceptable transcriptions of Laurent’as Fabiusas and Ralphas Fiennesas are /lʲɔˈra:nɐs fɐˈbʲʊsɐs/ and /ˈreɪfɐs ˈfaɪnzɐs/, respectively). This limitation indicates that pre-trained LLMs alone may not be sufficient for the transcription task, particularly when handling non-standard forms. Fine-tuning [13] or instruction-tuning LLMs [14,15] could potentially improve their performance on the task in question.
1.3. End-to-End Machine Learning Approach
In the end-to-end machine learning approach, the transcription task is performed in a single step, following a preparatory (offline) phase in which a model is trained on a dataset of input–output pairs—original names and their corresponding transcriptions. The trained model effectively encodes a deterministic rule set that implicitly integrates language identification, text rewriting, and word stress placement.
The primary advantage of this approach is that, once trained, the model can be directly applied to previously unseen inputs with minimal additional processing. However, there are several disadvantages: (i) substantial effort may be required to compile a high-quality, representative, and sufficiently large training dataset; (ii) the learned rules are typically encoded in model parameters (e.g., neural network weights), which are not easily interpretable or modifiable; and (iii) any errors or biases in the training data may be propagated throughout the model’s predictions, with limited opportunities for targeted correction.
While acknowledging the validity of the first two approaches, described in Section 1.1 and Section 1.2, we adopt the third approach and formulate transcription as a supervised end-to-end machine learning problem. This choice is based on the assumption that it is feasible to collect a sufficiently large and representative dataset of personal names—originating from various source languages—and their corresponding Lithuanian transcriptions from online sources. Our objective is to investigate methods for cleaning and structuring this data, as well as to train and evaluate different models capable of learning the mapping from original to transcribed forms.
This research seeks to address the following key questions:
- Is it possible to develop a fully automated data processing pipeline that converts raw web-crawled data into a dataset of sufficient quality for training practical transcription models?
- What level of transcription accuracy can be achieved using such automatically generated data, and how does this accuracy compare to that of human transcribers?
- To what extent do models trained on word pairs outperform those trained on single-word inputs?
- How effective are different data augmentation strategies in enhancing the performance of transcription models?
To the best of our knowledge, the task of training end-to-end models for the transcription of foreign words into Lithuanian has not been addressed in previous research.
1.4. Related Work
Early approaches to translation and transcription relied on handcrafted, context-dependent phonological rules informed by linguistic expertise [16,17,18]. While these methods offered transparency and control, they were labor-intensive and lacked robustness across language domains [19]. The shift toward data-driven techniques began with statistical machine translation models such as n-gram and maximum entropy models [20], which leveraged aligned parallel translation corpora. However, these approaches struggled in low-resource settings due to data sparsity.
Modern translation methods primarily utilize sequence-to-sequence (seq2seq) models, which revolutionized the handling of variable-length input and output sequences. The introduction of encoder–decoder architectures based on recurrent neural networks (RNNs) by Sutskever et al. [21] enabled end-to-end learning for machine translation tasks. The subsequent addition of attention mechanisms by Bahdanau et al. [22] significantly improved performance by allowing models to dynamically focus on relevant parts of the input during decoding.
Due to the limitations of RNNs in modeling long-range dependencies, alternative architectures emerged, including convolutional neural networks [23] and self-attention-based models. The Transformer model introduced by Vaswani et al. [24] replaced recurrence with multi-head self-attention, enabling parallelization and achieving state-of-the-art results across various translation tasks. These models also support transfer learning across scripts and languages, with multilingual pretrained encoders such as mBERT and XLM-R proving especially effective [25,26].
Character-level translation, such as the transcription task described in this paper, is still largely dominated by attention-based LSTM seq2seq models [27,28], largely because the task typically involves monotonic alignments between input and output sequences that share significant character-level similarity. Unlike tasks requiring semantic understanding or long-range dependencies, transcription benefits less from the Transformer’s capacity to model long-range dependencies. Nevertheless, Ref. [29] demonstrated that transformer performance on character-level tasks is highly sensitive to batch size, and that, given sufficiently large batches, transformers can outperform RNN-based models even in these settings.
In this research, we train seq2seq models—including LSTM-based encoder–decoder architectures [28] and transformers [29], both of which have shown strong results in multilingual grapheme-to-phoneme (G2P) tasks [30]—to solve the transcription problem. We also investigate the effects of data augmentation, which has been shown to improve model accuracy in related tasks. Differently from Ref. [30], our task presents additional challenges: we begin with a much larger and noisier dataset, must handle conflicting transcription labels, and incorporate stress marker prediction, adding complexity to both data preprocessing and model training.
The main contributions of this paper are as follows:
- We propose a novel semi-automatic data processing pipeline that transforms raw web-crawled data into a training set for practical transcription tasks. This pipeline is demonstrated on Lithuanian—a morphologically rich and challenging language—where it effectively processes, filters, and normalizes data containing inflected and/or mixed-language tokens. The pipeline is potentially fully automatable for languages that do not require modeling of word stress type and location.
- We show that an end-to-end practical transcription model can be trained using this dataset, despite residual noise in the processed data. Although exact word error rate (WER) estimation is hindered by noisy reference labels in the test set, we report an upper-bound WER of approximately 19%, with the actual performance likely being significantly better—potentially nearly half that value.
2. Materials and Methods
2.1. Data
We hypothesized that data for the transcription task could be collected from online sources. It has been observed that Lithuanian news portals frequently present foreign person names in their original form followed by a transcription (intended pronunciation) in parentheses. This practice helps readers unfamiliar with foreign orthography approximate the correct pronunciation.
To leverage this pattern, we performed web scraping across major Lithuanian news portals, collecting 133,254 text segments (68,167 unique patterns) from public articles spanning the past ten years. The data was extracted using a Perl-style regular expression designed to capture two title-cased words followed by another pair of title-cased words enclosed in parentheses (the raw material is made openly available and can be downloaded from Clarin LT repository at https://clarin.vdu.lt/xmlui/handle/20.500.11821/68, accessed on 7 May 2025).
[[:upper:]][-\x27[:lower:]]+\s+[[:upper:]][-\x27[:lower:]]+\s+\([[:upper:]][-\x27[:lower:]]+\s+[[:upper:]][-\x27[:lower:]]+\s*\)
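For readers who prefer a runnable form, a rough Python `re` equivalent of the POSIX pattern above is sketched below. The uppercase/lowercase ranges are an assumed approximation of the `[[:upper:]]`/`[[:lower:]]` classes for Lithuanian text, and the sample sentence is purely illustrative.

```python
import re

# Rough Python equivalent of the crawling pattern (illustrative only;
# the letter ranges below are an assumed approximation of the POSIX
# [[:upper:]]/[[:lower:]] classes for Lithuanian text).
UPPER = "A-ZĄČĘĖĮŠŲŪŽ"
LOWER = "a-ząčęėįšųūž"
WORD = rf"[{UPPER}][-'{LOWER}]+"
PATTERN = re.compile(rf"{WORD}\s+{WORD}\s+\(\s*{WORD}\s+{WORD}\s*\)")

text = "Premjeras susitiko su Alexis Tsipras (Aleksis Cipras) vakar."
match = PATTERN.search(text)
print(match.group(0))  # → "Alexis Tsipras (Aleksis Cipras)"
```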
This raw dataset [31], however, proved to be noisy and ambiguous, requiring extensive preprocessing before it could be used for model training. The main sources of noise included:
- Unintended Matches: The regular expression sometimes captured pairs of proper nouns that were not person names and their transcriptions, but rather unrelated entities, such as person–location, person–actor, person–sports team, or location–location combinations (see Table 2, rows 1–5).
Table 2.
Sample patterns extracted from web-scraped Lithuanian news texts.
- Word Order Inversion: The word order in one of the name pairs was sometimes reversed relative to the other (see Table 2, row 8).
- Inflection Mismatch: The original name with concatenated or fused inflectional suffixes could be in a different grammatical case than the transcribed name. For instance, the original name Alexis Tsipras is in the nominative, while the transcription Aleksiui Ciprui is in the dative case (Table 2, row 15). These mismatches were particularly challenging when intertwined with role inversion. For example, the name pair Alberto Alonso (Table 2, row 16) could be interpreted as a non-inflected original form, an adapted original form (e.g., the genitive of Albert Alons), or a transcription in the genitive case.
- Multiple Inflections: A single non-inflected original name could correspond to multiple transcriptions, each in a different grammatical case depending on the context (Table 2, rows 11–14).
- Inconsistent Labeling and Human Errors: Different transcriptions of the same original name were observed due to the varying linguistic knowledge of human editors (Table 2, rows 17–19). Additionally, spelling errors appeared in both original names (Table 2, row 20) and transcriptions (Table 2, row 21).
2.2. Method
Our transcription system follows a structured data processing pipeline, as illustrated in Figure 1. It begins with web-scraped textual data and ends with computational models capable of automatically transcribing personal names of foreign origin into Lithuanian.
Figure 1.
Overview of the data processing pipeline from raw web data to trained transcription models.
The methodology consists of several sequential steps:
- Preprocessing and cleaning of web-scraped data, including filtering, normalization, and reordering of raw data patterns.
- Adding word stress location and type information to the cleaned data.
- Generation of multiple training sets, based on different configurations and augmentation strategies.
- Training of machine learning models using a range of architectures and training setups, and evaluation of their accuracy on held-out test data.
2.2.1. Data Preprocessing
Preprocessing was essential to address the various noise sources described in Section 2.1 and to prepare a clean, structured dataset for model training.
The first preprocessing step involved alignment of original and transcribed strings, which served as a filtering mechanism for the raw patterns. This alignment assumed that valid transcriptions share structural and phonetic similarities—specifically, that the two strings should be composed of corresponding characters or character groups.
For instance, Table 2 (rows 1–5) illustrates clearly invalid alignments, where the initial capital letters of the first and second words in each pair do not match the capitals in the parenthetical pair. However, this matching should not be interpreted as strict character identity, since valid transcriptions often include predictable orthographic adaptations (see rows 11, 15, and 17); for instance, matching the substrings Ch, Ts, C, and G to Č, C, K, and Dž, respectively, should be permitted.
To accommodate this, we implemented a dynamic programming alignment procedure called Match(). It identifies correspondences between the original and transcribed forms based on approximately 600 permitted edit operations, including both context-free and context-sensitive substitutions. These operations were non-symmetric—for example, replacing “ault” with “o” is allowed (as in Renault → Reno), but the reverse is not permitted, since such transformations are not linguistically plausible in the Lithuanian context.
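The core idea of Match() can be sketched as a dynamic program over permitted rewrite operations. The operation set below is a tiny hypothetical subset of the roughly 600 operations used in the real system, shown only to illustrate the non-symmetric alignment logic.

```python
from functools import lru_cache

# Minimal sketch of the Match() alignment: a DP over permitted,
# non-symmetric rewrite operations. OPS is a tiny hypothetical subset
# of the ~600 operations used in the real system.
OPS = {("ault", "o"), ("ch", "č"), ("x", "ks"), ("c", "k"), ("w", "v")}

def align_match(src: str, dst: str) -> bool:
    """True if src can be rewritten into dst using identity copies
    or the permitted substitutions in OPS."""
    src, dst = src.lower(), dst.lower()

    @lru_cache(maxsize=None)
    def align(i: int, j: int) -> bool:
        if i == len(src) and j == len(dst):
            return True
        # identity copy of one character
        if i < len(src) and j < len(dst) and src[i] == dst[j] and align(i + 1, j + 1):
            return True
        # permitted multi-character, non-symmetric substitutions
        return any(
            src.startswith(a, i) and dst.startswith(b, j) and align(i + len(a), j + len(b))
            for a, b in OPS
        )

    return align(0, 0)

print(align_match("Renault", "Reno"))   # → True  (via "ault" → "o")
print(align_match("Trump", "Berlin"))   # → False
```

Because the operations are directional, the sketch accepts Renault → Reno but would reject the reverse rewriting, mirroring the constraint described above.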
Alignment was further complicated by morphological variation: both the original and transcribed names could appear in various inflected forms, sometimes with differing grammatical cases and suffixes (e.g., Table 2, rows 11 and 15). To mitigate this, morphological stemming was applied before alignment, reducing both strings to their stemmed forms. Due to possible ambiguities in stemming, a search over the space of candidate stem pairs was necessary.
The simplified pseudo-code below (see Figure 2) illustrates the logic of the NormalizeWordOrder() function. This function takes four word stems: s1, s2, s3, and s4, and infers the roles (original or transcription) and the correct word order based on successful alignments.
Figure 2.
Simplified pseudo-code of the word role and order normalization function, which outputs a normalized transcription pattern from four word stems.
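The role and order inference can be sketched in Python as follows; `align_ok` is a hypothetical stand-in for the Match() alignment routine, and the toy matcher used in the example checks only first letters.

```python
# Sketch of NormalizeWordOrder(): try the plausible role/order assignments
# of the four stems and keep the first one whose two alignments both
# succeed. align_ok is a hypothetical stand-in for the Match() routine.
def normalize_word_order(s1, s2, s3, s4, align_ok):
    candidates = [
        ((s1, s2), (s3, s4)),  # s1 s2 original, s3 s4 transcription
        ((s3, s4), (s1, s2)),  # roles inverted
        ((s1, s2), (s4, s3)),  # word order inverted in transcription
        ((s3, s4), (s2, s1)),  # both roles and word order inverted
    ]
    for (o1, o2), (t1, t2) in candidates:
        if align_ok(o1, t1) and align_ok(o2, t2):
            return o1, o2, t1, t2
    return None  # no consistent reading: reject the pattern

# usage with a toy first-letter matcher (t may map to c, as in Tsipras → Cipras):
toy = lambda a, b: (a[0].lower() == b[0].lower()
                    or (a[0].lower(), b[0].lower()) in {("t", "c")})
result = normalize_word_order("Alexis", "Tsipras", "Cipras", "Aleksis", toy)
print(result)  # → ('Alexis', 'Tsipras', 'Aleksis', 'Cipras')
```

Here the transcription pair arrives in reversed order and is restored to the canonical O1 O2 → T1 T2 layout.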
Pattern filtering based on the alignment between the original text and its transcription helped eliminate incorrect transcription patterns and normalize instances of role and word order inversion. It also contributed to filtering out some patterns containing spelling errors (e.g., rows 20–21 in Table 2). However, many inconsistent transcriptions remained (see Table 3). This was largely due to the aligner having too much flexibility, allowing all types of editing operations to be applied simultaneously—even when those operations were more appropriate for different languages.
Table 3.
Complete list of raw patterns and their normalized versions related to the original name Alicia Keys.
The process of morphological analysis, inflectional paradigm inference, and case normalization significantly reduced both inflectional mismatches and the occurrence of multiple transcriptions derived from a single original name—common in Lithuanian due to the presence of inflectional suffixes. This analysis was performed using purpose-built tools that relied on inflectional endings to make decisions.
As a result of this step, the cases listed in Table 2 (rows 11–14) were consolidated into a single normalized form: Charles Darwin (Čarlz Darvin), appearing 23 times in total.
To further minimize the issue of one-to-many transcription mappings, statistical pattern filtering was applied. A simple rule was introduced: transcriptions occurring less frequently than a specified threshold were discarded (see Table 4). This threshold was empirically set to 5% of the total number of occurrences of a given original name. The goal was to strike a balance between removing as much human-induced noise as possible while preserving legitimate transcription variants found across different languages.
Table 4.
The effects of statistical filtering on several frequently occurring names. Frequencies are shown in parentheses.
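The 5% thresholding rule can be sketched as follows; the name counts in the example are illustrative, not taken from the actual dataset.

```python
from collections import Counter

# Sketch of the statistical filter: a transcription variant is kept only
# if it accounts for at least 5% of its original name's total occurrences.
THRESHOLD = 0.05

def filter_variants(pairs):
    pairs = list(pairs)  # (original, transcription) instances, with repetition
    totals = Counter(orig for orig, _ in pairs)
    counts = Counter(pairs)
    return {
        (orig, tr): n for (orig, tr), n in counts.items()
        if n >= THRESHOLD * totals[orig]
    }

data = ([("Charles Darwin", "Čarlz Darvin")] * 23
        + [("Charles Darwin", "Šarl Darvin")] * 1)   # 1/24 ≈ 4.2% → dropped
kept = filter_variants(data)
print(kept)  # → {('Charles Darwin', 'Čarlz Darvin'): 23}
```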
2.2.2. Gathering Stress Data
In Lithuanian, stress placement and type are determined by the morphological properties of a word form. While there are algorithms that can infer stress location and type from Lithuanian orthography [32,33], and practical implementations are available [34], these techniques are dictionary-based and require that word forms be annotated with their morphological properties and accentuation paradigms. However, both the original and the transcribed word forms in our study fall outside the scope of standard Lithuanian dictionaries.
As a result, stress annotation had to be performed manually, making it the only step that prevents our data processing pipeline from being fully automatic. Linguists were instructed to place stress marks on the transcribed forms to best reflect the pronunciation of the original language. To speed up the process, stress was annotated only on isolated, single-word transcriptions in the nominative case. These annotations were then algorithmically propagated to corresponding word-pair patterns and other inflected forms.
This manual annotation process inevitably introduced some noise into the data. The linguists were not proficient in the pronunciation and accentuation of all source languages encountered in the dataset, and in many cases had to choose between multiple plausible options. For example, the name Marielė, which may originate from English (Mariel) or French (Marielle), would require different stress placements—Mãrielė for the English version and Marièlė for the French one. Without context, such distinctions were impossible to make.
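The algorithmic propagation of manual annotations can be illustrated with a simple positional heuristic. This is a sketch under a simplifying assumption (copy the stress mark at the same character offset); the actual propagation follows the inflectional paradigm.

```python
import unicodedata

# Grave, acute, and tilde: the three Lithuanian stress diacritics.
STRESS = {"\u0300", "\u0301", "\u0303"}

def propagate_stress(stressed_nom: str, inflected: str) -> str:
    """Copy the stress mark of a stressed nominative form onto an
    unstressed inflected form at the same character offset (a simplifying
    assumption; real propagation is paradigm-aware)."""
    nfd = unicodedata.normalize("NFD", stressed_nom)
    pos = next(i for i, c in enumerate(nfd) if c in STRESS)
    mark = nfd[pos]
    out = unicodedata.normalize("NFD", inflected)
    return unicodedata.normalize("NFC", out[:pos] + mark + out[pos:])

print(propagate_stress("Luĩsas", "Luiso"))  # → "Luĩso"
```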
2.2.3. Generating Training and Test Sets
The filtered, normalized, and stressed data comprised 118,149 total patterns (51,429 unique ones), each of the form O1 O2 → T1 T2, where O1 and O2 are original personal names—written either with or without primary morphological adaptation—and T1 and T2 are their respective transcriptions.
Two base datasets, W1 and W2, were derived from this normalized data, as shown in Table 5. To evaluate the effectiveness of data augmentation in improving transcription accuracy, two additional augmented datasets—W2+R2 and W2+R2+W1—were created.
Table 5.
Datasets for the sequence-to-sequence transcription task.
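Assuming the R2 augmentation reverses the word order on both the input and the output side of each word-pair instance (our reading of the word-order-reversal strategy), its construction can be sketched as:

```python
# Sketch of the R2 augmentation: every word-pair instance contributes an
# additional instance with the word order reversed on both sides
# (an assumed reading of the word-order-reversal strategy).
def reverse_augment(pairs):
    return list(pairs) + [((o2, o1), (t2, t1)) for (o1, o2), (t1, t2) in pairs]

w2 = [(("Charles", "Darwin"), ("Čarlz", "Darvin"))]
w2_r2 = reverse_augment(w2)
print(w2_r2[1])  # → (('Darwin', 'Charles'), ('Darvin', 'Čarlz'))
```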
We chose to retain conflicting transcription labels, as automatically removing them would have required excluding instances from less frequently used languages. Importantly, the frequency with which a pattern appears in the dataset is likely correlated with the correctness of its transcription. Therefore, we opted not to reduce the datasets to unique patterns. Instead, instances were preserved with their original frequencies, allowing repetition.
Because of these conflicting labels, the maximum achievable accuracy is less than 100%. To estimate an upper bound, we calculated the Oracle accuracy, which assumes perfect prediction for non-conflicting cases and selects the most frequent label in cases of conflict (see Table 5).
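The Oracle accuracy described above amounts to crediting each original name with its majority label; the instance counts in the example below are illustrative only.

```python
from collections import Counter

# Oracle accuracy: perfect prediction for non-conflicting originals; for
# conflicting originals, at best the most frequent label can be predicted.
def oracle_accuracy(instances):
    by_orig = {}
    for orig, tr in instances:
        by_orig.setdefault(orig, Counter())[tr] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_orig.values())
    return correct / len(instances)

# toy instance counts (illustrative, not the paper's data)
data = ([("Charles", "Čarlzas")] * 8 + [("Charles", "Šarlis")] * 2
        + [("Darwin", "Darvinas")] * 5)
print(oracle_accuracy(data))  # → 0.8666... (13 of 15 instances)
```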
All datasets were randomly split into training (90%) and testing (10%) subsets. To maintain consistency, the split ensured that not only identical patterns but also all inflectional variants of a given pattern remained within the same partition. For instance, the accusative and genitive forms

Louisą Zamperini → Luĩsą Zamperìni
Louiso Zamperini → Luĩso Zamperìnio

were placed in the same partition of the W2 dataset. However, this constraint was not enforced at the single-word level; thus, data instances like O1 O2 → T1 T2 and O1 O3 → T1 T3 could appear in the same split even if they shared the same sub-component O1 → T1.
The W2 dataset was partitioned prior to augmentation, ensuring that W2, W2+R2, and W2+R2+W1 shared the same patterns in training and test subsets. This allowed for fair comparisons between non-augmented and augmented models, all evaluated using W2’s test set.
In all experiments, we encoded Lithuanian stress markers as unique symbols appended after the stressed character. Before training, each instance was split into individual characters. The input symbol set contained 64 unique characters, while the output symbol set contained 38.
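The stress-marker encoding can be sketched via Unicode decomposition. This is a sketch under the assumption that acute, grave, and tilde are the only detachable marks; other diacritics that belong to the orthography itself (e.g., the ogonek in ą) stay attached to their base character.

```python
import unicodedata

# Stress diacritics become standalone symbols after the stressed
# character; orthographic diacritics (e.g., ogonek) stay attached.
STRESS = {"\u0300", "\u0301", "\u0303"}  # grave, acute, tilde

def encode(word: str) -> list[str]:
    out = []
    for ch in unicodedata.normalize("NFD", word):
        if ch in STRESS:
            out.append(ch)  # emit stress mark as a separate symbol
        elif out and unicodedata.combining(ch):
            # reattach non-stress combining marks (e.g., ogonek in ą)
            out[-1] = unicodedata.normalize("NFC", out[-1] + ch)
        else:
            out.append(ch)
    return out

print(encode("Luĩsą"))  # → ['L', 'u', 'i', '\u0303', 's', 'ą']
```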
2.2.4. Training Transcription Models
We trained three different transcription models, each representing a distinct approach to sequence modeling: a symbolic model based on weighted finite-state transducers (WFSTs), a recurrent neural network (RNN)-based sequence-to-sequence (seq2seq) model, and a Transformer-based seq2seq model. These models were selected to explore the trade-offs between model interpretability, training efficiency, and transcription accuracy.
Our first model was a weighted finite-state transducer (WFST), a symbolic approach rooted in probabilistic automata. The model architecture follows the pair n-gram framework [35] and builds on earlier work such as the hidden Markov model-based G2P framework [36]. As in Lee et al. [37], transcriptions are generated by composing the input string with a trained transducer, which yields a lattice of possible output sequences annotated with their associated probabilities. The best hypothesis is then selected using Viterbi decoding [38].
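The lattice decoding step can be illustrated with a pure-Python toy: candidate output symbols per position are scored under an assumed bigram model, and the best-scoring path is selected. All probabilities below are hypothetical; in the real system they are arc weights of the trained pair n-gram transducer.

```python
import math

# Toy Viterbi decode over a transcription lattice (scores hypothetical;
# a real pair n-gram WFST encodes these as arc weights).
lattice = [["č", "k"], ["a"], ["r", "l"]]   # candidate symbols per position
bigram = {                                   # assumed P(next | previous)
    ("<s>", "č"): 0.7, ("<s>", "k"): 0.3,
    ("č", "a"): 0.9, ("k", "a"): 0.5,
    ("a", "r"): 0.8, ("a", "l"): 0.2,
}

def viterbi(lattice, bigram):
    # prev maps last symbol -> (best path ending in it, its log-score)
    prev = {"<s>": ([], 0.0)}
    for candidates in lattice:
        cur = {}
        for c in candidates:
            cur[c] = max(
                ((hist + [c], s + math.log(bigram.get((last, c), 1e-9)))
                 for last, (hist, s) in prev.items()),
                key=lambda x: x[1],
            )
        prev = cur
    best_path, _ = max(prev.values(), key=lambda x: x[1])
    return "".join(best_path)

print(viterbi(lattice, bigram))  # → "čar"
```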
We used the OpenGrm [39] toolkit and the OpenFst-based [40] Pynini library [41] to train the transducer. Different n-gram orders (n = 2 to 10) were evaluated to balance context sensitivity with model complexity.
The second model we explored is a neural seq2seq architecture based on long short-term memory (LSTM) units. The encoder consisted of a single bidirectional LSTM layer, while the decoder was a single unidirectional LSTM. An attention mechanism [27] bridges the encoder and decoder, enabling the model to dynamically focus on relevant parts of the input sequence during decoding.
We used the Fairseq [42] implementation provided by Ref. [30], with default hyperparameter settings for key training parameters such as learning rate, weight decay, gradient clipping, and label smoothing [43]. For training, we used an Adam optimizer with inverse square root learning rate scheduling and applied early stopping based on validation loss.
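The inverse square root schedule can be sketched as below; the warmup length and peak learning rate are illustrative placeholders, not the Fairseq defaults we actually used.

```python
# Sketch of the inverse square root learning-rate schedule: linear warmup
# to peak_lr, then decay proportional to 1/sqrt(step).
# (warmup_steps and peak_lr here are illustrative, not the paper's values)
def inverse_sqrt_lr(step: int, peak_lr: float = 1e-3, warmup_steps: int = 4000) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    return peak_lr * (warmup_steps / step) ** 0.5     # inverse sqrt decay

print(inverse_sqrt_lr(4000))   # → 0.001
print(inverse_sqrt_lr(16000))  # → 0.0005
```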
Our third model was a Transformer-based neural seq2seq model, optimized for character-level transduction and described by Wu et al. [29]. This architecture replaces recurrence with multi-head self-attention mechanisms and position-wise feedforward layers, enabling efficient parallel training and improved modeling of long-range dependencies.
We adopted a four-layer Transformer architecture for both the encoder and the decoder, each using pre-layer normalization to improve training stability. Again, we used the Fairseq implementation provided by Ref. [30]. Dropout and label smoothing were applied to mitigate overfitting.
For both neural models and for each data set, we conducted hyperparameter tuning on a held-out development set. The following parameters were adjusted: the dimensionality of the encoder embedding layer (EEL), encoder hidden layer (EHL), decoder embedding layer (DEL), and decoder hidden layer (DHL), along with the dropout rate (DOUT) and batch size (BSIZE). Grid search was used to identify optimal configurations. The beam search decoding algorithm was employed with beam sizes ranging from 3 to 10, depending on the model size and sequence complexity. Hyperparameter settings were kept consistent across models as much as possible to ensure fair comparison. Detailed results of the ablation experiments for the Transformer model are provided in Appendix A.
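The grid search over the tuned parameters can be sketched as follows; the candidate values are illustrative placeholders, not the actual search space.

```python
import itertools

# Illustrative grid over the tuned hyperparameters; in practice each
# configuration is trained and scored on the held-out development set.
grid = {
    "EEL": [128, 256], "EHL": [512, 1024],
    "DEL": [128, 256], "DHL": [512, 1024],
    "DOUT": [0.1, 0.3], "BSIZE": [256, 1024],
}
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))  # → 64
```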
2.3. Evaluation Metrics
To evaluate and compare the performance of the transcription models, we employed word error rate (WER) as the primary metric. WER was defined as the proportion of word instances in which the predicted transcription differed from the reference transcription, with a lower WER indicating superior model performance. In the calculation of WER, each word-pair instance was treated as comprising two discrete word forms.
To further disentangle the contribution of stress placement errors from other symbol-level errors within the WER, we introduced an auxiliary metric termed the stress-compensated word error rate (WER-s). The WER-s is computed analogously to the WER, but after all stress markers have been removed from both the predicted and the reference transcriptions. The difference between the WER and the WER-s thus reflects the proportion of total errors attributable exclusively to incorrect stress placement (i.e., position and/or type).
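The two metrics can be sketched as follows, taking the stress marks to be the acute, grave, and tilde combining diacritics:

```python
import unicodedata

STRESS = {"\u0300", "\u0301", "\u0303"}  # grave, acute, tilde stress marks

def strip_stress(word: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFD", word) if c not in STRESS)

def wer(pred: list[str], ref: list[str], ignore_stress: bool = False) -> float:
    """Proportion of word forms whose prediction differs from the reference."""
    if ignore_stress:  # WER-s: compare after removing stress markers
        pred = [strip_stress(w) for w in pred]
        ref = [strip_stress(w) for w in ref]
    return sum(p != r for p, r in zip(pred, ref)) / len(ref)

ref = ["Luĩsa", "Zamperìni"]
hyp = ["Luisa", "Zamperìni"]              # stress missing on the first word
print(wer(hyp, ref))                       # → 0.5  (WER)
print(wer(hyp, ref, ignore_stress=True))   # → 0.0  (WER-s)
```

In this toy case the entire WER is attributable to stress placement, so WER-s drops to zero.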
3. Results
3.1. Model Accuracy
Weighted finite-state transducer, neural encoder–decoder, and transformer models were successfully trained. On the training sets, the word error rates (WERs) achieved by these models were close to the oracle WER values reported in Table 5. However, their performance on the test sets showed significantly higher WERs, indicating a drop in generalization. Table 6 presents the best results obtained for each model type and dataset, based on a search over the respective hyperparameter spaces.
Table 6.
Comparison of model accuracy on base and augmented transcription tasks. Estimated word error rate (WER) and stress-adjusted WER (WER-s), along with their 95% confidence intervals, are reported. The bold number indicates a statistically significant difference between the best and the second-best result.
The WFST model demonstrated the lowest performance among the evaluated approaches. The considerable gap between the WER and the WER-s (see Figure 3) indicates that it struggled particularly with accurately predicting stress location and type compared to the neural models. Detailed analysis of its output revealed frequent issues, such as missing or multiple stress markers, even though each training instance contained exactly one stress marker. Additionally, data augmentation negatively affected the WFST model, likely due to increased distributional mismatch between the training and test data.
Figure 3.
Word error rate (WER) and stress-adjusted word error rate (WER-s) estimated on the test partitions for WFST-based models trained on single-word (a) and word-pair (b) datasets.
In contrast, both neural models—encoder–decoder and transformer—performed significantly better. Across all training sets, the best or near-best results were achieved using a smaller encoder (EEL = 128, EHL = 512) and a larger decoder (DEL = 256, DHL = 1024).
Overall, the transformer model outperformed the encoder–decoder model, particularly in the single-word transcription task (W1), where the differences in WER and WER-s are statistically significant. For augmented word-pair tasks, the transformer also achieved a significantly lower WER. However, the differences in WER-s were not statistically significant, suggesting that both models are comparable in basic character-to-character transcription, while the transformer demonstrates superior handling of stress regularities.
Interestingly, both models performed worse on the W2+R2+W1 dataset compared to W2+R2, suggesting that not all types of data augmentation are beneficial. We attribute this decline to the mismatch between training and test data distributions—specifically, the inclusion of single-word training instances in W2+R2+W1, while the test set contained only word-pair instances.
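For concreteness, the word-order-reversal augmentation behind the R2 set can be sketched as follows. The exact construction is not spelled out above, so this is an assumed reading in which each word pair also appears with its two words swapped on both the source and target sides:

```python
# Sketch of word-order-reversal augmentation for word-pair data
# (our assumed reading of how the R2 set is built): every
# (source, target) word pair also appears with its words swapped.
def reverse_pairs(dataset):
    """dataset: list of (source, target) two-word strings."""
    augmented = []
    for src, tgt in dataset:
        augmented.append((src, tgt))
        s1, s2 = src.split(" ", 1)
        t1, t2 = tgt.split(" ", 1)
        augmented.append((f"{s2} {s1}", f"{t2} {t1}"))
    return augmented

pairs = reverse_pairs([("Lech Walesa", "Lèch Valènsa")])
```

Under this reading, appending the reversed pairs to W2 yields W2+R2 and doubles the number of word-pair training instances.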
3.2. Detailed Error Analysis
Since the training datasets were generated via a semi-automated data processing pipeline, we conducted a more in-depth investigation into the nature of transcription errors. The term “error” can be misleading in this context, as discrepancies between the reference transcription (in the test set) and the model’s prediction are not necessarily genuine mistakes—particularly when the reference itself is incorrect.
To explore this further, we selected the 100 most frequent unique transcription mismatches (representing 9.4% of the total error mass) made by the best-performing model (a Transformer trained on the W2+R2 dataset) and attempted to systematically categorize them.
First, we manually reviewed and corrected the reference transcriptions where necessary. These 100 adjusted entries form what we refer to as the golden standard. Next, we categorized the observed mismatches—both between the original references and the golden standard, and between the model predictions and the golden standard—into four levels of severity, as shown in Table 7.
Table 7.
Categories of transcription errors.
The distribution of these mismatches is captured in Table 8, which cross-tabulates the reference vs. predicted errors by category:
Table 8.
Cross-tabulation of reference and prediction mismatches with respect to the golden standard. False errors (green, 13%) occur when the model’s prediction matches the golden standard. In cases where both the prediction and the reference differ from the golden standard, the prediction may be better (light green, 4%), comparable (27.3%), or worse (light red, 9.5%) than the reference, according to mismatch severity. True errors (red, 44%) occur when the reference matches the golden standard while the prediction does not.
Assuming this sample is representative of the full dataset, the training and test data still contain a considerable amount of noise: in 53.9% of test instances, the reference transcription deviates from the golden standard, even though only 11.5% of instances show perceptually significant mismatches. This suggests that the pattern-filtering techniques used in the preprocessing pipeline (described in Section 2.2.1) are effective at reducing severe errors but allow smaller inconsistencies to persist.
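A cross-tabulation like Table 8 can be reproduced from per-instance severity labels. Weighting each cell by occurrence frequency is our assumption, based on the shares of “error mass” reported above:

```python
from collections import Counter

# Sketch of the cross-tabulation behind Table 8. Each inspected
# instance carries a reference-vs-gold and a prediction-vs-gold
# severity label (0/s/m/l, as in Table 7 and Appendix B).
def cross_tabulate(instances):
    """instances: (occurrences, ref_category, pred_category) triples.
    Returns each (ref, pred) cell's share of the total error mass."""
    total = sum(occ for occ, _, _ in instances)
    cells = Counter()
    for occ, ref_cat, pred_cat in instances:
        cells[(ref_cat, pred_cat)] += occ
    return {cell: occ / total for cell, occ in cells.items()}

# First three rows of Table A2 used as a toy input
shares = cross_tabulate([(9, "0", "m"), (5, "s", "s"), (60, "0", "s")])
```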
Consequently, the WER and WER-s metrics reported earlier should be interpreted with caution. They likely represent pessimistic estimates of actual model performance.
Analysis of the largest error category—perceptually prominent mismatches—reveals several recurring error types. The most common are “language misidentification” errors, in which the model appears to apply the transcription rules of a more frequent language (typically English) to names from less represented languages. (The term language identification may be imprecise in this context, as the inputs often consist of mixed-language tokens rather than text in a single, clearly defined language.) Examples include:
- Tusk (Polish surname) transcribed as Tãsk, as if it were English.
- Mujica (Uruguayan) rendered as Mužìka instead of Muchìka, influenced by French.
- Sergio (Italian) misrendered as Ser̃chijo instead of Ser̃džijo, as if it were Spanish.
Some other errors result from missing diacritics in the original names. For instance, Walesa becomes Valèsa instead of the correct Valènsa or Valeñsa. The correct source should have been Wałęsa, which carries diacritic marks crucial for accurate transcription.
The full list of the top 100 transcription errors, including their golden standard corrections and assigned categories, is provided in Appendix B.
4. Conclusions
This study represents our initial investigation into the transcription of proper names in Lithuanian. We developed a data processing pipeline to transform raw web-crawled data into a training set suitable for this practical transcription task. For Lithuanian, the pipeline is semi-automatic due to the requirement for manual stress annotation by human labelers. However, for other target languages with more regular stress patterns (e.g., French or Latvian), this step could be omitted, potentially enabling a fully automatic pipeline.
The filtering and normalization techniques applied to the crawled data were effective, reducing noise by 10.7% (based on occurrence frequency) and decreasing the number of pattern types by 24.6%. Residual noise was assessed indirectly through detailed inspection of the 100 most frequent instances in the test set. Approximately 54% of the inspected transcriptions could be improved, though only around 11% required perceptually significant corrections. These findings suggest that the training data quality is sufficient to enable convergence of sequence-to-sequence models, although it remains below the standard typically achieved by human-curated datasets.
Among the models evaluated, the best-performing was a sequence-to-sequence Transformer, which achieved a word error rate (WER) of 19.04%, or 15.66% when accounting for stress compensation, on a word-pair dataset augmented with word order reversals. While this performance still lags behind human transcription accuracy (with an oracle WER of 5.43%), it represents a substantial improvement for text-to-speech (TTS) systems that currently lack any effective loanword adaptation component.
Furthermore, models trained on word pairs outperformed those trained on isolated words. The best-performing single-word model yielded a WER of 25.49% (19.81% stress-compensated), supporting previous findings that sequence-to-sequence models benefit from extended input context. However, to draw more definitive conclusions about the relative effectiveness of model architectures or hyperparameter configurations, further reduction in residual noise in the test data is needed.
We also examined data augmentation strategies, such as word order reversal and combining single-word and word-pair datasets. The results show that word order reversal led to a 5% relative improvement in model performance, while the combination of single-word and word-pair datasets proved less effective.
Several important research and practical questions remain open to future investigation:
- Alternative approaches. Exploring alternative approaches to the practical transcription problem, as introduced in Section 1.1 and Section 1.2, remains a promising direction for future work. The Generative AI approach may yield acceptable performance in handling mixed-language tokens, particularly when pre-trained large language models (LLMs) are fine-tuned or instruction-tuned for this specific task. Within the end-to-end machine learning framework, LLMs could also assist in cleaning and normalizing raw data patterns. Furthermore, LLMs may be employed to extract supplementary features—such as speaker nationality or linguistic background—which could enhance the predictive performance of transcription models.
- Stress modeling. This study incorporated several arbitrary choices regarding stress modeling. Stress placement was integrated into the end-to-end system, with the stress mark encoded as an additional ASCII symbol following the stressed character. An alternative approach would be to treat stress placement as an independent machine learning task. The stress mark could be embedded within an accented character, forming a single output symbol. This strategy would expand the output symbol set rather than increase the length of the output sequence. Additionally, weak supervision techniques [44] or semi-supervised [45] stress modeling approaches could be explored to partially automate this task.
- Experimenting with the dataset. It is important to continue manual cleaning of the test set to establish a fully curated, gold-standard benchmark, thereby increasing confidence in the accuracy estimates. The dataset could also be enriched with longer name sequences (e.g., three or more words) to better reflect the diversity of naming conventions. Additionally, since morphological analysis is already a component of the data processing pipeline, the dataset could be augmented with inflected forms of names not originally present. For example, from the nominative form Charlesas Darwinas → Čarlzas Darvinas, one could derive genitive forms such as Charles’o Darwin’o → Čarlzo Darvino or dative forms such as Charlesui Darwinui → Čarlzui Darvinui.
- Improving data filtering. The current alignment procedure appears overly permissive. Although stricter alignment criteria may require additional linguistic resources, a promising improvement would be to assign language labels to each permitted substitution. This would restrict substitutions to those within the same language set, potentially enhancing data consistency, filtering precision, and overall transcription accuracy.
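The two stress encodings contrasted in the stress-modeling item above can be illustrated with a small conversion sketch. The ASCII stress symbol "`" and the restriction to the grave accent are simplifying assumptions; the actual pipeline and Lithuanian orthography use a richer marker set (acute, grave, tilde):

```python
import unicodedata

# Converting from the "stress symbol after the character" encoding
# (used in this study) to the alternative single-symbol encoding,
# which enlarges the output alphabet instead of the sequence length.
STRESS = "`"  # hypothetical ASCII stress symbol

def to_composed(word):
    """Fold 'e`'-style marking into one precomposed accented character."""
    out = []
    for ch in word:
        if ch == STRESS and out:
            # U+0300 is the combining grave accent
            out[-1] = unicodedata.normalize("NFC", out[-1] + "\u0300")
        else:
            out.append(ch)
    return "".join(out)

print(to_composed("Le`ch"))  # Lèch — same word, one symbol shorter
```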
We plan to explore these avenues in subsequent studies, with the goal of building a more robust and linguistically informed transcription framework.
Author Contributions
Conceptualization, G.R. and T.K.; methodology, D.V.-A.; software, T.K. and D.K.; validation, A.M., D.A., D.K. and G.R.; resources, G.R. and D.K.; writing—original draft preparation, G.R.; writing—review and editing, D.V.-A. and A.M.; supervision, G.R. and T.K.; project administration, T.K. and A.Č. All authors have read and agreed to the published version of the manuscript.
Funding
This project was co-funded by the European Union under the Horizon Europe programme, grant agreement No. 101059903, and by the European Union funds for the 2021–2027 period together with the state budget of the Republic of Lithuania under financial agreement No. 10-042-P-0001.
Data Availability Statement
The original data presented in this study are openly available in the CLARIN-LT repository at https://clarin.vdu.lt/xmlui/handle/20.500.11821/68.
Acknowledgments
The authors would like to express their sincere gratitude to Regina Sabonytė for her valuable linguistic expertise regarding the placement of stress markers in the transcribed personal names.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| WFST | Weighted Finite State Transducer |
| LSTM | Long Short-Term Memory |
| WER | Word Error Rate |
| WER-s | Stress-compensated Word Error Rate |
| TTS | Text-to-Speech |
| G2P | Grapheme-to-Phoneme |
| LangID | Language Identification |
| LLM | Large Language Model |
| IPA | International Phonetic Alphabet |
| GPT | Generative Pre-trained Transformer |
| RNN | Recurrent Neural Network |
| ML | Machine Learning |
| ASCII | American Standard Code for Information Interchange |
Appendix A
Table A1.
Results of ablation experiments using the Transformer model. EEL, EHL, DEL, and DHL denote the dimensionality of the encoder’s (E) and decoder’s (D) embedding (E) and hidden (H) layers, respectively. The table reports the dependency of the estimated word error rate (WER) and stress-adjusted word error rate (WER-s) on different hyperparameter values.
| Task | EEL | EHL | DEL | DHL | Batch Size | Dropout Rate | WER, % | WER-s, % |
|---|---|---|---|---|---|---|---|---|
| | 128 | 512 | 128 | 512 | 256 | 0.1 | 29.13 ± 0.85 | 22.99 ± 0.76 |
| | 128 | 512 | 128 | 512 | 256 | 0.3 | 26.95 ± 0.29 | 21.22 ± 0.18 |
| | 128 | 512 | 128 | 512 | 1024 | 0.1 | 27.88 ± 0.43 | 21.51 ± 0.48 |
| | 128 | 512 | 128 | 512 | 1024 | 0.3 | 28.44 ± 0.29 | 22.33 ± 0.19 |
| | 128 | 512 | 256 | 1024 | 256 | 0.1 | 27.90 ± 0.19 | 22.42 ± 0.19 |
| | 128 | 512 | 256 | 1024 | 256 | 0.3 | 27.50 ± 0.39 | 21.96 ± 0.35 |
| | 128 | 512 | 256 | 1024 | 1024 | 0.1 | 28.65 ± 0.14 | 22.46 ± 0.20 |
| W1 | 128 | 512 | 256 | 1024 | 1024 | 0.3 | 25.49 ± 0.25 | 19.81 ± 0.22 |
| | 256 | 1024 | 128 | 512 | 256 | 0.1 | 27.98 ± 0.59 | 21.95 ± 0.45 |
| | 256 | 1024 | 128 | 512 | 256 | 0.3 | 27.85 ± 0.21 | 21.71 ± 0.19 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.1 | 29.35 ± 0.52 | 23.20 ± 0.54 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.3 | 27.73 ± 0.61 | 21.32 ± 0.54 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.1 | 28.26 ± 0.31 | 21.69 ± 0.28 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.3 | 26.83 ± 0.43 | 21.16 ± 0.44 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.1 | 28.84 ± 0.26 | 22.17 ± 0.26 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.3 | 27.22 ± 0.55 | 21.27 ± 0.59 |
| | 128 | 512 | 128 | 512 | 256 | 0.1 | 21.58 ± 0.36 | 17.31 ± 0.29 |
| | 128 | 512 | 128 | 512 | 256 | 0.3 | 21.31 ± 0.14 | 16.39 ± 0.09 |
| | 128 | 512 | 128 | 512 | 1024 | 0.1 | 21.70 ± 0.39 | 16.89 ± 0.32 |
| | 128 | 512 | 128 | 512 | 1024 | 0.3 | 21.88 ± 0.32 | 17.10 ± 0.17 |
| | 128 | 512 | 256 | 1024 | 256 | 0.1 | 20.59 ± 0.24 | 16.77 ± 0.18 |
| | 128 | 512 | 256 | 1024 | 256 | 0.3 | 19.98 ± 0.19 | 16.19 ± 0.14 |
| | 128 | 512 | 256 | 1024 | 1024 | 0.1 | 21.12 ± 0.25 | 17.10 ± 0.22 |
| W2 | 128 | 512 | 256 | 1024 | 1024 | 0.3 | 20.81 ± 0.30 | 16.47 ± 0.22 |
| | 256 | 1024 | 128 | 512 | 256 | 0.1 | 21.60 ± 0.18 | 17.56 ± 0.13 |
| | 256 | 1024 | 128 | 512 | 256 | 0.3 | 20.65 ± 0.40 | 16.14 ± 0.25 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.1 | 21.53 ± 0.25 | 17.49 ± 0.27 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.3 | 21.46 ± 0.07 | 16.90 ± 0.09 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.1 | 20.83 ± 0.16 | 17.11 ± 0.17 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.3 | 20.47 ± 0.27 | 16.90 ± 0.24 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.1 | 21.48 ± 0.29 | 17.34 ± 0.24 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.3 | 20.60 ± 0.20 | 16.41 ± 0.09 |
| | 128 | 512 | 128 | 512 | 256 | 0.1 | 20.01 ± 0.16 | 16.22 ± 0.14 |
| | 128 | 512 | 128 | 512 | 256 | 0.3 | 20.78 ± 0.14 | 16.44 ± 0.09 |
| | 128 | 512 | 128 | 512 | 1024 | 0.1 | 20.26 ± 0.13 | 16.52 ± 0.22 |
| | 128 | 512 | 128 | 512 | 1024 | 0.3 | 20.78 ± 0.09 | 16.39 ± 0.06 |
| | 128 | 512 | 256 | 1024 | 256 | 0.1 | 20.06 ± 0.14 | 16.48 ± 0.13 |
| | 128 | 512 | 256 | 1024 | 256 | 0.3 | 19.04 ± 0.18 | 15.66 ± 0.12 |
| | 128 | 512 | 256 | 1024 | 1024 | 0.1 | 20.19 ± 0.27 | 16.49 ± 0.24 |
| W2+R2 | 128 | 512 | 256 | 1024 | 1024 | 0.3 | 19.54 ± 0.20 | 16.15 ± 0.08 |
| | 256 | 1024 | 128 | 512 | 256 | 0.1 | 20.20 ± 0.24 | 16.69 ± 0.21 |
| | 256 | 1024 | 128 | 512 | 256 | 0.3 | 20.02 ± 0.13 | 16.12 ± 0.13 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.1 | 20.49 ± 0.32 | 16.68 ± 0.32 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.3 | 20.35 ± 0.18 | 16.25 ± 0.14 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.1 | 19.73 ± 0.19 | 16.24 ± 0.16 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.3 | 19.57 ± 0.22 | 16.40 ± 0.19 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.1 | 20.28 ± 0.30 | 16.55 ± 0.25 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.3 | 19.96 ± 0.08 | 16.43 ± 0.10 |
| | 128 | 512 | 128 | 512 | 256 | 0.1 | 19.83 ± 0.21 | 16.25 ± 0.23 |
| | 128 | 512 | 128 | 512 | 256 | 0.3 | 20.92 ± 0.06 | 16.53 ± 0.02 |
| | 128 | 512 | 128 | 512 | 1024 | 0.1 | 20.07 ± 0.11 | 16.42 ± 0.08 |
| | 128 | 512 | 128 | 512 | 1024 | 0.3 | 20.91 ± 0.07 | 16.70 ± 0.06 |
| | 128 | 512 | 256 | 1024 | 256 | 0.1 | 19.68 ± 0.19 | 16.70 ± 0.17 |
| | 128 | 512 | 256 | 1024 | 256 | 0.3 | 19.58 ± 0.05 | 16.02 ± 0.08 |
| | 128 | 512 | 256 | 1024 | 1024 | 0.1 | 19.30 ± 0.17 | 16.16 ± 0.17 |
| W2+R2+W1 | 128 | 512 | 256 | 1024 | 1024 | 0.3 | 19.86 ± 0.12 | 16.26 ± 0.09 |
| | 256 | 1024 | 128 | 512 | 256 | 0.1 | 19.95 ± 0.23 | 16.28 ± 0.08 |
| | 256 | 1024 | 128 | 512 | 256 | 0.3 | 20.12 ± 0.16 | 16.25 ± 0.06 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.1 | 20.14 ± 0.22 | 16.38 ± 0.25 |
| | 256 | 1024 | 128 | 512 | 1024 | 0.3 | 20.55 ± 0.12 | 16.58 ± 0.13 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.1 | 20.16 ± 0.11 | 16.77 ± 0.07 |
| | 256 | 1024 | 256 | 1024 | 256 | 0.3 | 19.51 ± 0.15 | 16.30 ± 0.12 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.1 | 19.96 ± 0.11 | 16.62 ± 0.08 |
| | 256 | 1024 | 256 | 1024 | 1024 | 0.3 | 19.51 ± 0.17 | 16.26 ± 0.15 |
Appendix B
Table A2.
The list of the 100 most frequent mismatches between the transcription reference and prediction, along with their categorization based on the gold standard. The symbols 0, s, m, and l denote a match, small mismatch, medium mismatch, and large mismatch, respectively.
| Occ. | Original | Reference | Prediction | Golden Standard | Ref. Cat | Pred. Cat. |
|---|---|---|---|---|---|---|
| 9 | Alexį Tsiprą | Alèksį Cìprą | Ãleksį Cìprą | | 0 | m |
| 5 | Alexio Tsipro | Alèksio Tsìpro | Alèksijo Cìpro | Alèksio Cìpro | s | s |
| 60 | Alexio Tsipro | Alèksio Cìpro | Alèksijo Cìpro | | 0 | s |
| 19 | Alexis Tsipras | Alèksis Tsìpras | Alèksis Cìpras | | s | 0 |
| 5 | Ali Zeidanas | Ãli Zeidãnas | Alì Zeidãnas | Ãli Zeĩdenas | m | m |
| 6 | Amy Poehler | Eĩmi Poũler | Eĩmi Pèler | | 0 | l |
| 6 | Andrea Bocelli | Andrèa Bočèli | Andrė̃ja Bočèli | | 0 | s |
| 14 | Anna Wintour | Ãna Viñtur | Ãna Vintùr | | 0 | s |
| 11 | Bambangas Soelistyo | Bambángas Sulìstjo | Bambángas Soelìsto | Bembéngas Sulìstjo | m | l |
| 10 | Blake Lively | Bleĩk Láivli | Bleĩk Lìvli | | 0 | l |
| 6 | Brendan Gilligan | Bréndan Gìligan | Bréndan Džìligan | | 0 | l |
| 9 | Buzz Aldrin | Bãz Òldrin | Bùz Áldrin | Bàz Òldrin | s | l |
| 5 | Buzzas Aldrinas | Bãzas Òldrinas | Bùzas Al̃drinas | Bàzas Òldrinas | s | l |
| 14 | Carlos Delfino | Kárlos Delfìno | Kárlos Del̃fino | | 0 | m |
| 9 | Caroline Kennedy | Kèrolain Kènedi | Karolìn Kènedi | | 0 | l |
| 6 | Carrie Prejean | Kèri Preidžán | Kèri Prežán | | m | 0 |
| 6 | Chris Cassidy | Krìs Kãsidi | Krìs Kẽsidi | | m | 0 |
| 10 | Cindy Crawford | Siñdi Kráuford | Siñdi Kroũford | Siñdi Krõford | s | s |
| 13 | Cindy Crawford | Siñdi Kròford | Siñdi Kroũford | Siñdi Krõford | s | s |
| 8 | David Lynch | Deĩvid Liñč | Deĩvid Lỹnč | | 0 | s |
| 10 | Dilmai Rousseff | Dìlmai Rùsef | Dil̃mai Rùsef | Dil̃mai Rusèf | s | s |
| 101 | Donald Tusk | Dònald Tùsk | Dònald Tãsk | | 0 | l |
| 5 | Donaldas Tuskas | Dònaldas Tùskas | Dònaldas Tãskas | | 0 | l |
| 5 | Ene Ergma | Èn Èrgma | Èn Er̃gma | Ène Èrgma | l | l |
| 5 | Geir Lundestad | Geĩr Luñdestad | Geĩr Liùndestad | | 0 | s |
| 6 | Yingluck Shinawatra | Jiñglak Činavãta | Iñglak Šinavãtra | Jiñglak Šinavãtra | l | s |
| 15 | Yingluck Shinawatra | Jinglùk Šinavãtra | Iñglak Šinavãtra | Jiñglak Šinavãtra | m | s |
| 5 | Yingluck Shinawatros | Jiñglak Činavãtos | Iñglak Šinavãtos | Jiñglak Šinavãtros | l | l |
| 5 | Yukio Edano | Jùkio Edãno | Jùkijo Edãno | | 0 | 0 |
| 8 | Jean Sibelius | Ján Sibèlijus | Žán Sibèlijus | | l | 0 |
| 5 | Jeanui Monnet | Žãnui Monè | Žãnui Monė̃ | | s | 0 |
| 8 | Jennifer Hudson | Džènifer Hãdson | Džènifer Hàdson | | s | 0 |
| 6 | Jerry Rubin | Džèri Rùbin | Džèri Rãbin | | 0 | l |
| 5 | Jiroemonas Kimura | Džiroemònas Kimùra | Žirumònas Kimùra | | 0 | l |
| 8 | Joakim Noah | Žoakìm Nòa | Joakìm Nòa | Džoũakim Noũa | l | l |
| 8 | Johnas Kirby | Džònas Kir̃bi | Džònas Ker̃bi | | s | 0 |
| 10 | Johno Kirby | Džòno Ker̃bi | Džòno Kir̃bi | | 0 | s |
| 14 | Jose Mujica | Chosė̃ Muchìka | Chosė̃ Mužìka | | 0 | l |
| 6 | Josephas Muscatas | Džòzefas Muskãtas | Džòzefas Maskãtas | Džoũzefas Maskãtas | m | s |
| 5 | Juan Carlos | Chuán Kárl | Chuán Kárlos | | l | 0 |
| 6 | Kei Nishikori | Kèi Nišikòri | Keĩ Nišikòri | | 0 | 0 |
| 8 | Kemalis Kilicdaroglu | Kemãlis Kiličdaròhlu | Kemãlis Kilikdaròhlu | | 0 | l |
| 5 | Kenneth Campbell | Kènet Kémbel | Kènet Kémpbel | | 0 | s |
| 6 | Kerem Gonlum | Kerèm Geñlium | Kerèm Gònlam | | 0 | l |
| 17 | Kianoushas Jahanpouras | Ki-ãnušas Džahanpū̃ras | Ki-ãnušas Džahanpùras | Ki-anùšas Džahanpū̃ras | s | m |
| 10 | Konrad Adenauer | Kònrad Ãdenauer | Kònrad Adenáuer | | 0 | s |
| 5 | Kurt Vonnegut | Kùrt Vònegut | Kùrt Fònegut | Kùrt Vãnegat | m | l |
| 12 | Lance Stephenson | Leñs Stèfenson | Leñs Stìvenson | Leñs Stỹvenson | l | s |
| 14 | Laurent Gbagbo | Lòren Gbãgbo | Lorán Gbãgbo | Lorán Gbagbò | s | s |
| 9 | Laurent’as Fabiusas | Lorãnas Fãbijusas | Lorãnas Fabiùsas | | s | 0 |
| 21 | Lech Walesa | Lèch Valènsa | Lèch Valèsa | | 0 | l |
| 5 | Lechą Walesą | Lèchą Valènsą | Lèchą Valèsą | | 0 | l |
| 30 | Lechas Walesa | Lèchas Valènsa | Lèchas Valèsa | | 0 | l |
| 10 | Lecho Walesos | Lècho Valènsos | Lècho Valèsos | | 0 | l |
| 6 | Mariah Carey | Marãja Kèri | Marìja Kèri | Merãja Kèri | s | l |
| 18 | Mariah Carey | Merãja Kèri | Marìja Kèri | | 0 | l |
| 12 | Marie Trintignant | Marì Trentinján | Marỹ Trintinján | Marỹ Trentinján | s | s |
| 6 | Marisol Touraine | Marizòl Tureñ | Marisòl Turèn | | m | 0 |
| 6 | Marissa Mayer | Marìsa Mèjer | Marìsa Mãjer | | s | 0 |
| 10 | Marlon Brando | Mar̃lon Brándo | Mar̃lon Breñdo | Márlon Bréndou | s | s |
| 11 | Michel Hazanavicius | Mišèl Hazanãvičius | Mišèl Azanãvičius | | s | 0 |
| 6 | Milton Friedman | Mil̃ton Friẽdman | Mil̃ton Frìdman | Mil̃ton Frỹdmen | l | m |
| 6 | Milton Friedman | Mil̃ton Frỹdman | Mil̃ton Frìdman | Mil̃ton Frỹdmen | s | m |
| 16 | Monta Ellis | Mònta Èlis | Mònta Ẽlis | Mòntei Èlis | s | m |
| 6 | Navi Pillay | Nãvi Piláj | Nãvi Pilái | Nãvi Pìlei | m | m |
| 23 | Nene Hilario | Nèn Ilãrijo | Nèn Hìlario | Nenè Ilãriju | l | l |
| 10 | Nicki Minaj | Nìki Minãdž | Nìki Minãj | Nìki Minãž | l | l |
| 65 | Oprah Winfrey | Òpra Vìnfri | Òpra Viñfri | Òupra Viñfri | m | m |
| 7 | Peter Hess | Pẽter Hès | Pìter Hès | | m | 0 |
| 6 | Peteris Altmaieris | Pė̃teris Áltmajeris | Pė̃teris Altmãjeris | | 0 | s |
| 6 | Peteris Lerneris | Pìteris Ler̃neris | Pė̃teris Ler̃neris | | 0 | l |
| 7 | Peteris Szijjarto | Pė̃teris Sìjarto | Pė̃teris Šidžárto | | 0 | l |
| 6 | Pol Pot | Pòl Pòt | Põl Pòt | | 0 | s |
| 7 | Prabowo Subianto | Prãbovo Subi-ánto | Prabòvo Subi-ánto | | s | 0 |
| 5 | Prosper Merimee | Pròsper Merìm | Pròsper Merìmi | Prospèr Merimė̃ | l | l |
| 10 | Raffaele Sollecito | Rafaèl Solečìto | Rafaèl Solesìto | Rafaèle Solèčito | m | l |
| 6 | Ralphas Fiennesas | Rálfas Fáinsas | Rálfas Fi-ènesas | Reĩfas Fáinzas | l | l |
| 10 | Ryan Toolson | Raján Tū̃lson | Raján Tùlson | Rãjan Tū̃lson | m | m |
| 18 | Sabine Kehm | Sabìn Kė̃m | Sabìn Kèm | Zabỹne Kė̃m | l | l |
| 6 | Sakellie Daniels | Sakelì Dẽni-els | Sakelì Dãni-els | Sakèli Dẽni-els | s | s |
| 5 | Salma Hayek | Sèlma Hãjek | Sálma Hãjek | | 0 | m |
| 6 | Salvador Allende | Salvadòr Aljènd | Salvadòr Alènd | Salvadòr Aljènde | l | l |
| 4 | Salvadoras Allende | Salvadòras Aljènd | Salvadòras Alènd | Salvadòras Aljènde | l | l |
| 5 | Samantha Murray | Samánta Miurė̃j | Samánta Mùrėj | | 0 | m |
| 19 | Sergio Mattarella | Ser̃džijo Matarèla | Ser̃chijo Matarèla | | 0 | l |
| 4 | Shuji Nakamura | Šiùdži Nakamùra | Šùdži Nakamùra | | s | 0 |
| 6 | Syd Barrett | Sìd Bãret | Sáid Bãret | Sìd Bẽret | s | l |
| 9 | Silvio Berlusconi | Sìlvijo Berluskòni | Sil̃vijo Berluskòni | | 0 | 0 |
| 5 | Simon Frekley | Sáimon Frìkli | Sáimon Frèkli | | s | 0 |
| 4 | Stephenas Mullas | Stìvenas Mãlas | Stìvenas Màlas | Stỹvenas Màlas | s | s |
| 9 | Steve Wozniak | Stìv Vòzniak | Stỹv Vòzniak | | s | 0 |
| 7 | Steven Theede | Stỹven Tỹd | Stìven Tìd | | 0 | s |
| 4 | Taneras Yildizas | Tãneras Jildìzas | Tanèras Jildìzas | | 0 | s |
| 14 | Thomas Hobbes | Tòmas Hòbs | Tòmas Hòbes | Tòmas Hòbz | s | l |
| 19 | Timothy Geithner | Tìmoti Gáitner | Tìmoti Geĩtner | | 0 | l |
| 13 | Valerie Trierweiler | Valerì Trìjerveler | Valerì Trỹrvailer | Valerỹ Trijervailèr | m | m |
| 28 | Valerie Trierweiler | Valerì Trirveĩler | Valerì Trỹrvailer | Valerỹ Trijervailèr | m | m |
| 4 | Vitaly Kamluk | Vitãli Kamliùk | Vitãli Kamlùk | | 0 | 0 |
| 9 | Woodrow Wilson | Vùdrov Vìlson | Vùdrou Vìlson | | s | 0 |
| 5 | Woodrow Wilsono | Vùdrau Vìlsono | Vùdrou Vìlsono | | s | 0 |
References
- McArthur, T. Transliteration. In Concise Oxford Companion to the English Language; Oxford University Press: Oxford, UK, 2018; Available online: https://www.encyclopedia.com/humanities/encyclopedias-almanacs-transcripts-and-maps/transliteration (accessed on 7 May 2025).
- Superanskaja, A.V. Teoreticheskie Osnovy Prakticheskoj Transkripcii [Theoretical Foundations of Practical Transcription], 2nd ed.; LENAND: Moscow, Russia, 2018; pp. 10–40. Available online: https://archive.org/details/raw-..-2018/page/1/mode/2up (accessed on 7 May 2025). (In Russian)
- Lui, M.; Baldwin, T. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Republic of Korea, 9–11 July 2012; pp. 25–30. [Google Scholar]
- Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv 2017, arXiv:1607.01759. Available online: https://arxiv.org/abs/1607.01759 (accessed on 7 May 2025).
- Google. Compact Language Detector v3 (CLD3). Available online: https://github.com/google/cld3 (accessed on 7 May 2025).
- Papariello, L. XLM-Roberta-Base Language Detection. Hugging Face. 2021. Available online: https://huggingface.co/papluca/xlm-roberta-base-language-detection (accessed on 7 May 2025).
- Apple. Language Identification from Very Short Strings. Apple Machine Learning Research. 2019. Available online: https://machinelearning.apple.com/research/language-identification-from-very-short-strings (accessed on 7 May 2025).
- Toftrup, M.; Sørensen, S.A.; Ciosici, M.R.; Assent, I. A reproduction of Apple’s bi-directional LSTM models for language identification. arXiv 2021, arXiv:2102.06282. Available online: https://arxiv.org/abs/2102.06282 (accessed on 24 April 2025).
- Moillic, J.; Ismail Fawaz, H. Language Identification for Very Short Texts: A review. Medium. 2022. Available online: https://medium.com/besedo-engineering/language-identification-for-very-short-texts-a-review-c9f2756773ad (accessed on 7 May 2025).
- Kostelac, M. Comparison of Language Identification Models. ModelPredict. 2021. Available online: https://modelpredict.com/language-identification-survey (accessed on 7 May 2025).
- International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet; Cambridge University Press: Cambridge, UK, 1999. [Google Scholar] [CrossRef]
- OpenAI. GPT-4 Technical Report. 2023. Available online: https://arxiv.org/pdf/2303.08774 (accessed on 24 April 2025).
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. Available online: https://arxiv.org/abs/2109.01652 (accessed on 7 May 2025).
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. Available online: https://arxiv.org/abs/2203.02155 (accessed on 7 May 2025).
- Zhang, S.; Liang, Y.; Shin, R.; Chen, M.; Du, Y.; Li, X.; Ram, A.; Zhang, Y.; Ma, T.; Finn, C. Instruction tuning for large language models: A survey. arXiv 2023, arXiv:2308.10792. Available online: https://arxiv.org/abs/2308.10792 (accessed on 7 May 2025).
- Ainsworth, W. A system for converting English text into speech. IEEE Trans. Audio Electroacoust. 1973, 21, 288–290. [Google Scholar] [CrossRef]
- Elovitz, H.; Johnson, R.; McHugh, A.; Shore, J. Letter-to-sound rules for automatic translation of English text to phonetics. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 446–459. [Google Scholar] [CrossRef]
- Divay, M.; Vitale, A.J. Algorithms for grapheme-phoneme translation for English and French: Applications for database searches and speech synthesis. Comput. Linguist. 1997, 23, 495–523. [Google Scholar]
- Damper, R.I.; Eastmond, J.F. A comparison of letter-to-sound conversion techniques for English text-to-speech synthesis. Comput. Speech Lang. 1997, 11, 33–73.
- Finch, A.; Sumita, E. Phrase-based machine transliteration. In Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation, Hyderabad, India, 11 January 2008.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the NeurIPS 2014, Montreal, Canada, 8–11 December 2014.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
- Gehring, J.; Auli, M.; Grangier, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the ICML 2017, Sydney, Australia, 6–11 August 2017.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NeurIPS 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
- Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351.
- Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; Jégou, H. Unsupervised cross-lingual representation learning at scale. In Proceedings of the ACL 2020, Online, 5–10 July 2020; pp. 8440–8451. Available online: https://aclanthology.org/2020.acl-main.747/ (accessed on 7 May 2025).
- Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421.
- Cotterell, R.; Kirov, C.; Sylak-Glassman, J.; Walther, G.; Vylomova, E.; McCarthy, A.D.; Kann, K.; Mielke, S.J.; Nicolai, G.; Silfverberg, M.; et al. The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task, Brussels, Belgium, 31 October–1 November 2018; pp. 1–27.
- Wu, S.; Cotterell, R.; Hulden, M. Applying the transformer to character-level transduction. arXiv 2020, arXiv:2005.10213.
- Gorman, K.; Ashby, L.F.E.; Goyzueta, A.; McCarthy, A.; Wu, S.; You, D. The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Online, 10 July 2020; pp. 40–50.
- Raškinis, G. Transliteration List of Foreign Person Names into Lithuanian v.1; CLARIN-LT: Kaunas, Lithuania, 2025; Available online: http://hdl.handle.net/20.500.11821/68 (accessed on 7 May 2025).
- Norkevičius, G.; Raškinis, G.; Kazlauskienė, A. Knowledge-based grapheme-to-phoneme conversion of Lithuanian words. In Proceedings of the SPECOM 2005, 10th International Conference Speech and Computer, Patras, Greece, 17–19 October 2005; pp. 235–238.
- Kazlauskienė, A.; Raškinis, G.; Vaičiūnas, A. Automatinis Lietuvių Kalbos žodžių Skiemenavimas, Kirčiavimas, Transkribavimas [Automatic Syllabification, Stress Assignment and Phonetic Transcription of Lithuanian Words]; Vytautas Magnus University: Kaunas, Lithuania, 2010; Available online: https://hdl.handle.net/20.500.12259/254 (accessed on 15 April 2025). (In Lithuanian)
- Kirčiuoklis—A Tool for Placing Stress Marks on Lithuanian Words. Available online: https://kalbu.vdu.lt/mokymosi-priemones/kirciuoklis/ (accessed on 24 April 2025).
- Novak, J.R.; Minematsu, N.; Hirose, K. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Nat. Lang. Eng. 2016, 22, 907–938.
- Taylor, P. Hidden Markov models for grapheme to phoneme conversion. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 1973–1976.
- Lee, J.L.; Ashby, L.F.E.; Garza, M.E.; Lee-Sikka, Y.; Miller, S.; Wong, A.; McCarthy, A.D.; Gorman, K. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4216–4221.
- Viterbi, A.J. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269.
- Roark, B.; Sproat, R.; Allauzen, C.; Riley, M.; Sorensen, J.; Tai, T. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Republic of Korea, 9–11 July 2012; pp. 61–66.
- Allauzen, C.; Riley, M.; Schalkwyk, J.; Skut, W.; Mohri, M. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the CIAA, Prague, Czech Republic, 16–18 July 2007; pp. 11–23.
- Gorman, K. Pynini: A Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, Berlin, Germany, 12 August 2016; pp. 75–80.
- Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. Fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, MN, USA, 2–7 June 2019; pp. 48–53.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
- Song, H.; Kim, M.; Park, D.; Shin, J. Learning from noisy labels with deep neural networks: A survey. arXiv 2020, arXiv:2007.08199.
- Sohn, K.; Berthelot, D.; Li, C.; Zhang, Z.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Zhang, H.; Raffel, C. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv 2020, arXiv:2001.07685.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).