A Rule-Based Grapheme-to-Phoneme Conversion System

Kłosowski, Piotr

doi:10.3390/app12052758

by

Piotr Kłosowski

Department of Telecommunication and Teleinformatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland

Appl. Sci. 2022, 12(5), 2758; https://doi.org/10.3390/app12052758

Submission received: 12 January 2022 / Revised: 27 February 2022 / Accepted: 3 March 2022 / Published: 7 March 2022

(This article belongs to the Special Issue Automatic Speech Recognition)

Abstract

This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. The fundamental grapheme-to-phoneme conversion rules were developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm. The algorithm has been implemented as a software application called TransFon, which allows the user to convert any text in Polish orthography to corresponding strings of phonemes, in phonemic transcription. Using TransFon, a phonemic Polish language corpus was created out of an orthographic corpus. The phonemic language corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. The developed phonemic language corpus opens up further opportunities for research to improve automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora. The method presented here enables the creation of such corpora.

Keywords:

grapheme-to-phoneme conversion; speech recognition; language corpus; language modelling; language statistical analysis

1. Introduction

Natural language processing often requires grapheme-to-phoneme (G2P) conversion of an orthographic text [1]. G2P conversion maps strings of graphemes directly to corresponding sequences of phonetic transcription characters, and it is crucial for many applications in various areas of speech and language processing [2]. The most frequent applications of grapheme-to-phoneme conversion occur in text-to-speech systems, which require high-quality phonetic transcriptions to function well [3]. Tools for converting graphemes to phonemes are also used in theoretical and applied linguistics. Such tools are useful in many areas of linguistic research (e.g., phonetics, phonology, dialectology, and language acquisition) for obtaining preliminary phonetic transcriptions of large language corpora [4].

The main goal of this research on the conversion of graphemes to phonemes is improving speech recognition for the Polish language [5,6]. In addition, research studies have been conducted on the phonetic properties of Polish phonemes [7,8], speech recognition based on such analyses [9], speaker recognition [10,11,12,13,14] and new applications of speech and language processing (e.g., speech translation) [15]. Particularly good results in speech recognition are achieved through the use of statistical language models [16,17,18,19]. However, the field of language modelling is currently shifting from statistical methods to neural networks and deep learning methods [20,21]. The development of statistical and deep learning methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora [18,19,20,21,22,23,24,25]. The main motivation for undertaking this research on automatic grapheme-to-phoneme conversion and its application was the development of effective methods of creating a phonemic language corpus for Polish, composed of phonemic transcriptions derived from an orthographic language corpus through grapheme-to-phoneme conversion.

2. Problem Formulation

The process of converting graphemes to phonemes in orthographic text involves converting a string of orthographic characters into a corresponding string of phonetic transcription characters (representing phonemes or allophones) [2]. A ‘grapheme’ is any of the units of any writing system for any language, a term coined by analogy with the ‘phoneme’ of a spoken language [26]. Graphemes include alphabetic letters, typographic ligatures, numerical digits, punctuation marks, and other individual symbols of writing systems. Since the orthographic text is the only source of pronunciation information in the process of converting graphemes into phonemes, this process must be based on appropriate formal rules, depicting the correct pronunciation of orthographic strings in a given language [27].

Phonemes are usually written in specially designed alphabets. The most widely used alphabet is the International Phonetic Alphabet (IPA) [28]. For the Polish language, as with other Slavic languages, a special transcription system, called the Slavistic Phonetic Alphabet (SPA), is most frequently used [29]. The other very commonly used phonetic alphabet is the Speech Assessment Methods Phonetic Alphabet (SAMPA) [30]. SAMPA is a machine-readable phonetic alphabet, using 7-bit printable ASCII characters, based on the IPA. Table 1 presents the phonemic inventory of Polish with examples, in the SPA, IPA, and SAMPA phonetic alphabets and corresponds to the set of phonemes used for the purpose of this study.

Automatic grapheme-to-phoneme conversion is not a new problem. The first linguist who noted it, and tried to provide a solution for a particular language (Czech), was H. Kučera [31]. Research on solutions to the automatic grapheme-to-phoneme conversion problem has also been initiated for other languages [32,33,34].

In Poland, the first linguist who wrote about the possibility of phonetic interpretation of text by machines was W. Doroszewski in 1969 [35]. The largest contributions to solving the problem of automatic grapheme-to-phoneme conversion for Polish were the publications of Maria Steffen-Batóg [36,37]. The first implementation of a grapheme-to-phoneme conversion algorithm for Polish, designed for the ODRA 1204 machine, was made in 1971 by M. Warmus [38]. Further attempts to implement automatic grapheme-to-phoneme conversion have been reported in various publications from that time to the present [39,40,41,42,43,44,45].

3. Methodology

Systems for converting graphemes to phonemes can be implemented in many ways, often roughly classified as either dictionary-based or knowledge-based (rule-based) strategies, although there are many intermediate solutions. Data-driven (dictionary-based) solutions involve storing as much phonological knowledge as possible in a lexicon, while rule-based systems consist of rules based on inference or proposed by expert linguists. Both dictionary-based and rule-based methods for converting graphemes to phonemes have their advantages and limitations. Searching for a word in the lexicon is relatively computationally inexpensive, whereas most rule-based system algorithms consume significantly more computational resources to produce a phoneme sequence. In addition, the dictionary method requires a large phonetic dictionary and complex morphophonemic rules, while the rule-based method is unable to model full morphophonemic constraints on its own. Very often, a rule-based system will incorporate a dictionary as an exception list. This solution was used in the grapheme-to-phoneme conversion application presented here.
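The hybrid strategy described above, in which an exception dictionary is consulted before the rules, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `EXCEPTIONS`, `rule_based_g2p`, and `convert_word` are hypothetical, the single dictionary entry is illustrative and not taken from the TransFon dictionary, and the rule-based stage is a stub standing in for the real rule tables.

```python
from typing import Callable, Dict

# Hypothetical exception list: orthographic form -> phonemic transcription.
# The single entry is illustrative only, not taken from TransFon.
EXCEPTIONS: Dict[str, str] = {"weekend": "wikent"}


def rule_based_g2p(word: str) -> str:
    """Stub for the rule-based stage; a real system would apply rule tables."""
    return word  # placeholder: identity mapping


def convert_word(word: str,
                 exceptions: Dict[str, str] = EXCEPTIONS,
                 fallback: Callable[[str], str] = rule_based_g2p) -> str:
    """Look the word up in the exception list first, then fall back to rules."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return fallback(key)
```

Keeping the exception list as plain data means that foreign words and acronyms can be added without touching the rule tables.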

3.1. Conversion Rules

The first set of rules for the conversion of graphemes into phonemes for the Polish language was developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to automatic grapheme-to-phoneme conversion of texts in Polish [36,37]. Knowledge-based grapheme-to-phoneme conversion, unlike data-driven approaches, exploits rules derived by humans and/or from linguistic studies [46]. The rules for rule-based grapheme-to-phoneme conversion are typically formulated in the framework of finite state automata [47]. The primary advantage of such rule-based approaches is that they can provide complete coverage. To achieve this objective, it is crucial to use the algebra of sets and, for this purpose, it is necessary to define the basic sets of graphemes employed. Examples of these basic sets for the Polish language are presented below [37]. The basic sets include Z, E, and S. Z is a set of alphabetic characters for Polish, E is a set of three graphemes that are non-alphabetic for Polish, and S is a set of special characters [37]:

Z = {a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż}

(1)

E = {q, v, x}

(2)

S = {., ?,!,,, :,;, -, (,), #, /}

(3)

The union of sets $Z \cup S$ will be indicated by $Z + S$ and defined as follows:

Z + S = {α : α \in Z \lor α \in S}

(4)

Similarly, subtraction of sets $Z \setminus S$ will be indicated by $Z - S$. Auxiliary subsets $O \subset S$ and $P \subset S$ are also defined as follows:

O = S - {/} = {., ?,!,,, :,;, -, (,), #}

(5)

P = O - {#} = {., ?,!,,, :,;, -, (,)}

(6)

The character “#” is a pause character between words, and the character “/” is a sign used to separate characters within a word. The set X collects the alphabetic characters of Polish together with the non-alphabetic graphemes and the special characters:

X = Z + E + S

(7)

X = {a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż, q, v, x, ., ?, !, ,, :, ;, -, (, ), #, /}

(8)
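The set definitions in Equations (1)–(8) map directly onto Python's built-in `set` type. The sketch below is an illustration, not code from TransFon; it keeps the variable names used in the text and treats each grapheme as a single character.

```python
# Basic grapheme sets from Equations (1)-(3), as Python sets of characters.
Z = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")  # Polish alphabetic characters
E = set("qvx")                               # non-alphabetic graphemes
S = set(".?!,:;-()#/")                       # special characters

O = S - {"/"}   # Equation (5): S without the delimiter character
P = O - {"#"}   # Equation (6): O without the pause character
X = Z | E | S   # Equation (7): the union written as Z + E + S in the text

print(len(Z), len(E), len(S), len(O), len(P), len(X))
```

The union operator `|` plays the role of “+” and the difference operator `-` the role of “−” in the notation above.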

The set X does not contain digits because all numbers are always converted to alphabetic representations. Let $Γ$ and $Δ$ be arbitrary sets of words in the X alphabet. The concatenation of the sets $Γ$ and $Δ$, written as $Γ \cdot Δ$ or $Γ Δ$, is an operation defined as follows:

Γ \cdot Δ = Γ Δ = {α β : α \in Γ \land β \in Δ}

(9)

The set $Γ \cdot Δ$ contains as elements all strings of characters of the form $α β$ for which $α \in Γ$ and $β \in Δ$. For any words $α$, $β$, $γ$, $δ$ we obtain:

{α, β} \cdot {γ, δ} = {α γ, α δ, β γ, β δ}

(10)

Here is an example to illustrate this type of relationship:

{s, c} \cdot A = {s, c} A = {sa, są, se, sę, si, so, só, su, sy, ca, cą, ce, cę, ci, co, có, cu, cy}

(11)

where A is the set of vowels in Polish:

A = {a, ą, e, ę, i, o, ó, u, y}

(12)

It is also necessary to define the empty word, written as 1, which does not contain any letters of the X alphabet. For any word $α$, the following equality holds:

α \cdot 1 = 1 \cdot α = α

(13)
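The concatenation of word sets in Equations (9)–(13) can be checked with a short Python sketch. The function name `concat` and the constant `EMPTY` are hypothetical; the empty word, written as 1 in the text, is represented here by the empty string.

```python
from typing import Set

EMPTY = ""  # the empty word, written as 1 in the text

A = set("aąeęioóuy")  # vowels in Polish, Equation (12)


def concat(gamma: Set[str], delta: Set[str]) -> Set[str]:
    """Concatenation of two sets of words, Equation (9)."""
    return {a + b for a in gamma for b in delta}


pairs = concat({"s", "c"}, A)  # the example of Equation (11)
print(sorted(pairs))
```

With two initial letters and nine vowels, `pairs` holds the 18 two-letter strings listed in Equation (11), and concatenating any set with `{EMPTY}` leaves it unchanged, as in Equation (13).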

The process of automatically converting graphemes to phonemes can be described as a function F defined by the following formula:

F (α) = β

(14)

where:

α = α_{1} \dots α_{k} \dots α_{a} \land \forall (1 \leq k \leq a) : α_{k} \in X

(15)

β = β_{1} \dots β_{k} \dots β_{b} \land \forall (1 \leq k \leq b) : β_{k} \in Y

(16)

and where $α$ is a sequence of $a$ alphabetic characters and $β$ is a sequence of $b$ phonemic characters; $b$ is determined by $a$, but $a$ need not be equal to $b$. The set Y is the entire set of phonemic characters for Polish described by the SPA alphabet:

\begin{matrix} Y & = & {i, y, e, a, o, u, i ̯, u ̯, r, l, m, n, ń,, f, v, s, z, š, ž, ś, ź, χ, p, b, t, \\ d, k, g, ḱ, ǵ, c, ʒ, č,, ć,} \end{matrix}

(17)

The grapheme-to-phoneme conversion of correctly written orthographic texts in Polish is the transformation of sequences written in the X alphabet into a correct form in the phonemic alphabet Y. The conversion function F can be described by a set of formal grapheme-to-phoneme conversion rules defining how each sequence $α$, representing a word in the X alphabet, can be transformed into a correct sequence $β$ in the phonemic alphabet defined by the set Y. The conversion rules are usually numerous, with varying degrees of complexity. The size and complexity of a set of grapheme-to-phoneme conversion rules depend on the number of letters in the orthographic alphabet and the degree to which each letter can be pronounced differently in various contexts.

The knowledge contained in the monographs of Maria Steffen-Batóg [36,37] was the basis for developing the grapheme-to-phoneme conversion algorithm for Polish implemented by the author in the Python programming language [48]. According to Steffen-Batóg, all grapheme-to-phoneme conversion rules relating to each orthographic letter can be stored in a matrix called the grapheme-to-phoneme conversion rule-table for the letter in question. The general form of a grapheme-to-phoneme conversion rule-table is shown in Table 2 [37].

Table 2 was constructed with the following assumptions [37]:

α_{k} \in X

(18)

\forall 1 \leq i \leq n : Γ_{i} \subset X

(19)

\forall 1 \leq j \leq m : Δ_{j} \subset X

(20)

\forall 1 \leq i \leq n \land 1 \leq j \leq m : β_{i, j} \in Y

(21)

The letter $α_{k}$ in the upper left corner of the table indicates that it is the table of grapheme-to-phoneme conversion rules for the orthographic letter $α_{k}$. The headings of the columns in the table ($Δ_{1}$, $Δ_{2}$, …, $Δ_{j}$, …, $Δ_{m - 1}$, $Δ_{m}$) represent various ways of phonetically transcribing the letter $α_{k}$, depending on the letters following $α_{k}$ in context. Similarly, the headings of the rows ($Γ_{1}$, $Γ_{2}$, …, $Γ_{i}$, …, $Γ_{n - 1}$, $Γ_{n}$) determine various ways of phonetically transcribing the letter $α_{k}$, depending on the letters preceding $α_{k}$ in context. At the intersections of the rows and columns, one finds the grapheme-to-phoneme conversions for the letter $α_{k}$ in each specific orthographic context. For example, the intersection of the i-th row and j-th column indicates the element $β_{i, j}$, which implies the following conversion rule:

For each orthographic string $s = γ α_{k} δ$ , where $γ \in Γ_{i}$ and $δ \in Δ_{j}$ , the grapheme-to-phoneme conversion process is defined as the assignment $α_{k} \to β_{i, j}$ .

The symbol $β_{i, j}$ in a specific conversion rule could represent a single phonemic letter, a sequence of phonemic letters, or the empty letter, written as the symbol “1”. It is also worth remembering that certain combinations of orthographic characters in a particular language may be absent from the table. Therefore, there may be some orthographic contexts of the letter $α_{k}$ for which phonetic transcription cannot be determined. In such cases, specific fields in the table are left empty. This principle often allows the simplification of grapheme-to-phoneme conversion rule tables.

One of the simplest grapheme-to-phoneme conversion rules for Polish is the rule for orthographic letter “a” presented in Table 3. According to the grapheme-to-phoneme conversion rule defined in Table 3, each orthographic letter “a”, in all orthographic contexts (before or after any sequence of letters), should be converted into phoneme [a] [37].

An example of a more complex table containing nine grapheme-to-phoneme conversion rules, is the rule-table for letter “ą” presented in Table 4 [37].

The definition of grapheme-to-phoneme conversion rules for letter “ą” requires additional sets $Δ_{1}, \dots, Δ_{9}$ defined as follows [37]:

Δ_{1} = {l, ł, m}

(22)

Δ_{2} = {p, b}

(23)

Δ_{3} = {t, k, g}

(24)

Δ_{4} = c (X - {i, h})

(25)

Δ_{5} = d (X - {z, ź})

(26)

Δ_{6} = dz (X - {i})

(27)

Δ_{7} = {ć, ci, dź, dzi}

(28)

Δ_{8} = {ś, ź, f, w, s, z, ż, ch}

(29)

Δ_{9} = O

(30)
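To make the role of the sets $Δ_{1}, \dots, Δ_{9}$ concrete, the following Python sketch selects the column index j for the text that follows a letter “ą”. It rests on assumptions that should be stated plainly: the function name `column_for_a_ogonek` is hypothetical, membership in $Δ_{j}$ is interpreted as a match at the start of the following text, and that text is expected to end with the pause character “#”. The phonemes stored in Table 4 are not reproduced, so the sketch returns only the column index.

```python
from typing import Optional

Z = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")
E = set("qvx")
S = set(".?!,:;-()#/")
X = Z | E | S
O = S - {"/"}


def column_for_a_ogonek(following: str) -> Optional[int]:
    """Return j such that the start of `following` belongs to Δ_j, else None.

    `following` is the text after the letter "ą" and is expected to end
    with the pause character "#".
    """
    if not following:
        return None
    c0 = following[0]
    c1 = following[1] if len(following) > 1 else ""
    c2 = following[2] if len(following) > 2 else ""
    if c0 in {"l", "ł", "m"}:                       # Δ_1
        return 1
    if c0 in {"p", "b"}:                            # Δ_2
        return 2
    if c0 in {"t", "k", "g"}:                       # Δ_3
        return 3
    if c0 == "ć" or (c0 == "c" and c1 == "i"):      # Δ_7: ć, ci
        return 7
    if c0 == "d" and (c1 == "ź" or (c1 == "z" and c2 == "i")):  # Δ_7: dź, dzi
        return 7
    if c0 == "c" and c1 == "h":                     # Δ_8: ch
        return 8
    if c0 in {"ś", "ź", "f", "w", "s", "z", "ż"}:   # Δ_8: single letters
        return 8
    if c0 == "c" and c1 in X - {"i", "h"}:          # Δ_4
        return 4
    if c0 == "d" and c1 == "z" and c2 in X - {"i"}: # Δ_6
        return 6
    if c0 == "d" and c1 in X - {"z", "ź"}:          # Δ_5
        return 5
    if c0 in O:                                     # Δ_9
        return 9
    return None
```

Because the nine sets are mutually exclusive, the order of the tests does not affect the result; a return value of `None` corresponds to a context for which the table has no column.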

Another example of a more complex rule table is the table for orthographic letter “u”. It consists of five parts, presented in Table 5, Table 6, Table 7 and Table 8. This table contains 88 grapheme-to-phoneme conversion rules [37].

The most complex are the tables of rules for converting graphemes to phonemes in Polish for the following orthographic letters: “z”— 174 rules, “s”— 157 rules, “r”— 91 rules, “u”— 88 rules, and “d”— rules.

As presented in the literature by Steffen-Batóg, the grapheme-to-phoneme conversion rules are clear and universal [37], but they are correct only for Polish. For any other language, it is necessary to define that language’s own grapheme-to-phoneme conversion rules.

On the basis of Steffen-Batóg’s work [36,37], the author of this paper implemented the grapheme-to-phoneme conversion rules for Polish in the Python programming language [49]. The implementation includes 975 conversion rules covering all 35 letters and special characters [48] of Polish orthography. Although many words have multiple correct pronunciations, the implemented rules produce only one (the most basic) pronunciation for any given word. The list of implemented grapheme-to-phoneme conversion rules for Polish is presented in Table 9, where P is a set of special characters excluding $\{\#, @, -, /\}$, “#” is a pause character at the end of a word, “@” is a pause character at the beginning of a word, “/” is a delimiter character, and “-” is a dash character.

The list presented in Table 9 shows that the number of grapheme-to-phoneme conversion rules implemented is 975. The number of cells (potential letter-context combinations) in grapheme-to-phoneme conversion tables is much larger, here 5162. This means that some of them (i.e., 4187 cells) are empty and have no function in the grapheme-to-phoneme conversion process. Empty cells represent letter-context combinations not allowed in Polish.

3.2. Conversion Algorithm

The grapheme-to-phoneme conversion rules are the foundation of the grapheme-to-phoneme conversion algorithm implemented here [48]. The conversion algorithm defines how the function $F (α) = β$ works and determines the method for obtaining sequences of phonemic letters $β_{k}$ on the basis of orthographic letters $α_{k}$ and their contexts. The block diagram of the grapheme-to-phoneme conversion algorithm for a single orthographic letter is presented in Figure 1.

The grapheme-to-phoneme conversion algorithm, using the general form of grapheme-to-phoneme conversion rules shown in Table 2, is shown here in terms of 12 steps:

(1) Read an input orthographically represented word $α$, where $α = α_{1} α_{2} \dots α_{k} \dots α_{n}$ and $α_{1}, \dots, α_{n} \in X$;

(2) At the beginning of the orthographically represented word $α$, let $k = 1$;

(3) Read a single letter $α_{k}$ of the input word $α$;

(4) Check the context of the letter $α_{k}$ in the word $α$:
- Read a string of letters $γ = α_{1} \dots α_{k - 1}$ preceding the orthographic letter $α_{k}$;
- Read a string of letters $δ = α_{k + 1} α_{k + 2} \dots α_{n}$ following the orthographic letter $α_{k}$;

(5) Select the appropriate grapheme-to-phoneme conversion rule table for the orthographic letter $α_{k}$;

(6) Select the appropriate j-th column of the table for the orthographic letter $α_{k}$, where $δ \in Δ_{j}$;

(7) Select the appropriate i-th row of the table for the orthographic letter $α_{k}$, where $γ \in Γ_{i}$;

(8) Select the appropriate grapheme-to-phoneme conversion rule for the orthographic letter $α_{k}$, as defined by the cell in the i-th row and j-th column of the table;

(9) Apply the selected grapheme-to-phoneme conversion rule for the orthographic letter $α_{k}$:
- The cell with the table coordinates $[i, j]$ stores the phonemic letter $β_{i, j}$ corresponding to the orthographic letter $α_{k}$ with the following context $δ$ and preceding context $γ$;
- The grapheme-to-phoneme conversion rule for the orthographic letter $α_{k}$ is implemented by the assignment $α_{k} \to β_{k} = β_{i, j}$;

(10) Increment $k = k + 1$ and read the next orthographic letter $α_{k}$ of the input word $α$;

(11) Go to step number 3, unless $k > n$, where n is the length of the input word $α$;

(12) Output the result of the grapheme-to-phoneme conversion: for the orthographic word $α = α_{1} \dots α_{n}$, where $α_{1}, \dots, α_{n} \in X$, the result is a new phonemic representation of the word, $β = β_{1} \dots β_{n}$, where $β_{1}, \dots, β_{n} \in Y$.

These 12 steps summarize the automatic grapheme-to-phoneme conversion algorithm. This algorithm can also be presented as “pseudocode”, as shown in the list called Algorithm 1.

Algorithm 1. The algorithm for grapheme-to-phoneme conversion (implementation of function $F (α) = β$).

Require: $α = α_{1} \dots α_{n}$ and $α_{1}, \dots, α_{n} \in X$
$n \leftarrow \mathrm{length}(α)$
$k \leftarrow 1$
while ($k \leq n$) do
    $γ \leftarrow α_{1} \dots α_{k - 1}$
    $δ \leftarrow α_{k + 1} α_{k + 2} \dots α_{n}$
    $table \leftarrow tables[α_{k}]$ {% selection of appropriate table}
    $R \leftarrow \mathrm{rows}(table)$ {% number of table rows}
    $C \leftarrow \mathrm{cols}(table)$ {% number of table columns}
    for all ($2 \leq i \leq R$) do
        if ($γ \in Γ_{i}$) then
            $I \leftarrow i$ {% selection of appropriate table row}
        end if
    end for
    for all ($2 \leq j \leq C$) do
        if ($δ \in Δ_{j}$) then
            $J \leftarrow j$ {% selection of appropriate table column}
        end if
    end for
    $β_{k} \leftarrow table[I, J]$ {% reading phoneme letter from the table}
    $k \leftarrow k + 1$
end while
$β \leftarrow β_{1} \dots β_{k - 1}$ {% string of phonemes}
return $β$
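The table lookup of Algorithm 1 can be sketched in Python as follows. This is not the TransFon source code. The names `RuleTable`, `constant`, `convert`, and `TOY_TABLES` are hypothetical, context sets are modelled as membership-test functions, and the tables are toy examples rather than the published rules. Unlike the pseudocode, which keeps the last matching row and column, the sketch takes the first match; the two agree whenever the context sets of a table are mutually exclusive. An empty cell raises an error, in line with the article's statement that conversion stops when no rule applies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Context = Callable[[str], bool]  # membership test for a set Γ_i or Δ_j


@dataclass
class RuleTable:
    rows: List[Context]               # Γ_1 ... Γ_n, tested on preceding text
    cols: List[Context]               # Δ_1 ... Δ_m, tested on following text
    cells: List[List[Optional[str]]]  # β_{i,j}; None marks an empty cell


def anything(_: str) -> bool:
    """Context test that accepts every string."""
    return True


def constant(phoneme: str) -> RuleTable:
    """One-cell table: the letter maps to `phoneme` in every context."""
    return RuleTable([anything], [anything], [[phoneme]])


def convert(word: str, tables: Dict[str, RuleTable]) -> str:
    """Apply the per-letter table lookup of Algorithm 1 to one word."""
    out = []
    for k, letter in enumerate(word):
        gamma, delta = word[:k], word[k + 1:]
        table = tables[letter]
        i = next((r for r, test in enumerate(table.rows) if test(gamma)), None)
        j = next((c for c, test in enumerate(table.cols) if test(delta)), None)
        beta = None if i is None or j is None else table.cells[i][j]
        if beta is None:
            raise LookupError(f"no rule for {letter!r} in {word!r} at {k}")
        out.append(beta)
    return "".join(out)


# Toy tables, NOT the published rules: "d" is devoiced only directly
# before the pause character "#", and "#" maps to the empty word.
TOY_TABLES: Dict[str, RuleTable] = {
    "a": constant("a"),
    "o": constant("o"),
    "m": constant("m"),
    "#": constant(""),
    "d": RuleTable(
        rows=[anything],
        cols=[lambda d: d.startswith("#"), anything],
        cells=[["t", "d"]],
    ),
}
```

With these toy tables, `convert("od#", TOY_TABLES)` yields `"ot"` and `convert("dom#", TOY_TABLES)` yields `"dom"`; a letter without a table raises `KeyError`.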

4. Results

This grapheme-to-phoneme conversion algorithm for Polish was implemented in the Python programming language as an independent application called TransFon [48]. The developed application allows the grapheme-to-phoneme conversion of any orthographic text file in Polish.

Evaluation of automatic grapheme-to-phoneme conversion implementations is crucial. It is necessary to determine the application’s degree of success (accuracy). The evaluating and testing procedure for this automatic grapheme-to-phoneme conversion implementation consisted of the following elements:

Performing grapheme-to-phoneme conversion of an orthographic text corpus file containing the most frequently used words in Polish (those with different orthographic representations), obtained from resources in the National Corpus of Polish [50];
Validation and verification of the conversion results for those words using the Polish language dictionary that specifies the correct pronunciation of words in Polish;
Registering cases of incorrect conversion, errors, and other problems encountered;
Attempts to solve the problems.

This grapheme-to-phoneme conversion application was implemented in such a way that the conversion algorithm stopped when a grapheme-to-phoneme conversion problem occurred (e.g., when there was no rule allowing for a correct phonetic transcription). This made it easier to improve and develop the conversion application. In addition, any doubts about correct pronunciations were resolved with help of the aforementioned dictionary.

In order to evaluate the implemented grapheme-to-phoneme conversion system, the word error rate (WER), and phoneme error rate (PER) were used, as the methods usually used to evaluate the performance of automatic speech recognition systems. The WER value for this grapheme-to-phoneme conversion system was calculated as the ratio of the number of incorrectly converted words to the total number of words converted from the language corpus. The PER parameter was similarly defined and calculated for phonemes.
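The two measures can be illustrated with a small Python sketch. WER follows the definition given above. For PER, which the text describes only as “similarly defined”, the sketch assumes that phoneme errors are counted as the edit distance between the reference and the converted phoneme sequence, divided by the number of reference phonemes. The function names and the three-word example are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(refs: List[List[str]], hyps: List[List[str]]) -> float:
    """Share of words whose phoneme sequence differs from the reference."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)


def phoneme_error_rate(refs: List[List[str]], hyps: List[List[str]]) -> float:
    """Phoneme-level edit errors divided by the number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Hypothetical example: one of three words has one wrong phoneme.
refs = [["o", "t"], ["j", "a", "k"], ["p", "S", "e", "s"]]
hyps = [["o", "d"], ["j", "a", "k"], ["p", "S", "e", "s"]]
print(word_error_rate(refs, hyps), phoneme_error_rate(refs, hyps))
```

In this example one word out of three is wrong, and one phoneme out of nine is wrong, so WER is 1/3 and PER is 1/9.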

The causes of problems and errors in the automatic grapheme-to-phoneme conversion operation were as follows:

Errors in the implementation of the algorithm or conversion rules;
Missing conversion rules in tables (rules not included in the tables) for some orthographic letters in contexts;
Problems with conversion of foreign words, acronyms and words that are not in the Polish language dictionary.

These problems were solved in the following ways:

Implementation errors in the conversion algorithm and rule tables were corrected by modifying the application source code;
The problem of missing conversion rules in tables has been solved by adding rules to the tables. It should be noted that the added grapheme-to-phoneme conversion rules cooperate with the rules implemented earlier and known from the literature [36,37]. In order to complete the rule tables, new rules were added for selected letters e.g.: “i”, “n”, “d”, “z”, “ż”, “ć”, “f”, “s” in some contexts, in particular:
- adding or correcting conversion rules for the letter “i” for correct conversion of words: “unii”, “będzie”, “sobie”, “razie”, “diabeł”, and similar contexts;
- adding or correcting conversion rules for the letter “n” for correct conversion of the word “branży” and similar contexts;
- adding or correcting conversion rules for the letter “d” for correct conversion when the letter “d” occurs at the end of a word, e.g.: “od”, “pod”, “przed”, “nad”, “miliard”, “wyjazd”, “Witold”, “grand”, “rajd”, “hołd”, “prawd”;
- adding or correcting conversion rules for the letter “z” for correct conversion of words: “trzeci”, “zamierza”, “poradzę”, “bezzwrotny”, and similar contexts;
- adding or correcting conversion rules for the letter “ż” for correct conversion of words: “tożsamość”, “manadżer”, and similar contexts;
- adding or correcting conversion rules for the letter “ć” for correct conversion of words: “ćwiczenia”, “dziećmi”, “zadośćuczynić”, “ćwiartki”, and similar contexts;
- adding or correcting conversion rules for the letter “f” for correct conversion of words: “Afganistan”, “Hoffman”, and similar contexts;
- adding or correcting conversion rules for the letter “s” for correct conversion of words: “przeprosin”, “siną”, “Helsinki”, and similar contexts;
The problem of converting foreign words and acronyms was solved by using the developed dictionary in which the rules for converting foreign words and acronyms were defined. Thus, the rule-based grapheme-phoneme conversion was supplemented by the dictionary method;
Examples of grapheme-to-phoneme conversion rule-table additions implemented by the author are shown in Table 10, Table 11, Table 12, Table 13, Table 14, Table 15, Table 16 and Table 17.

A number of improvements have made it possible to increase the effectiveness of the grapheme-to-phoneme conversion process.

The results obtained were compared to the results available in the literature [51,52,53,54,55]. Table 18 presents the WER and PER values of the developed grapheme-to-phoneme conversion system before improvements. Table 19 presents the WER and PER values of the developed grapheme-to-phoneme conversion system after improvements. The summary of evaluation results for the system developed, before and after improvements, is presented in Table 20. The results presented in Table 20 indicate that conversion performance improved significantly after the improvements were implemented.

As the final result of improvements to the G2P implementation, the software application called TransFon was developed, which allows automatic grapheme-to-phoneme conversion of any orthographic text file in Polish [48]. Numerous tests of this software application on a large orthographic text corpus [56], in a variety of operating environments, confirmed the accuracy of the application’s operation.

To demonstrate the operation of the TransFon application, selected works of Polish literature were subjected to automatic grapheme-to-phoneme conversion. A sample fragment of Adam Mickiewicz’s epic poem “Pan Tadeusz” and a fragment of the output file obtained after automatic grapheme-to-phoneme conversion are presented with the printouts attached to this paper. An example fragment of an input text file containing an orthographic text from the epic poem “Pan Tadeusz” by Adam Mickiewicz is shown in Listing 1 [57].

Listing 1. A sample fragment of the input text file containing orthographic text from Adam Mickiewicz’s epic poem “Pan Tadeusz”.

Litwo! Ojczyzno moja! ty jesteś jak zdrowie.
Ile cię trzeba cenić, ten tylko się dowie,
Kto cię stracił. Dziś piękność twą w całej ozdobie
Widzę i opisuję, bo tęsknię po tobie.
Panno Święta, co jasnej bronisz Częstochowy
I w Ostrej świecisz Bramie! Ty, co gród zamkowy
Nowogródzki ochraniasz z jego wiernym ludem!
Jak mnie dziecko do zdrowia powróciłaś cudem
(Gdy od płaczącej matki pod Twoją opiekę
Ofiarowany, martwą podniosłem powiekę
I zaraz mogłem pieszo do Twych świątyń progu
Iść za wrócone życie podziękować Bogu),
Tak nas powrócisz cudem na Ojczyzny łono.
Tymczasem przenoś moję duszę utęsknioną
Do tych pagórków leśnych, do tych łąk zielonych,
Szeroko nad błękitnym Niemnem rozciągnionych;
Do tych pól malowanych zbożem rozmaitem,
Wyzłacanych pszenicą, posrebrzanych żytem;
Gdzie bursztynowy świerzop, gryka jak śnieg biała,
Gdzie panieńskim rumieńcem dzięcielina pała,
A wszystko przepasane, jakby wstęgą, miedzą
Zieloną, na niej z rzadka ciche grusze siedzą.
…

An example fragment of an output file containing phonemic transcriptions of Adam Mickiewicz’s epic poem “Pan Tadeusz” is shown in Listing 2.

Listing 2. A fragment of the output file containing phonemic notation as a result of automatic grapheme-to-phoneme conversion of Adam Mickiewicz’s epic poem “Pan Tadeusz”.

[litvo] [ojtSIzno] [moja] [tI] [jestes’] [jak] [zdrovje]
[ile] [ts’e] [tSeba] [tsen’its’] [ten] [tIlko] [s’e] [dovje]
[kto] [ts’e] [strats’iw] [dz’is’] [pjenknos’ts’] [tvoe^~] [f] [tsawej] [ozdobje]
[vidze] [i] [opisuje] [bo] [te^~skn’e] [po] [tobje]
[panno] [s’vjenta] [tso] [jasnej] [bron’iS] [tSe^~stoxovI]
[i] [f] [ostrej] [s’vjets’iS] [bramje] [tI] [tso] [grut] [zamkovI]
[novogrutsk’i] [oxran’aS] [s] [jego] [vjernIm] [ludem]
[jak] [mn’e] [dz’etsko] [do] [zdrovja] [povruts’iwas’] [tsudem]
[gdI] [ot] [pwatSontsej] [matk’i] [pot] [tvojoe^~] [opjeke]
[ofjarovanI] [martvoe^~] [podn’oswem] [povjeke]
[i] [zaras] [mogwem] [pjeSo] [do] [tvIx] [s’vjontIn’] [progu]
[is’ts’] [za] [vrutsone] [ZIts’e] [podz’enkovats’] [bogu]
[tak] [nas] [povruts’iS] [tsudem] [na] [ojtSIznI] [wono]
[tImtSasem] [pSenos’] [moje] [duSe] [ute^~skn’onoe^~]
[do] [tIx] [pagurkuf] [les’nIx] [do] [tIx] [wonk] [z’elonIx]
[Seroko] [nat] [bwenk’itnIm] [n’emnem] [rosts’ongn’onIx]
[do] [tIx] [pul] [malovanIx] [zboZem] [rozmaitem]
[vIzwatsanIx] [pSen’itsoe^~] [posrebZanIx] [ZItem]
[gdz’e] [burStInovI] [s’vjeZop] [grIka] [jak] [s’n’ek] [bjawa]
[gdz’e] [pan’en’sk’im] [rumjen’tsem] [dz’en’ts’elina] [pawa]
[a] [fSIstko] [pSepasane] [jagbI] [fstengoe^~] [mjedzoe^~]
[z’elonoe^~] [na] [n’ej] [s] [Zatka] [ts’ixe] [gruSe] [s’edzoe^~]
…

One measure of the performance of an automatic grapheme-to-phoneme conversion algorithm is word processing speed (WPS). The WPS value for the TransFon application was estimated by measuring the time required to convert the test text, Adam Mickiewicz’s epic poem “Pan Tadeusz”, containing 10,972 lines of text and 70,291 words [57]. Automatic grapheme-to-phoneme conversion of the whole text file took about 38 s. It follows that the TransFon application can process text files at an average speed of about 1849 words per second, which is 110,940 words per minute:

WPS = \frac{70291 \ \text{words}}{38 \ \text{s}} \approx 1849 \ \frac{\text{words}}{\text{s}} = 110940 \ \frac{\text{words}}{\text{min}}

(31)
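The arithmetic behind Equation (31) can be reproduced in a few lines of Python. The variable names are hypothetical; the inputs are the word count and the approximate conversion time reported above.

```python
words = 70_291  # words in the test text
seconds = 38    # approximate conversion time reported above

wps = words / seconds  # about 1849.8 words per second
wps_floor = int(wps)   # 1849, the rounded-down figure used in the text
wpm = wps_floor * 60   # words per minute derived from the rounded value

print(wps_floor, wpm)
```

The per-minute figure of 110,940 is obtained from the rounded-down value of 1849 words per second; using the unrounded quotient would give a slightly higher number.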

The measurement of grapheme-to-phoneme conversion speed was performed on a PC workstation with CPU Intel Core i7 4770K @ 3.50 GHz. The computational power of this CPU is around 130,000 MIPS.

The TransFon software can also assist in the processing of language corpora. Development of statistical methods for speech recognition requires access to large language corpora [58]. One of the largest corpora available for the Polish language is the National Corpus of Polish (NCP) [56]. The NCP corpus is available to the scientific community, offers great flexibility, and has great scientific value. This corpus provides the scientific community, particularly linguists and computer scientists interested in natural language processing, with materials showing the contemporary state of the Polish language and meeting all the scientific requirements. Word-frequency lists obtained using the National Corpus of Polish seem particularly valuable. Automatic grapheme-to-phoneme conversion allows creating large phonemic language corpora from orthographic language corpora. A phonemic language corpus for Polish was developed by the author using automatic grapheme-to-phoneme conversion of an orthographic language corpus, in order to be able to perform statistical phonological analysis of the Polish language, and to develop phoneme-based statistical language models for Polish to improve automatic speech recognition [17,18,19].

A sample fragment of the frequency list for the developed phonemic corpus of Polish is presented in Listing 3. The phonemic frequency list file contains 1,943,458 Polish words written orthographically, their phonemic transcriptions in the SAMPA alphabet, and the number of occurrences of each word in the NCP source corpus. The total number of word tokens in the NCP corpus is 230,300,300; this figure refers to a plain-text version of the NCP corpus file containing about 230 million words in Polish. It should also be noted that the standard SAMPA transcriptions for Polish include several sequences of phonetic transcription labels that may cause ambiguity unless separated by spaces or other characters. To avoid this problem, the individual phonemes were separated by square brackets.

Listing 3. A sample fragment of the frequency list file of the phonemic language corpus developed for Polish.

7692997 w [f]
5333210 i [i]
4235003 na [na]
4158902 z [s]
3981525 się [s’e]
3601719 nie [n’e]
2904114 do [do]
2205896 że [Ze]
2171877 to [to]
1731304 o [o]
1728527 jest [jest]
1425793 a [a]
1003027 jak [jak]
983395 po [po]
912660 od [ot]
877522 ale [ale]
847373 za [za]
775006 przez [pSes]
754024 co [tso]
663771 dla [dla]
645573 czy [tSI]
610035 tym [tIm]
607673 już [juS]
544429 są [soe^~]
544343 tak [tak]
534509 tylko [tIlko]
500801 ma [ma]
475172 może [moZe]
451225 tego [tego]
445705 ze [ze]
426201 jego [jego]
413855 oraz [oras]
392445 bo [bo]
390741 które [kture]
386863 będzie [ben’dz’e]
376600 ich [ix]
367282 tej [tej]
359528 było [bIwo]
358750 też [teS]
357026 który [kturI]
349641 jeszcze [jeStSe]
348423 był [bIw]
346507 być [bIts’]
345596 jednak [jednak]
342989 przy [pSI]
…
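A minimal Python sketch of how lines in the layout shown in Listing 3 might be read is given here. It assumes each line consists of an occurrence count, an orthographic word without spaces, and a transcription enclosed in square brackets, exactly as printed above; the names `FrequencyEntry` and `parse_frequency_line` are illustrative and do not come from TransFon.

```python
import re
from typing import NamedTuple


class FrequencyEntry(NamedTuple):
    count: int           # number of occurrences in the source corpus
    word: str            # orthographic form
    transcription: str   # SAMPA transcription without the outer brackets


# Assumed layout: "<count> <word> [<transcription>]"
_LINE = re.compile(r"^(\d+)\s+(\S+)\s+\[(.*)\]$")


def parse_frequency_line(line: str) -> FrequencyEntry:
    """Parse one line of the frequency list into its three fields."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    count, word, transcription = match.groups()
    return FrequencyEntry(int(count), word, transcription)


sample = ["7692997 w [f]", "3981525 się [s’e]", "775006 przez [pSes]"]
entries = [parse_frequency_line(line) for line in sample]
total = sum(entry.count for entry in entries)
print(entries[1].word, entries[1].transcription, total)
```

Keeping the count as an integer makes it straightforward to sum occurrences or to rank entries when building word-based or phoneme-based statistics.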

Making the developed rule-based implementation available could be very useful to the community, both for linguistic research and for more in-depth performance comparisons between the current rule-based solution and others. The TransFon application is therefore planned to be made available once the necessary corrections to the implemented algorithm have been made and the conditions and license for its use by others have been determined.

5. Statistical Approach

Various grapheme-to-phoneme conversion methods can be used, but three types are most important and most often used: rule-based methods, dictionary-based methods, and statistical methods [51].

Rule-based grapheme-to-phoneme conversion methods may be complemented by dictionary-based and statistical methods. The rule-based grapheme-to-phoneme conversion system discussed in this paper was complemented by the dictionary-based method. Statistical methods can also improve grapheme-to-phoneme conversion for Polish in the future. For this purpose, the author performed statistical analysis of the grapheme-to-phoneme conversion rules in Polish.

For better analysis of the obtained results, a special naming scheme for conversion rules was adopted. The naming conventions of the grapheme-to-phoneme conversion rules were as follows:

R_{i} = L\_R\_C\_P

(32)

where R_{i} is the name of the rule, L is the orthographic letter to which the rule applies, R is the row number in the grapheme-to-phoneme conversion table, C is the column number in the grapheme-to-phoneme conversion table, and P is the phonemic letter to which the rule applies. A sample list of grapheme-to-phoneme conversion rules for letter “ą” using these naming conventions is presented in Table 21.
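The naming scheme of Formula (32) can be illustrated with a small Python sketch that builds and parses rule names. The class and function names are hypothetical, and the example values are made up for illustration; they are not taken from Table 21.

```python
from typing import NamedTuple


class RuleName(NamedTuple):
    letter: str    # L: orthographic letter to which the rule applies
    row: int       # R: row number in the conversion table
    column: int    # C: column number in the conversion table
    phoneme: str   # P: phonemic symbol produced by the rule


def format_rule_name(rule: RuleName) -> str:
    """Join the four fields with underscores, following L_R_C_P."""
    return f"{rule.letter}_{rule.row}_{rule.column}_{rule.phoneme}"


def parse_rule_name(name: str) -> RuleName:
    """Split a rule name back into its four fields."""
    # maxsplit=3 keeps any further underscores inside the phoneme field.
    letter, row, column, phoneme = name.split("_", 3)
    return RuleName(letter, int(row), int(column), phoneme)


example = RuleName("ą", 1, 2, "om")   # made-up values for illustration only
name = format_rule_name(example)
print(name, parse_rule_name(name) == example)
```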

The frequency of each grapheme-to-phoneme conversion rule R_{i} was calculated during automatic grapheme-to-phoneme conversion of an orthographic text corpus file in Polish containing 1,843,069,533 letters, obtained from the National Corpus of Polish. A sample list of the most frequently used grapheme-to-phoneme conversion rules in Polish is presented in Table 22.

The frequencies of grapheme-to-phoneme conversion rules used in Polish are presented in Figure 2.

The values presented in Table 22 depend on the frequencies of letters in Polish orthography, which were also calculated for the orthographic language corpus used. The frequencies of the letters of Polish orthography are presented in Table 23 and Figure 3.
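As a rough illustration of how such ranked frequencies can be obtained, the following Python sketch counts letters in a text and returns relative frequencies ordered from most to least frequent. It is only an assumption about one possible procedure, not a description of how the author's figures were produced; the same counting pattern would apply to rule names recorded during conversion.

```python
from collections import Counter


def relative_letter_frequencies(text: str) -> list[tuple[str, float]]:
    """Return (letter, relative frequency) pairs ranked from most frequent."""
    # Case is folded and non-alphabetic characters are ignored (an assumption).
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return [(letter, n / total) for letter, n in counts.most_common()]


ranked = relative_letter_frequencies("Ala ma kota")
print(ranked[0])
```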

The frequencies of words occurring in any language are well described by Zipf’s law [59].

Z_{r} = \frac{a}{r^{b}}

(33)

where Z_{r} is the frequency of the word of rank r, with words ranked from most frequent (r = 1) to least frequent (r = n), and a and b are parameters estimated from the statistical data. One usually finds that b is close to 1 [59]. Zipf’s law is not restricted to language [60]. However, Zipf’s law does not describe well the distributions of orthographic letters and phonemes in representations of words. An examination of such frequencies for 95 languages reported in the literature [60] shows that phonemic and orthographic letter frequencies are best described by an equation first developed by Yule, which also describes the distribution of DNA codons [61]. The frequencies of letters in a language’s orthography are well described by Yule’s equation [60]:

Y_{r} = \frac{a}{r^{b}} \cdot c^{r}

(34)

where Y_{r} is the frequency of the letter of rank r, with letters ranked from highest frequency (r = 1) to lowest frequency (r = n), and a, b, and c are parameters estimated from the statistical data. The fit of Yule’s equation to the ranked frequency distribution of Polish orthographic letters is presented in Figure 4.
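One way to estimate the parameters of Yule’s equation is to take logarithms, which gives ln Y_{r} = ln a − b ln r + r ln c, a model that is linear in ln a, b, and ln c and can be solved by ordinary least squares. The Python sketch below follows this approach using only the standard library. The fitting procedure actually used for the reported results is not stated in the text, so this is an assumption, and a least-squares fit in log space generally differs somewhat from a fit on the untransformed frequencies. All function names are illustrative.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def yule(rank, a, b, c):
    """Evaluate Yule's equation Y_r = a / r**b * c**r."""
    return a / rank ** b * c ** rank


def fit_yule(frequencies):
    """Fit a, b, c to positive frequencies sorted from highest to lowest."""
    rows, targets = [], []
    for rank, freq in enumerate(frequencies, start=1):
        rows.append([1.0, -math.log(rank), float(rank)])
        targets.append(math.log(freq))
    # Normal equations (X^T X) p = X^T y for p = [ln a, b, ln c].
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    ln_a, b, ln_c = solve_linear_system(xtx, xty)
    return math.exp(ln_a), b, math.exp(ln_c)


# Synthetic data generated from known parameters, not corpus measurements.
synthetic = [yule(r, 0.1, 0.5, 0.9) for r in range(1, 33)]
a_hat, b_hat, c_hat = fit_yule(synthetic)
print(a_hat, b_hat, c_hat)
```

Setting c = 1 reduces the model to Zipf’s law in Formula (33), so the same routine with the last column dropped would fit a and b alone.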

The quality of the fit of Yule’s equation to the ranked frequency distribution of the Polish letters was measured by the coefficient of determination R^{2}. For the fit of Yule’s equation, presented in Formula (34), to the ranked frequency distribution of Polish letters, the coefficient of determination is:

R^{2} = 0.97509

(35)

Additionally, the root mean square error (RMSE) was calculated for this case:

RMSE = 2.9071 \cdot 10^{-3}

(36)
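Both goodness-of-fit measures can be computed from the observed frequencies and the values predicted by the fitted equation. The Python sketch below assumes the standard definitions, R^{2} = 1 − SS_res / SS_tot and RMSE as the square root of the mean squared residual; the text does not spell out the exact formulas used, and the numbers in the example are arbitrary illustrations rather than corpus data.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)


# Arbitrary illustrative values, not measurements from the corpus.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(observed, predicted), rmse(observed, predicted))
```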

The R^{2} value indicates how well statistical data fit a statistical model. The R^{2} value here is equal to 0.97509 and indicates that Yule’s equation fits the obtained statistical data on orthographic letter frequencies in Polish very well. An analogous regularity was observed for the frequency distribution of the grapheme-to-phoneme conversion rules for Polish presented in Table 22. Figure 5 presents the fit of Yule’s equation to the ranked frequency distribution of grapheme-to-phoneme conversion rules used for Polish.

The summary of evaluation results for the fit of Yule’s equation to the ranked frequency distribution of orthographic letters (1) and grapheme-to-phoneme conversion rules in Polish (2) is presented in Table 24.

The values of R^{2} presented in Table 24 indicate that Yule’s equation fits the obtained statistical data for the frequencies of orthographic letters and the grapheme-to-phoneme conversion rules in Polish. On this basis, it can be concluded that the data obtained from statistical analysis of grapheme-to-phoneme conversion rules used in Polish, based on an orthographic language corpus, are reliable. The results presented, from the statistical analysis of grapheme-to-phoneme conversion rules in Polish, represent the basis for a statistical approach to Polish G2P in a future implementation.

6. Conclusions

The results of the grapheme-to-phoneme conversion research presented in this paper were compared to other results published in the literature [3,27,39,41,42,43,45,51,53,55,62,63,64,65,66,67,68,69,70,71,72,73]. On the basis of this comparison, the following conclusions can be drawn:

Automatic conversion of graphemes into phonemes in orthographic texts is not only a technical issue, consisting in developing appropriate algorithms for converting graphemes into phonemes, but also a serious linguistic problem. Only specialists in linguistics and phonetics of a given language are able to formulate appropriate rules for converting graphemes into phonemes for speech [51];
An additional complication is that automatic conversion of graphemes to phonemes is a language-specific problem with different spelling and pronunciation conventions within the same language [55,68,69,70];
Effective solutions for automatic grapheme-to-phoneme conversion in one language may not help solve the same problems for a different language. Automatic conversion of graphemes to phonemes is not a single linguistic and technical problem, but many different problems with different levels of difficulty that should be solved for each language separately [51];
Automatic grapheme-to-phoneme conversion is widely used not only in speech synthesis, but also in speech recognition [3,53];
A separate, but very important problem is the evaluation of grapheme-to-phoneme conversion processes [53,71]. Evaluation and validation of grapheme-to-phoneme conversion implementations is a laborious and time-consuming process. All problems registered for the G2P implementation discussed in this paper were positively resolved;
The G2P implementation developed for this research is not the only one for Polish [27,39,41,43,45]; however, only one of the others is available for free use [41];
For comparison, the author analysed the only freely available G2P application for Polish, named Transcriber [41]. That application was implemented in the C++ programming language and uses a dictionary of 5018 words and 767 defined conversion rules. By contrast, the software presented in this paper was implemented in the Python programming language with 975 conversion rules, and its dictionary is very limited and plays only a supporting role. TransFon thus implements 208 more transcription rules, which is over 27% more. Transcriber failed to compile because the libraries used by its programmer were not included with the source code. This made it impossible to evaluate the correctness of Transcriber and seriously hindered the comparison with the software created by the author of this paper. However, analysis of the source code shows that Transcriber is also rule-based, but that its author tried to refine and improve its performance by adding new words (exceptions) to the dictionary. The author of TransFon, on the other hand, tried to add and supplement transcription rules in a manner similar to that known from the literature. This is evidenced by the dictionary sizes used in the two applications;
The G2P system presented here could be used for Polish corpus development;
The G2P implementation presented here did not exploit any similar pre-existing tools [48];
It is worth noting that the solutions presented here for the development of language and speech corpora in Polish are not the only ones and publications on this subject are available [72,73];
Of particular interest are the results presented in publications by Grażyna Demenko et al. [39,62,63,64,65,66,67].

This paper presents a rule-based method for grapheme-to-phoneme conversion and an implementation for Polish. The author’s major original contributions presented in this paper are as follows:

Implementation, in the Python programming language, of the rules for converting graphemes into phonemes in Polish that are known from the linguistic literature [36,37,40,44];
Developing an algorithm for automatic conversion of graphemes into phonemes for the Polish language and implementing it in the Python programming language with numerous improvements;
Development of software for automatic conversion of graphemes into phonemes, called TransFon, which enables automatic conversion of any orthographic text file in the Polish language;
Application of the developed methods to create phoneme-based language corpora using the automatic conversion of graphemes to phonemes;
Statistical analysis of the occurrence frequency of particular grapheme-to-phoneme conversion rules in Polish;
Comparison of the results obtained with those published in the literature and discussion.

It should be noted that the research presented in this article used the basic principles and fundamental grapheme-to-phoneme conversion rules developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish [36,37]. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm and software application. The application allows the user to convert any text in Polish orthography to the corresponding strings of phonemes and to create large phonemic language corpora from orthographic language corpora.

The system for rule-based grapheme-to-phoneme conversion implemented here is complemented by dictionary-based methods, and was used to obtain statistics for the use of grapheme-to-phoneme conversion rules in Polish, potentially enabling the improvement of grapheme-to-phoneme conversion for Polish in the future.

The grapheme-to-phoneme conversion system developed and its ability to create phonemic language corpora for Polish open up further opportunities for research on improving automatic speech recognition in Polish. The plan for further research towards achieving this goal, using the phonemic language corpus developed, includes:

Performing a better and more detailed statistical analysis of the Polish language based on the phonemic language corpus developed [17,19];
Developing more efficient word-based and phoneme-based statistical language models for speech recognition applications in Polish [18,19];
Application of deep learning methods to language modelling and speech recognition [20,21].

The main problem in the development of phoneme-based statistical language models for Polish is the difficulty in obtaining sufficiently large phonemic language corpora. The phonemic language corpus development method presented in this paper, based on automatic grapheme-to-phoneme conversion, can significantly remedy this problem.

Funding

This work was supported by the Polish Ministry of Science and Higher Education funding for statutory activities at the Silesian University of Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

Hirschberg, J.; Manning, C.D. Advances in natural language processing. Science 2015, 349, 261–266.
Lee, F. Automatic grapheme-to-phone translation of English. J. Acoust. Soc. Am. 1967, 41, 1594A.
Bagshaw, P. Phonemic transcription by analogy in text-to-speech synthesis: Novel word pronunciation and lexicon compression. Comput. Speech Lang. 1998, 12, 119–142.
Kawaguchi, Y.; Takagaki, T.; Tomimori, N.; Tsuruga, Y. Corpus-Based Perspectives in Linguistics. In Usage-Based Linguistic Informatics; John Benjamins Publishing Company: Amsterdam, The Netherlands, 2007.
Kłosowski, P. Speech Processing Application Based on Phonetics and Phonology of the Polish Language. In Communications in Computer and Information Science, Proceedings of the 17th International Conference Computer Networks, Ustron, Poland, 15–19 June 2010; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks; Springer: Berlin, Germany, 2010; Volume 79, pp. 236–244.
Kłosowski, P. Improving speech processing based on phonetics and phonology of Polish language. Prz. Elektrotech. 2013, 89, 303–307.
Izydorczyk, J.; Kłosowski, P. Acoustic properties of Polish vowels. Bull. Polish Acad. Sci. Tech. Sci. 1999, 47, 29–37.
Izydorczyk, J.; Kłosowski, P. Base acoustic properties of Polish speech. In Proceedings of the International Conference Programable Devices and Systems PDS2001 IFAC Workshop (IFAC 2001), Gliwice, Poland, 22–23 November 2001; pp. 61–66.
Kłosowski, P.; Dustor, A.; Izydorczyk, J.; Kotas, J.; Slimok, J. Speech Recognition Based on Open Source Speech Processing Software. In Communications in Computer and Information Science, Proceedings of the 21st International Science Conference on Computer Networks (CN), Brunow, Poland, 23–27 June 2014; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2014; Volume 431, pp. 308–317.
Dustor, A.; Kłosowski, P. Biometric Voice Identification Based on Fuzzy Kernel Classifier. In Communications in Computer and Information Science, Proceedings of the 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, 17–21 June 2013; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2013; Volume 370, pp. 456–465.
Dustor, A.; Kłosowski, P.; Izydorczyk, J. Influence of Feature Dimensionality and Model Complexity on Speaker Verification Performance. In Communications in Computer and Information Science, Proceedings of the 21st International Science Conference on Computer Networks (CN), Brunow, Poland, 23–27 June 2014; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2014; Volume 431, pp. 177–186.
Dustor, A.; Kłosowski, P.; Izydorczyk, J. Speaker recognition system with good generalization properties. In Proceedings of the 2014 International Conference on Multimedia Computing and Systems (ICMCS), Marrakech, Morocco, 14–16 April 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 206–210.
Dustor, A.; Kłosowski, P.; Izydorczyk, J.; Kopanski, R. Influence of Corpus Size on Speaker Verification. In Communications in Computer and Information Science, Proceedings of the 22nd International Conference on Computer Networks (CN), Brunow, Poland, 16–19 June 2015; Gaj, P., Kwiecien, A., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2015; Volume 522, pp. 242–249.
Kłosowski, P.; Dustor, A.; Izydorczyk, J. Speaker verification performance evaluation based on open source speech processing software and timit speech corpus. In Communications in Computer and Information Science, Proceedings of the 22nd International Conference on Computer Networks (CN), Brunow, Poland, 16–19 June 2015; Gaj, P., Kwiecien, A., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2015; Volume 522, pp. 400–409.
Kłosowski, P.; Dustor, A. Automatic Speech Segmentation for Automatic Speech Translation. In Communications in Computer and Information Science, Proceedings of the 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, 17–21 June 2013; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2013; Volume 370, pp. 466–475.
Bellegarda, J.R.; Monz, C. State of the art in statistical methods for language and speech processing. Comput. Speech Lang. 2016, 35, 163–184.
Kłosowski, P. Statistical analysis of Polish language corpus for speech recognition application. In Proceedings of the 20th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznań, Poland, 21–23 September 2016; pp. 304–309.
Kłosowski, P. Polish language modelling for speech recognition application. In Proceedings of the 21st IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznan, Poland, 20–22 September 2017; pp. 313–318.
Kłosowski, P. Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling. EURASIP J. Audio Speech Music Process. 2017, 2017, 5.
Kłosowski, P. Deep learning for natural language processing and language modelling. In Proceedings of the 22nd IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznan, Poland, 19–21 September 2018; pp. 223–228.
Kłosowski, P. Polish language modelling based on deep learning methods and techniques. In Proceedings of the 23rd IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznan, Poland, 18–20 September 2019; pp. 223–228.
Adda-Decker, M. Corpus for automatic speech recognition. Rev. Fr. Linguist. Appl. 2007, 12, 71–84.
Drgas, S.; Dabrowski, A. Speaker recognition based on multilevel speech signal analysis on Polish corpus. Multimed. Tools Appl. 2015, 74, 4195–4211.
Furui, S. Recent progress in corpus-based spontaneous speech recognition. IEICE Trans. Inf. Syst. 2005, 88, 366–375.
Lecouteux, B.; Linares, G.; Oger, S. Integrating imperfect transcripts into speech recognition systems for building high-quality corpora. Comput. Speech Lang. 2012, 26, 67–89.
Coulmas, F. The Blackwell’s Encyclopedia of Writing Systems; Blackwells: Oxford, UK, 1996.
Przybysz, P.; Kasprzak, W. The generation of letter-to-sound rules for grapheme-to-phoneme conversion. In Proceedings of the 2013 6th International Conference on Human System Interactions (HSI), Sopot, Poland, 6–8 June 2013; pp. 292–297.
International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet; A Regents Publication; Cambridge University Press: Cambridge, UK, 1999.
Sussex, R.; Cubberley, P. The Slavic Languages. In Cambridge Language Surveys; Cambridge University Press: Cambridge, UK, 2006.
Wells, J. SAMPA computer readable phonetic alphabet. In Handbook of Standards and Resources for Spoken Language Systems, Vol. Part IV, Section B; Gibbon, D., Moore, R., Winski, R., Eds.; Mouton de Gruyter: Berlin, Germany; New York, NY, USA, 1997.
Kučera, H. Mechanical phonemic transcription and phoneme frequency count of Czech. Int. J. Slav. Linguistic Phon. 1963, 6, 36–50.
Bhimani, B.; Dolby, J. Acoustic phonetic transcription of written English. In Annual Report: Automatic Indexing and Abstracting; AIP Publishing: Palo Alto, CA, USA, 1966.
Pratt, B.; Silva, G. Phontrns: A Procedure which Uses a Computer for Transcribing French Text into Phonetic Symbols; Monash University: Melbourne, Australia, 1967.
Ungeheuer, G.; Kästner, W. Untersuchungen zur Transformation Deutscher Schrifttexte in Entsprechende Phonemtexte mit Hilfe Elektronischer Rechenmaschinen; Forschungsbericht; Institut für Phonetik und Kommunikationsforschung der Universität Bonn: Bonn, Germany, 1966.
Doroszewski, W. Speech and writing (in Polish: Mowa a pismo). Porad. Jęz. 1969, 4, 181–188.
Steffen-Batóg, M. The problem of automatic phonemic transcription of written Polish. Biul. Fonogr. 1973, XIV, 75–86.
Steffen-Batóg, M. Automatic Phonemic Transcription of Polish Texts (In Polish: Automatyzacja Transkrypcji Fonematycznej Tekstów Polskich); Wydawnictwo Naukowe PWN: Warszawa, Poland, 1975.
Warmus, M. Software implementation for ODRA 1204 of automatic phonemic transcription of Polish texts (in Polish: Program na maszynę ODRA 1204 dla automatycznej transkrypcji fonematycznej tekstów języka polskiego). In Zastosowanie Maszyn Matematycznych do Badań nad Językiem Naturalnym; Bolc, L., Ed.; Wydawnictwo Uniwersytetu Warszawskiego: Warszawa, Poland, 1973.
Demenko, G.; Wypych, M.; Baranowska, E. Implementation of grapheme-to-phoneme rules and extended SAMPA alphabet in Polish text-to-speech synthesis. Speech Lang. Technol. 2003, 7, 79–97.
Jassem, W. A phonemic transcription and syllable division rule engine. In Onomastica-Copernicus Research Colloquium; University of Edinburgh: Edinburgh, UK, 1996.
Koržinek, D.; Brocki, Ł.; Marasek, K. Polish Grapheme-to-Phoneme Tool and Service, CLARIN-PL Digital Repository (2016). Available online: https://clarin-pl.eu/dspace/handle/11321/295 (accessed on 10 January 2022).
Koržinek, D.; Marasek, K.; Brocki, Ł.; Wołk, K. Polish read speech corpus for speech tools and services. In CLARIN Common Language Resources and Technology Infrastructure, Proceedings of the Selected Papers from the CLARIN Annual Conference 2016, Aix-en-Provence, France, 26–28 October 2016; Number 136; Linköping University Electronic Press, Linköpings Universitet: Linköping, Sweden, 2017; pp. 54–62.
Skurzok, D.; Ziółko, B.; Ziółko, M. Ortfon2—Tool for orthographic to phonetic transcription. In Proceedings of the 7th Language & Technology Conference, Poznań, Poland, 27–29 November 2015.
Steffen-Batóg, M.; Nowakowski, P. An algorithm for phonetic transcription of orthographic texts in Polish. In Studia Phonetica Posnaniensia; Steffen-Batóg, M., Awedyk, W., Eds.; Wydawnictwo Naukowe UAM: Poznań, Poland, 1993; Volume 3.
Wypych, M. Implementation of phonemic transcription algorithm (in Polish: Implementacja algorytmu transkrypcji fonematycznej). In Speech and Language Technology; Polskie Towarzystwo Fonetyczne: Poznań, Poland, 1999; Volume 3.
Razavi, M.; Rasipuram, R.; Doss, M.M. Acoustic data-driven grapheme-to-phoneme conversion in the probabilistic lexical modeling framework. Speech Commun. 2016, 82, 1–21.
Kaplan, R.M.; Kay, M. Regular models of phonological rule systems. Comput. Linguist. 1994, 20, 331–378.
Kłosowski, P. Algorithm and implementation of automatic phonemic transcription for Polish. In Proceedings of the 20th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznań, Poland, 21–23 September 2016; pp. 298–303.
Python Software Foundation: About Python (2014). Available online: https://www.python.org/about/ (accessed on 10 January 2022).
Przepiórkowski, A.; Bańko, M.; Górski, R.L.; Lewandowska-Tomaszczyk, B. The National Corpus of Polish (in Polish: Narodowy Korpus Języka Polskiego); Wydawnictwo Naukowe PWN: Warszawa, Poland, 2012.
Auzina, I.; Pinnis, M.; Dargis, R. Comparison of Rule-based and Statistical Methods for Grapheme to Phoneme Modelling. In Frontiers in Artificial Intelligence and Applications, Proceedings of the Human Language Technologies—The Baltic Perspective, Baltic HLT 2014, Kaunas, Lithuania, 26–27 September 2014; Utka, A., Grigonyte, G., Kapociute Dzikiene, J., Vaicenoniene, J., Eds.; Vytautas Magnus University ViaConventus: Vilnius, Lithuania, 2014; Volume 268, pp. 57–60.
Decadt, B.; Duchateau, J.; Daelemans, W.; Wambacq, P. Phoneme-to-grapheme conversion for out-of-vocabulary words in large vocabulary speech recognition. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU ’01), Madonna di Campiglio, Italy, 9–13 December 2001; pp. 413–416.
Jouvet, D.; Fohr, D.; Illina, I. Evaluating grapheme-to-phoneme converters in automatic speech recognition context. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4821–4824.
Kheang, S.; Katsurada, K.; Iribe, Y.; Nitta, T. Novel two-stage model for grapheme-to-phoneme conversion using new grapheme generation rules. In Proceedings of the 2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA), Bandung, Indonesia, 20–21 August 2014; pp. 97–102.
Schlippe, T.; Ochs, S.; Schultz, T. Grapheme-to-phoneme model generation for indo-european languages. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4801–4804.
Przepiórkowski, A.; Górski, R.L.; Lewandowska-Tomaszczyk, B.; Aziński, M. Towards the national corpus of Polish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 28–30 May 2008; European Language Resources Association (ELRA): Paris, France, 2008.
Mickiewicz, A. Pan Tadeusz, Czyli, Ostatni Zajazd na Litwie: Historja Szlachecka z r. 1811 i 1812, We Dwunastu Księgach, Wierszem; Wydawnictwo Zakładu Narodowego Im. Ossolińskich: Warszawa, Poland, 1834; Available online: https://wolnelektury.pl/katalog/lektura/pan-tadeusz.html (accessed on 10 January 2022).
Ney, H. Corpus-based statistical methods in speech and language processing. In Text, Speech and Language Technology, Proceedings of the 2nd European Summer School on Language and Speech Communication, Utrecht, The Netherlands, 1994; Corpus-Based Methods in Language and Speech Processing; Young, S., Bloothooft, G., Eds.; Kluwer Academic Publishers: London, UK, 1997; Volume 2, pp. 4–26.
Zipf, G.K. Human behavior and the principle of least effort. J. Clin. Psychol. 1950, 6, 306.
Tambovtsev, Y.; Martindale, C. Phoneme Frequencies Follow a Yule Distribution. SKASE J. Theor. Linguist. 2008, 4, 1–11.
Yule, G.U. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philos. Trans. R. Soc. Lond. B Biol. Sci. 1925, 213, 21–87.
Cylwik, N.; Wagner, A.; Demenko, G. The euronounce corpus of non-native polish for asr-based pronunciation tutoring system. In Proceedings of the 2nd ISCA Workshop of Speech and Language Technology in Education, Warwickshire, UK, 3–5 September 2009. [Google Scholar]
Demenko, G. Korpusowe Badania JęZyka MóWionego; Akademicka Oficyna Wydawnicza EXIT: Warszawa, Polish, 2015; ISBN 9788378370437. [Google Scholar]
Demenko, G.; Bachan, J.; Wagner, A.; Wyroślak, P. Speech corpus creation for automatic analysis of phonetic convergence. In Studientexte zur Sprachkommunikation, Proceedings of 27th Conference on Electronic Speech Signal Processing (ESSV), Leipzig, Germany, 2–4 March 2016; Oliver, J., Ed.; Hochschule für Telekommunikation Leipzig (HfTL): Leipzig, Germany, 2016; pp. 183–190. [Google Scholar]
Demenko, G.; Grocholewski, S.; Klessa, K.; Rau, Z. Polish language resources for speech technology: Jurisdic lvcsr corpora. In Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 4th Language & Technology Conference, Poznań, Poland, 6–8 November 2009; Zygmunt, V., Ed.; Adam Mickiewicz University: Poznań, Poland, 2009; pp. 165–169. [Google Scholar]
Demenko, G.; Klessa, K.; Szymański, M.; Breuer, S.; Hess, W. Polish unit selection speech synthesis with boss: Extensions and speech corpora. Int. J. Speech Technol. 2010, 13, 85–99. [Google Scholar]
Demenko, G.; Szymański, M.; Cecko, R.; Lange, M.; Klessa, K.; Owsianny, M. Development of large vocabulary continuous speech recognition using phonetically structured speech corpus. In Proceedings of the 17th International Congress of Phonetic Sciences (ICPhS XVII), Hong Kong, China, 17–21 August 2011; pp. 568–571. [Google Scholar]
Kosaner, O.; Birant, C.C.; Aktas, O. Improving Turkish language training materials: Grapheme-to-phoneme conversion for adding phonemic transcription into dictionary entries and course books. In Procedia Social and Behavioral Sciences, Proceedings of the 13th International Educational Technology Conference, Lisbon, Portugal, 30 October–1 November 2014; Isman, A., Siraj, S., Kiyici, M., Eds.; Volume 103, pp. 473–484.
Lee, J.; Kim, B.; Lee, G.G. Hybrid Approach to Grapheme to Phoneme Conversion for Korean. In Proceedings of the InterSpeech 2009: 10th Annual Conference of the International Speech Communication Association 2009, Brighton, UK, 6–10 September 2009; Volume 1–5, pp. 1299–1302.
de Jesus Aguiar Pontes, J.; Furui, S. Predicting the phonetic realizations of word-final consonants in context—A challenge for French grapheme-to-phoneme converters. Speech Commun. 2010, 52, 847–862.
Schraagen, M.; Bloothooft, G. A qualitative evaluation of phoneme-to-phoneme technology. In Proceedings of the 12th Annual Conference of the International-Speech-Communication-Association 2011 (Interspeech 2011), Florence, Italy, 27–31 August 2011; Volume 1–5, pp. 2332–2335.
Żelasko, P.; Ziółko, B.; Jadczyk, T.; Skurzok, D. AGH corpus of Polish speech. Lang. Resour. Eval. 2016, 50, 585–601.
Ziółko, B.; Jadczyk, T.; Skurzok, D.; Żelasko, P.; Gałka, J.; Pędzimąż, T.; Gawlik, I.; Pałka, S. SARMATA 2.0 automatic Polish language speech recognition system. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; Interspeech: Dresden, Germany, 2015.

Figure 1. The block diagram of a grapheme-to-phoneme conversion algorithm for a single orthographic letter.

Figure 2. The frequencies of grapheme-to-phoneme conversion rules used in Polish.

Figure 3. The frequencies of letters in Polish orthography.

Figure 4. The fit of Yule’s equation to the ranked frequency distribution of the Polish orthographic letters.

Figure 5. The fit of Yule’s equation to the ranked frequency distribution of grapheme-to-phoneme conversion rules used for Polish.

Table 1. The set of Polish phonemes used in this study, with examples, written in the SPA, IPA, and SAMPA phonetic alphabets.

No.	SPA	IPA	SAMPA	Example of Occurrence in Polish
1	[e]	[ɛ]	`[e]`	serce
2	[a]	[ɑ]	`[a]`	baba
3	[o]	[ɔ]	`[o]`	oko
4	[t]	[t]	`[t]`	trawa
5	[n]	[n]	`[n]`	noc
6	[y]	[ɨ]	`[I]`	syty
7	[i̯]	[j]	`[j]`	jajo
8	[i]	[i]	`[i]`	wici
9	[r]	[r]	`[r]`	rok
10	[s]	[s]	`[s]`	sok
11	[v]	[v]	`[v]`	wada
12	[p]	[p]	`[p]`	praca
13	[u]	[u]	`[u]`	buk
14	[m]	[m]	`[m]`	mama
15	[k]	[k]	`[k]`	kot
16	[ń]	[ɲ]	`[n’]`	koń
17	[d]	[d]	`[d]`	dudek
18	[l]	[l]	`[l]`	lato
19	[u̯]	[ɫ]	`[w]`	łysy
20	[š]	[ʃ]	`[S]`	szyszka
21	[f]	[f]	`[f]`	fala
22	[z]	[z]	`[z]`	koza
23	[c]	[ʦ͡]	`[ts]`	cacko
24	[b]	[b]	`[b]`	baba
25	[g]	[g]	`[g]`	godło
26	[ś]	[ɕ]	`[s’]`	siano
27	[ć]	[ʨ͡]	`[ts’]`	ciasto
28	[]	[ʝ]	`[x]`	higiena
29	[č]	[ʧ͡]	`[tS]`	czarny
30	[ž]	[ʒ]	`[Z]`	każdy
31	[]	[]	`[e ]`	ręka
32	[ḱ]	[c]	`[k’]`	kino
33	[]	[ʥ͡]	`[dz’]`	dziedzic
34	[ʒ]	[ʣ͡]	`[dz]`	nadzy
35	[ź]	[ʑ]	`[z’]`	ziarno
36	[ǵ]	[ɟ]	`[g’]`	magiczny
37	[]	[ʤ͡]	`[dZ]`	drożdże

Table 2. The general form of a grapheme-to-phoneme conversion rule-table [37].

$α_{k}$	$Δ_{1}$	…	$Δ_{j}$	…	$Δ_{m - 1}$	$Δ_{m}$
$Γ_{1}$	$β_{1, 1}$	…	$β_{1, j}$	…	$β_{1, m - 1}$	$β_{1, m}$
$Γ_{2}$	$β_{2, 1}$	…	$β_{2, j}$	…	$β_{2, m - 1}$	$β_{2, m}$
…	…	…	…	…	…	…
$Γ_{i}$	$β_{i, 1}$	…	$β_{i, j}$	…	$β_{i, m - 1}$	$β_{i, m}$
…	…	…	…	…	…	…
$Γ_{n - 1}$	$β_{n - 1, 1}$	…	$β_{n - 1, j}$	…	$β_{n - 1, m - 1}$	$β_{n - 1, m}$
$Γ_{n}$	$β_{n, 1}$	…	$β_{n, j}$	…	$β_{n, m - 1}$	$β_{n, m}$
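
The general form in Table 2 can be read as a two-dimensional lookup: the row is selected by the left context Γ of the letter, the column by its right context Δ, and the cell β holds the phoneme string. The following is a minimal sketch of that reading, not the TransFon implementation. The class `RuleTable`, the helper `starts_with`, the modelling of contexts as predicates, and the first-match search order are all assumptions made for illustration; the two toy tables follow Table 3 (letter “a”) and Table 11 (letter “n” before “ż”).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: a context is modelled as a predicate over the text to the
# left (row contexts) or to the right (column contexts) of the letter.
Context = Callable[[str], bool]


@dataclass
class RuleTable:
    letter: str
    rows: list[Context]                # left contexts, one per table row
    cols: list[Context]                # right contexts, one per table column
    cells: list[list[Optional[str]]]   # cells[i][j]; None marks an empty cell

    def lookup(self, left: str, right: str) -> Optional[str]:
        """Return the first non-empty cell whose row and column both match."""
        for i, row_matches in enumerate(self.rows):
            if not row_matches(left):
                continue
            for j, col_matches in enumerate(self.cols):
                if col_matches(right) and self.cells[i][j] is not None:
                    return self.cells[i][j]
        return None


def any_context(_: str) -> bool:
    """Stands in for the unrestricted context X."""
    return True


def starts_with(prefix: str) -> Context:
    """Hypothetical helper: right context begins with the given letters."""
    return lambda text: text.startswith(prefix)


# Table 3: "a" maps to [a] in every context.
table_a = RuleTable("a", [any_context], [any_context], [["a"]])
# Table 11: "n" followed by "ż" maps to [ŋ], regardless of the left context.
table_n = RuleTable("n", [any_context], [starts_with("ż")], [["ŋ"]])

print(table_a.lookup("b", "ba"))    # letter "a" inside "baba"
print(table_n.lookup("bra", "ży"))  # letter "n" inside "branży"
print(table_n.lookup("ko", "c"))    # no matching cell in this addition table
```

A `None` result signals that the table has no applicable cell, so a caller would fall back to another table for the same letter.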

Table 3. The grapheme-to-phoneme conversion rule-table for the letter “a” in Polish [37].

a	X
X	a

Table 4. The grapheme-to-phoneme conversion rule-table for the letter “ą” in Polish [37].

ą	$Δ_{1}$	$Δ_{2}$	$Δ_{3}$	$Δ_{4}$	$Δ_{5}$	$Δ_{6}$	$Δ_{7}$	$Δ_{8}$	$Δ_{9}$
X	o	om	on	on	on	on	oń	oŋ	oŋ

Table 5. The grapheme-to-phoneme conversion rule-table for the letter “u” in Polish, part 1 [37].

u	X - /	/
X - {a, e}	u	u̯
rze	u	u̯
ie	u	u̯
poza	u	u̯
pra	u	u̯
dna	u	u̯
una	u	u̯
ena	u	u̯
#na	u	u̯

Table 6. The grapheme-to-phoneme conversion rule-table for the letter “u” in Polish, part 2 [37].

u	X
(X - {e, o, #})za	u̯
(X - p){oza, ra}	u̯
(X - {d, u, e, o, a, y, #})na	u̯

Table 7. The grapheme-to-phoneme conversion rule-table for the letter “u” in Polish, part 3 [37].

u	k	w	ł	cz	sz	{s, m}O	t
(X - {z, i})e	u̯	u̯	u̯	u̯	u	u	u̯
(X - r)ze	u̯	u̯	u̯	u̯	u	u	u̯
(X - {z, r, n})a	u̯	u̯	u	u̯	u̯	u̯	u̯
#za	u	u	u	u	u	u	u̯
eza	u̯	u	u̯	u̯	u̯	u̯	u̯
ona	u	u̯	u̯	u	u̯	u̯	u̯
ana	u	u̯	u̯	u̯	u̯	u̯	u̯
yna	u	u̯	u̯	u̯	u̯	u̯	u̯

Table 8. The grapheme-to-phoneme conversion rule-table for the letter “u” in Polish, part 4 [37].

u	X - {t, k, w, ł, c, s, m}	c (X - z)	s (X - (O + z))	m (X - O)
(X - {z, i})e	u̯	u̯	u̯	u̯
(X - r)ze	u̯	u̯	u̯	u̯
(X - {z, r, n})a	u̯	u̯	u̯	u̯
#za	u	u	u	u
eza	u̯	u̯	u̯	u̯
ona	u̯	u̯	u̯	u̯
ana	u̯	u̯	u̯	u̯
yna	u̯	u̯	u̯	u̯

Table 9. The list of implemented grapheme-to-phoneme conversion rules for Polish.

Table Number	Orthographic Letter	No. of Rows	No. of Columns	No. of Cells	No. of Rules
1	z	34	64	2079	174
2	s	17	62	976	157
3	r	11	22	210	91
4	u	15	15	196	88
5	d	12	46	495	83
7	i	14	13	156	80
6	ć	12	19	198	40
8	ż	13	16	180	36
9	c	11	28	270	36
10	t	6	26	125	31
11	k	4	17	48	18
12	f	6	13	60	16
13	w	2	15	14	14
14	ź	3	9	16	13
15	ś	6	10	45	12
16	p	2	12	11	11
17	b	2	11	10	10
18	g	2	10	9	9
19	ę	2	10	9	9
20	ą	2	10	9	9
21	n	3	9	16	8
22	y	4	3	6	6
23	l	3	3	4	4
24	P	3	3	4	4
25	h	2	2	1	1
26	ł	2	2	1	1
27	ń	2	2	1	1
28	m	2	2	1	1
29	j	2	2	1	1
30	ó	2	2	1	1
31	o	2	2	1	1
32	e	2	2	1	1
33	a	2	2	1	1
34	q	2	2	1	1
35	v	2	2	1	1
36	x	2	2	1	1
37	#	2	2	1	1
38	@	2	2	1	1
39	-	2	2	1	1
40	/	2	2	1	1
			TOTAL	5162	975
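
The cell counts in Table 9 appear to equal (rows − 1) × (columns − 1), which suggests that the row and column counts each include one header line. This is an observation from the listed numbers, not a definition given in the table. The short check below runs it on a few rows copied from Table 9; the name `cell_count` is introduced here for illustration only.

```python
# (letter, rows, columns, cells, rules) copied from selected rows of Table 9
TABLE_9_SAMPLE = [
    ("z", 34, 64, 2079, 174),
    ("s", 17, 62, 976, 157),
    ("r", 11, 22, 210, 91),
    ("u", 15, 15, 196, 88),
    ("d", 12, 46, 495, 83),
    ("ą", 2, 10, 9, 9),
    ("a", 2, 2, 1, 1),
]


def cell_count(rows: int, cols: int) -> int:
    """Cells left after excluding one header row and one header column (assumed)."""
    return (rows - 1) * (cols - 1)


for letter, rows, cols, cells, rules in TABLE_9_SAMPLE:
    # Each sampled row is consistent with the assumed relationship.
    assert cell_count(rows, cols) == cells
    print(f"{letter}: {cells} cells, {rules} rules")
```

For the larger tables the number of rules is well below the number of cells (for example, 174 rules for 2079 cells in the table for “z”).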

Table 10. The grapheme-to-phoneme conversion rule-table addition for the letter “i” in Polish for words: “unii”, “będzie”, “sobie”, “razie”, “diabeł”, and similar contexts.

i	A-{i}	A	e	e	ab
{c,n}	1
dz		1
b			j
z				1
d					j

Table 11. The grapheme-to-phoneme conversion rule-table addition for the letter “n” in Polish for words: “branży” and similar contexts.

n	ż
X	ŋ

Table 12. The grapheme-to-phoneme conversion rule-table addition for the letter “d” in Polish for words: “od”, “pod”, “przed”, “nad”, “miliard”, “wyjazd”, “Witold”, “grand”, “rajd”, “hołd”, “prawd”, and similar contexts.

d	S
A	t
{r, z, l, n, ł, w}	t

Table 13. The grapheme-to-phoneme conversion rule-table addition for the letter “z” in Polish for words: “trzeci”, “zamierza”, “poradzę”, “bezzwrotny”, and similar contexts.

z	X	a	{ą,ę}	z
tr	1
r		1
d			1
e				s

Table 14. The grapheme-to-phoneme conversion rule-table addition for the letter “ż” in Polish for words: “tożsamość”, “manadżer”, and similar contexts.

ż	sa	er
A	ž
d		1

Table 15. The grapheme-to-phoneme conversion rule-table addition for the letter “ć” in Polish for words: “ćwiczenia”, “dziećmi”, “zadośćuczynić”, “ćwiartki”, and similar contexts.

ć	wic	m	u	wi $\cdot A$
X	ć			ć
e		ć
ś			ć

Table 16. The grapheme-to-phoneme conversion rule-table addition for the letter “f” in Polish for words: “Afganistan”, “Hoffman”, and similar contexts.

f	ga	f
X	v	f

Table 17. The grapheme-to-phoneme conversion rule-table addition for the letter “s” in Polish for words: “przeprosin”, “siną”, “Helsinki”, and similar contexts.

s	in	iną	in
A	ś
#	ś	ś
l			s

Table 18. The WER and PER values of the developed grapheme-to-phoneme conversion system before improvements.

No.	Parameter	Value
1	Number of unique words checked	1,943,458
2	Number of G2P conversion errors for unique words	33,638
3	The WER value for unique words	1.731%
4	Number of words in the corpus	230,300,300
5	Number of G2P conversion errors for words in the corpus	3,707,890
6	The WER value for the corpus	1.610%
7	Number of phonemes checked in unique words	16,293,828
8	Number of G2P conversion errors for phonemes	34,324
9	The PER value for unique words	0.211%
10	Number of phonemes in the corpus	1,263,992,460
11	Number of G2P conversion errors for phonemes in the corpus	3,713,206
12	The PER value for the corpus	0.294%

Table 19. The WER and PER values of the developed grapheme-to-phoneme conversion system after improvements.

No.	Parameter	Value
1	Number of unique words checked	1,943,458
2	Number of G2P conversion errors for unique words	7525
3	The WER value for unique words	0.387%
4	Number of words in the corpus	230,300,300
5	Number of G2P conversion errors for words in the corpus	69,802
6	The WER value for the corpus	0.030%
7	Number of phonemes checked in unique words	16,282,255
8	Number of G2P conversion errors for phonemes	8063
9	The PER value for unique words	0.050%
10	Number of phonemes in the corpus	1,263,415,734
11	Number of G2P conversion errors for phonemes in the corpus	73,786
12	The PER value for the corpus	0.006%

Table 20. Summary of the evaluation results of the developed grapheme-to-phoneme conversion system, before and after improvements.

No.	Parameter	Value before Improvements	Value after Improvements
1	The WER value for unique words	1.731%	0.387%
2	The WER value for the corpus	1.610%	0.030%
3	The PER value for unique words	0.211%	0.050%
4	The PER value for the corpus	0.294%	0.006%
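
The WER and PER values in Tables 18 to 20 are error counts divided by the corresponding totals, expressed as percentages. The sketch below recomputes them from the counts listed in Tables 18 and 19; the function name `error_rate` and the dictionary layout are illustrative choices, not part of the described system.

```python
def error_rate(errors: int, total: int) -> float:
    """Return the share of erroneous items as a percentage."""
    return 100.0 * errors / total


# (number of errors, number of items) taken from Table 18
before = {
    "WER, unique words": (33_638, 1_943_458),
    "WER, corpus": (3_707_890, 230_300_300),
    "PER, unique words": (34_324, 16_293_828),
    "PER, corpus": (3_713_206, 1_263_992_460),
}
# (number of errors, number of items) taken from Table 19
after = {
    "WER, unique words": (7_525, 1_943_458),
    "WER, corpus": (69_802, 230_300_300),
    "PER, unique words": (8_063, 16_282_255),
    "PER, corpus": (73_786, 1_263_415_734),
}

for name in before:
    rate_before = error_rate(*before[name])
    rate_after = error_rate(*after[name])
    print(f"{name}: {rate_before:.3f}% -> {rate_after:.3f}%")
```

Rounded to three decimal places, the computed rates agree with the percentages reported in Table 20.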

Table 21. A sample list of grapheme-to-phoneme conversion rules for orthographic letter “ą” [37].

No.	Rule Name $R_{i}$	Orthographic Letter `L`	Row Number `R`	Column Number `C`	Phoneme Letters `P`
1	`ą_1_1_o`	ą	1	1	`[o]`
2	`ą_1_2_om`	ą	1	2	`[om]`
3	`ą_1_3_on`	ą	1	3	`[on]`
4	`ą_1_4_on`	ą	1	4	`[on]`
5	`ą_1_5_on`	ą	1	5	`[on]`
6	`ą_1_6_on`	ą	1	6	`[on]`
7	`ą_1_7_on’`	ą	1	7	`[on’]`
8	`ą_1_8_o^~`	ą	1	8	`[o^~]`
9	`ą_1_9_o^~`	ą	1	9	`[o^~]`
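
The rule names in Table 21 follow the pattern `L_R_C_P`: orthographic letter, row number, column number, and phoneme letters, joined by underscores. A minimal sketch of building and parsing such names is given below. The type `Rule` and the functions `rule_name` and `parse_rule_name` are hypothetical helpers, and the parser assumes that the letter field itself never contains an underscore.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    letter: str    # orthographic letter L
    row: int       # row number R in the rule table
    column: int    # column number C in the rule table
    phonemes: str  # phoneme letters P; may be empty


def rule_name(rule: Rule) -> str:
    """Join the four fields into the L_R_C_P form."""
    return f"{rule.letter}_{rule.row}_{rule.column}_{rule.phonemes}"


def parse_rule_name(name: str) -> Rule:
    """Split a rule name back into its fields.

    Splitting at most three times keeps the phoneme field intact
    even if it is empty or contains further symbols.
    """
    letter, row, column, phonemes = name.split("_", 3)
    return Rule(letter, int(row), int(column), phonemes)


print(parse_rule_name("ą_1_2_om"))
print(rule_name(Rule("ą", 1, 7, "on’")))
```

Names with an empty phoneme field, such as `@_1_1_` in Table 22, parse to a rule whose `phonemes` value is the empty string.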

Table 22. A sample list of the most frequently used grapheme-to-phoneme conversion rules in Polish.

No. i	Number of Occurrences $C (R_{i})$ (of 1843069533)	Frequency of Occurrence $f (R_{i}) \cdot 100$ (in %)	Rule Name $R_{i}$
1	230609384	12.512	`@_1_1_`
2	230606397	12.512	`#_1_1_`
3	120713048	6.550	`a_1_1_a`
4	107031055	5.807	`e_1_1_e`
5	102221066	5.546	`o_1_1_o`
6	52025013	2.823	`y_3_2_I`
7	50492927	2.740	`i_12_11_i`
8	47433333	2.574	`r_1_1_r`
9	44069532	2.391	`t_1_1_t`
10	43352871	2.352	`n_1_1_n`
11	38974957	2.115	`m_1_1_m`
12	38920693	2.112	`w_1_1_v`
13	34811970	1.889	`P_1_1_`
14	31876002	1.730	`j_1_1_j`
15	30407964	1.650	`u_1_1_u`
16	28149906	1.527	`n_1_7_n’`
17	26241250	1.424	`p_1_1_p`
18	24158723	1.311	`z_2_23_`
19	23770872	1.290	`ł_1_1_w`
20	21373108	1.160	`k_1_2_k`
21	21286163	1.155	`l_1_1_l`
22	20689142	1.123	`d_1_1_d`
23	16884069	0.916	`s_1_2_s`
24	16046642	0.871	`i_3_2_`
25	15310336	0.831	`c_1_3_ts`
26	15178089	0.824	`b_1_1_b`
27	14222227	0.772	`h_1_1_x`
28	13675210	0.742	`p_1_4_p`
29	12853330	0.697	`c_1_1_`
30	12487271	0.678	`z_11_52_`
⋯	⋯	⋯	⋯
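
The frequency column of Table 22 is the count of each rule divided by the total number of rule applications, multiplied by 100. The sketch below recomputes it for the first four rows; the names `TOTAL`, `counts`, and `frequency_percent` are illustrative, and only the counts themselves come from the table.

```python
# Total number of rule applications, from the header of Table 22
TOTAL = 1_843_069_533

# Occurrence counts C(R_i) for the four most frequent rules in Table 22
counts = {
    "@_1_1_": 230_609_384,
    "#_1_1_": 230_606_397,
    "a_1_1_a": 120_713_048,
    "e_1_1_e": 107_031_055,
}


def frequency_percent(count: int, total: int = TOTAL) -> float:
    """Relative frequency f(R_i) * 100 of a rule."""
    return 100.0 * count / total


# Rank rules by descending count, as in the table.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for rank, (name, count) in enumerate(ranked, start=1):
    print(rank, name, f"{frequency_percent(count):.3f}")
```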

Table 23. The frequencies of letters in Polish orthography.

No. i	Number of Occurrences $C (c_{i})$ (of 1345943574)	Frequency of Occurrence $f (c_{i}) \cdot 100$ (in %)	Letter $c_{i}$
1	120646654	8.96245	`a`
2	111589675	8.28964	`i`
3	106665716	7.92385	`e`
4	102135976	7.58735	`o`
5	76298546	5.66797	`z`
6	75403146	5.60146	`n`
7	61445747	4.56461	`w`
8	61016675	4.53273	`r`
9	57237441	4.25198	`s`
10	53549355	3.97801	`c`
11	53399610	3.96688	`t`
12	52240953	3.88081	`y`
13	45634859	3.39007	`k`
14	44605635	3.31361	`d`
15	41733129	3.10022	`p`
16	38973075	2.89518	`m`
17	31851311	2.36613	`j`
18	31481368	2.33865	`u`
19	28274720	2.10044	`l`
20	23763998	1.76535	`ł`
21	19865437	1.47574	`b`
22	18405166	1.36726	`g`
23	15475239	1.14961	`ę`
24	14222227	1.05652	`h`
25	13906034	1.03303	`ą`
26	12130116	0.90111	`ż`
27	11204643	0.83236	`ó`
28	9280005	0.68938	`ś`
29	6119384	0.45459	`ć`
30	4087022	0.30361	`f`
31	2474196	0.18380	`ń`
32	826516	0.06140	`ź`
33	114138	0.00848	`v`
34	65450	0.00486	`x`
35	11434	0.00085	`q`

Table 24. The evaluation results of the fit of Yule’s equation to the ranked frequency distribution of orthographic letters (1) and grapheme-to-phoneme conversion rules (2) in Polish.

No.	Yule’s Equation	$R^{2}$	$RMSE$
1	$Y_{r} = \frac{0.066783}{r^{-0.08}} \cdot 0.91^{r}$	0.97509	2.9071 · 10⁻³
2	$Y_{r} = \frac{0.12512}{r^{0.458}} \cdot 0.954^{r}$	0.95107	1.8022 · 10⁻³
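
Both fitted curves in Table 24 have the form Y_r = a / r^b · c^r, where r is the rank. The sketch below evaluates that form with the two parameter sets from the table and defines the usual R² and RMSE measures. The exact definitions used for the reported fit statistics are not shown in this section, so the standard formulas here are an assumption, and the code does not attempt to reproduce the reported values.

```python
import math


def yule(r: int, a: float, b: float, c: float) -> float:
    """Evaluate Y_r = a / r**b * c**r for rank r (r starts at 1)."""
    return a / (r ** b) * (c ** r)


def rmse(observed: list[float], predicted: list[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed: list[float], predicted: list[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Parameters (a, b, c) read from Table 24
LETTERS = (0.066783, -0.08, 0.91)   # row 1: orthographic letters
RULES = (0.12512, 0.458, 0.954)     # row 2: conversion rules

for rank in (1, 2, 5, 10):
    print(rank, round(yule(rank, *LETTERS), 5), round(yule(rank, *RULES), 5))
```

Given a list of observed relative frequencies ordered by rank, `r_squared` and `rmse` can be applied to the observed values and the corresponding `yule` predictions.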

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kłosowski, P. A Rule-Based Grapheme-to-Phoneme Conversion System. Appl. Sci. 2022, 12, 2758. https://doi.org/10.3390/app12052758

