Article

A Rule-Based Grapheme-to-Phoneme Conversion System

Department of Telecommunication and Teleinformatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
Appl. Sci. 2022, 12(5), 2758; https://doi.org/10.3390/app12052758
Submission received: 12 January 2022 / Revised: 27 February 2022 / Accepted: 3 March 2022 / Published: 7 March 2022
(This article belongs to the Special Issue Automatic Speech Recognition)

Abstract

This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. The fundamental grapheme-to-phoneme conversion rules were developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm. The algorithm has been implemented as a software application called TransFon, which allows the user to convert any text in Polish orthography to corresponding strings of phonemes, in phonemic transcription. Using TransFon, a phonemic Polish language corpus was created out of an orthographic corpus. The phonemic language corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. The developed phonemic language corpus opens up further opportunities for research to improve automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora. The method presented here enables the creation of such corpora.

1. Introduction

Natural language processing often requires grapheme-to-phoneme (G2P) conversion of an orthographic text [1]. G2P conversion maps strings of graphemes directly to the corresponding sequences of phonetic transcription characters, and it is crucial for many applications in various areas of speech and language processing [2]. The most frequent applications of grapheme-to-phoneme conversion occur in text-to-speech systems, which require high quality phonetic transcriptions to function well [3]. Tools for converting graphemes to phonemes are also used in theoretical and applied linguistics. Such tools are useful in many areas of linguistic research (e.g., phonetics, phonology, dialectology, and language acquisition), in order to obtain preliminary phonetic transcriptions of large language corpora [4].
The main goal of research on the conversion of graphemes to phonemes is improving speech recognition for the Polish language [5,6]. In addition, research studies have been conducted on the phonetic properties of Polish phonemes [7,8], speech recognition based on such analyses [9], speaker recognition [10,11,12,13,14] and new applications of speech and language processing (e.g., speech translation) [15]. Particularly good results in speech recognition are achieved through the use of statistical language models [16,17,18,19]. However, the field of language modelling is currently shifting from statistical methods to neural networks and deep learning methods [20,21]. The development of statistical and deep learning methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora [18,19,20,21,22,23,24,25]. The main motivation for undertaking this research on automatic grapheme-to-phoneme conversion and its application was the development of effective methods of creating a phonemic language corpus for Polish, composed of phonemic transcriptions derived from an orthographic language corpus through grapheme-to-phoneme conversion.

2. Problem Formulation

The process of converting graphemes to phonemes in orthographic text involves converting a string of orthographic characters into a corresponding string of phonetic transcription characters (representing phonemes or allophones) [2]. A ‘grapheme’ is any of the units of any writing system for any language, a term coined by analogy with the ‘phoneme’ of a spoken language [26]. Graphemes include alphabetic letters, typographic ligatures, numerical digits, punctuation marks, and other individual symbols of writing systems. Since the orthographic text is the only source of pronunciation information in the process of converting graphemes into phonemes, this process must be based on appropriate formal rules, depicting the correct pronunciation of orthographic strings in a given language [27].
Phonemes are usually written in specially designed alphabets. The most widely used alphabet is the International Phonetic Alphabet (IPA) [28]. For the Polish language, as with other Slavic languages, a special transcription system, called the Slavistic Phonetic Alphabet (SPA), is most frequently used [29]. The other very commonly used phonetic alphabet is the Speech Assessment Methods Phonetic Alphabet (SAMPA) [30]. SAMPA is a machine-readable phonetic alphabet, using 7-bit printable ASCII characters, based on the IPA. Table 1 presents the phonemic inventory of Polish with examples, in the SPA, IPA, and SAMPA phonetic alphabets and corresponds to the set of phonemes used for the purpose of this study.
Automatic grapheme-to-phoneme conversion is not a new problem. The first linguist who noted it, and tried to provide a solution for a particular language (Czech), was H. Kučera [31]. Research on solutions to the automatic grapheme-to-phoneme conversion problem has also been initiated for other languages [32,33,34].
In Poland, the first linguist who wrote about the possibility of phonetic interpretation of text by machines was W. Doroszewski in 1969 [35]. The largest contributions to solving the problem of automatic grapheme-to-phoneme conversion for Polish were the publications of Maria Steffen-Batóg [36,37]. The first implementation of a grapheme-to-phoneme conversion algorithm for Polish, designed for the ODRA 1204 machine, was made in 1971 by M. Warmus [38]. Further attempts to implement automatic grapheme-to-phoneme conversion have also been reported in various publications from that time to the present [39,40,41,42,43,44,45].

3. Methodology

Systems for converting graphemes to phonemes can be implemented in many ways, often roughly classified as either dictionary-based or knowledge-based (rule-based) strategies, although there are many intermediate solutions. Data-driven (dictionary-based) solutions involve storing as much phonological knowledge as possible in a lexicon, while rule-based systems consist of rules based on inference or proposed by expert linguists. Both dictionary-based and rule-based methods for converting graphemes to phonemes have their advantages and limitations. Searching for a word in the lexicon is relatively computationally inexpensive, whereas most rule-based system algorithms consume significantly more computational resources to produce a phoneme sequence. In addition, the dictionary method requires a large phonetic dictionary and complex morphophonemic rules, while the rule-based method is unable to model full morphophonemic constraints on its own. Very often, a rule-based system will incorporate a dictionary as an exception list. This solution was used in the grapheme-to-phoneme conversion application presented here.

3.1. Conversion Rules

The first set of rules for the conversion of graphemes into phonemes for the Polish language was developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to automatic grapheme-to-phoneme conversion of texts in Polish [36,37]. Knowledge-based grapheme-to-phoneme conversion, unlike data-driven approaches, exploits rules derived by humans and/or from linguistic studies [46]. The rules for rule-based grapheme-to-phoneme conversion are typically formulated in the framework of finite state automata [47]. The primary advantage of such rule-based approaches is that they can provide complete coverage. To achieve this objective it is crucial to use the algebra of sets and, for this purpose, it is necessary to define the basic sets of graphemes employed. Examples of these basic sets for the Polish language are presented below [37]. The basic sets are Z, E, and S. Z is the set of alphabetic characters for Polish, E is a set of three graphemes that are non-alphabetic for Polish, and S is a set of special characters [37]:
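$$Z = \{\text{a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż}\}$$
$$E = \{\text{q, v, x}\}$$
$$S = \{\texttt{.},\ \texttt{?},\ \texttt{!},\ \texttt{,},\ \texttt{:},\ \texttt{;},\ \texttt{-},\ \texttt{(},\ \texttt{)},\ \texttt{\#},\ \texttt{/}\}$$
The union of sets $Z \cup S$ will be indicated by $Z + S$ and defined as follows:
$$Z + S = \{\alpha : \alpha \in Z \lor \alpha \in S\}$$
Similarly, subtraction of sets $Z \setminus S$ will be indicated by $Z - S$. Auxiliary subsets $O \subset S$ and $P \subset S$ are also defined as follows:
$$O = S - \{\texttt{/}\} = \{\texttt{.},\ \texttt{?},\ \texttt{!},\ \texttt{,},\ \texttt{:},\ \texttt{;},\ \texttt{-},\ \texttt{(},\ \texttt{)},\ \texttt{\#}\}$$
$$P = O - \{\texttt{\#}\} = \{\texttt{.},\ \texttt{?},\ \texttt{!},\ \texttt{,},\ \texttt{:},\ \texttt{;},\ \texttt{-},\ \texttt{(},\ \texttt{)}\}$$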
The character “#” is a pause character between the words and the character “/” is a sign to separate characters within the word. The set X is a collection of alphabetic characters in Polish, including special characters:
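$$X = Z + E + S$$
$$X = \{\text{a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż, q, v, x},\ \texttt{.},\ \texttt{?},\ \texttt{!},\ \texttt{,},\ \texttt{:},\ \texttt{;},\ \texttt{-},\ \texttt{(},\ \texttt{)},\ \texttt{\#},\ \texttt{/}\}$$
The set $X$ does not contain digits because all numbers are always converted to alphabetic representations. Let $\Gamma$ and $\Delta$ be arbitrary sets of words in the $X$ alphabet. The concatenation of the sets $\Gamma$ and $\Delta$, written as $\Gamma \cdot \Delta$ or $\Gamma\Delta$, is an operation defined as follows:
$$\Gamma \cdot \Delta = \Gamma\Delta = \{\alpha\beta : \alpha \in \Gamma \land \beta \in \Delta\}$$
The set $\Gamma \cdot \Delta$ contains as elements all strings of characters in the form $\alpha\beta$ for which $\alpha \in \Gamma$ and $\beta \in \Delta$. For any words $\alpha$, $\beta$, $\gamma$, $\delta$ we receive:
$$\{\alpha, \beta\} \cdot \{\gamma, \delta\} = \{\alpha\gamma,\ \alpha\delta,\ \beta\gamma,\ \beta\delta\}$$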
Here is an example to illustrate this type of relationship:
{ s , c } · A = { s , c } A = { s a , s ą , s e , s ę , s i , s o , s ó , s u , s y , c a , c ą , c e , c ę , c i , c o , c ó , c u , c y }
where A is the set of vowels in Polish:
A = { a , ą , e , ę , i , o , ó , u , y }
It is also necessary to define empty words, written as 1, which do not contain any letters of the X alphabet. In this case, for any word α , equality is fulfilled:
α · 1 = 1 · α = α
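The set operations above map directly onto Python sets of strings. The following is a minimal sketch under that assumption; the names `concat` and `EMPTY` are illustrative and are not taken from TransFon. The set S is left out so that the sketch relies only on the letters listed in Z, E and A.

```python
# Basic grapheme sets from the text, modelled as Python sets of one-letter strings.
Z = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")  # Polish alphabetic characters
E = set("qvx")                               # graphemes that are non-alphabetic for Polish
A = set("aąeęioóuy")                         # Polish vowels

EMPTY = ""  # the empty word, written as 1 in the text


def concat(gamma, delta):
    """Concatenation of two sets of words: {ab : a in gamma and b in delta}."""
    return {a + b for a in gamma for b in delta}


# The union written Z + E in the text is the ordinary set union.
letters = Z | E

# {s, c} . A yields every two-letter string "s" or "c" followed by a vowel.
sc_vowel = concat({"s", "c"}, A)
print(len(letters), len(sc_vowel))

# Concatenating with the empty word leaves a word unchanged.
print(concat({EMPTY}, {"sa"}) == {"sa"})
```

Representing words as strings makes the empty word the empty string, so the identity α · 1 = 1 · α = α holds without special handling.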
The process of automatically converting graphemes to phonemes can be described as a function F defined by the following formula:
$$F(\alpha) = \beta$$
where:
$$\alpha = \alpha_1 \ldots \alpha_k \ldots \alpha_a, \quad \alpha_k \in X \quad (1 \le k \le a)$$
$$\beta = \beta_1 \ldots \beta_k \ldots \beta_b, \quad \beta_k \in Y \quad (1 \le k \le b)$$
and where $\alpha$ is a sequence of $a$ alphabetic characters, $\beta$ is a sequence of $b$ phonemic characters, $b$ is determined by $a$, but $a$ need not be equal to $b$ at all. The set $Y$ is the entire set of phonemic characters for Polish described by the SPA alphabet:
$$Y = \{\text{i, y, e, a, o, u, i̯, u̯, r, l, m, n, ń, ŋ, f, v, s, z, š, ž, ś, ź, χ, p, b, t, d, k, g, ḱ, ǵ, c, ʒ, č, ǯ, ć, ʒ́}\}$$
The grapheme-to-phoneme conversion of correctly written orthographic texts in Polish is the transformation of sequences written in the X alphabet, to a correct form in the phonemic alphabet Y. The grapheme-to-phoneme conversion F function can be described by a set of formal grapheme-to-phoneme conversion rules defining how each sequence α representing a word in the X alphabet, can be transformed into a correct sequence β in the phonemic alphabet defined by the set Y. The conversion rules are usually numerous with varying degrees of complexity. The size and complexity of a set of grapheme-to-phoneme conversion rules depends on the number of letters in the orthographic alphabet and the degree to which each letter can be pronounced differently in various contexts.
The first set of grapheme-to-phoneme conversion rules for Polish was developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to automatic grapheme-to-phoneme conversion of texts in Polish [36,37]. The knowledge contained in these publications was the basis for developing the grapheme-to-phoneme conversion algorithm for Polish implemented by the author in the Python programming language [48]. According to Maria Steffen-Batóg, all grapheme-to-phoneme conversion rules relating to each orthographic letter can be stored in a matrix called the grapheme-to-phoneme conversion rule-table for the letter in question. The general form of a grapheme-to-phoneme conversion rule-table is shown in Table 2 [37].
Table 2 was constructed with the following assumptions [37]:
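$$\alpha_k \in X$$
$$\forall\, 1 \le i \le n : \Gamma_i \subset X$$
$$\forall\, 1 \le j \le m : \Delta_j \subset X$$
$$\forall\, 1 \le i \le n \ \land\ 1 \le j \le m : \beta_{i,j} \in Y$$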
The letter $\alpha_k$ in the upper left corner of the table indicates that it is the table of grapheme-to-phoneme conversion rules for the orthographic letter $\alpha_k$. The headings of the columns in the table ($\Delta_1$, $\Delta_2$, …, $\Delta_j$, …, $\Delta_{m-1}$, $\Delta_m$) represent various ways of phonetically transcribing the letter $\alpha_k$, depending on the letters following $\alpha_k$ in context. Similarly, the headings of the rows ($\Gamma_1$, $\Gamma_2$, …, $\Gamma_i$, …, $\Gamma_{n-1}$, $\Gamma_n$) determine various ways of phonetically transcribing the letter $\alpha_k$, depending on the letters preceding $\alpha_k$ in context. At the intersections of the rows and columns, one finds the grapheme-to-phoneme conversions for the letter $\alpha_k$ in each specific orthographic context. For example, the intersection of the i-th row and j-th column indicates the element $\beta_{i,j}$, which implies the following conversion rule:
  • For each orthographic string $s = \gamma \alpha_k \delta$, where $\gamma \in \Gamma_i$ and $\delta \in \Delta_j$, the grapheme-to-phoneme conversion process is defined as the assignment $\alpha_k \to \beta_{i,j}$.
The symbol $\beta_{i,j}$ in a specific conversion rule could represent a single phonemic letter, sequence of phonemic letters, or empty letter, written as symbol “1”. It is also worth remembering that certain combinations of orthographic characters in a particular language may be absent from the table. Therefore, there may be some orthographic contexts of the letter $\alpha_k$ for which phonetic transcription cannot be determined. In such cases, specific fields in the table are left empty. This principle often allows the simplification of grapheme-to-phoneme conversion rules tables.
One of the simplest grapheme-to-phoneme conversion rules for Polish is the rule for orthographic letter “a” presented in Table 3. According to the grapheme-to-phoneme conversion rule defined in Table 3, each orthographic letter “a”, in all orthographic contexts (before or after any sequence of letters), should be converted into phoneme [a] [37].
An example of a more complex table containing nine grapheme-to-phoneme conversion rules, is the rule-table for letter “ą” presented in Table 4 [37].
The definition of grapheme-to-phoneme conversion rules for letter “ą” requires additional sets $\Delta_1, \ldots, \Delta_9$ defined as follows [37]:
$$\Delta_1 = \{\text{l, ł, m}\}$$
$$\Delta_2 = \{\text{p, b}\}$$
$$\Delta_3 = \{\text{t, k, g}\}$$
$$\Delta_4 = \text{c}\,(X - \{\text{i, h}\})$$
$$\Delta_5 = \text{d}\,(X - \{\text{z, ź}\})$$
$$\Delta_6 = \text{dz}\,(X - \{\text{i}\})$$
$$\Delta_7 = \{\text{ć, ci, dź, dzi}\}$$
$$\Delta_8 = \{\text{ś, ź, f, w, s, z, ż, ch}\}$$
$$\Delta_9 = O$$
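To make the column sets for “ą” more concrete, the sketch below classifies the context that follows “ą” into one of the nine sets. It rests on several assumptions that the text does not state: a context is taken to belong to a set when it starts with one of that set's words, a pause character “#” is appended at the end of each word, and only the punctuation characters legible in the definition of O are included. The function name `delta_index` is hypothetical. The phonemes stored in the cells of Table 4 are not reproduced here, so the sketch stops at selecting the column.

```python
Z = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")  # Polish alphabetic characters
E = set("qvx")                               # non-alphabetic graphemes for Polish
O = set(".?!,:;()#")                         # assumption: legible members of O only
X = Z | E | O | {"/"}                        # full input alphabet


def delta_index(following):
    """Return j if `following` starts with a word of Delta_j, otherwise None."""
    first, second, third = following[:1], following[1:2], following[2:3]
    if first in {"l", "ł", "m"}:
        return 1
    if first in {"p", "b"}:
        return 2
    if first in {"t", "k", "g"}:
        return 3
    if first == "c":
        if second == "i":
            return 7                        # "ci"
        if second == "h":
            return 8                        # "ch"
        return 4 if second in X else None   # c(X - {i, h})
    if first == "ć":
        return 7
    if first == "d":
        if second == "ź":
            return 7                        # "dź"
        if second == "z":
            if third == "i":
                return 7                    # "dzi"
            return 6 if third in X else None    # dz(X - {i})
        return 5 if second in X else None   # d(X - {z, ź})
    if first in {"ś", "ź", "f", "w", "s", "z", "ż"}:
        return 8
    if first in O:
        return 9
    return None


# Context following "ą" in: mąka, ząb, są, pączek, mądry, pieniądze, bądź, wąs
for context in ["ka#", "b#", "#", "czek#", "dry#", "dze#", "dź#", "s#"]:
    print(context, delta_index(context))
```

Checking the longer words (“dzi”, “dź”, “ci”, “ch”) before the one-letter prefixes keeps the nine sets disjoint under prefix matching.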
Another example of a more complex rule table is the table for orthographic letter “u”. It consists of five parts, presented in Table 5, Table 6, Table 7 and Table 8. This table contains 88 grapheme-to-phoneme conversion rules [37].
The most complex are the tables of rules for converting graphemes to phonemes in Polish for the following orthographic letters: “z”— 174 rules, “s”— 157 rules, “r”— 91 rules, “u”— 88 rules, and “d”— rules.
As presented in the literature by Steffen-Batóg, the grapheme-to-phoneme conversion rules are clear and universal [37], but they are correct only for Polish. For any other language, it is necessary to define that language’s own grapheme-to-phoneme conversion rules.
On the basis of Steffen-Batóg’s work [36,37], the author of this paper implemented the grapheme-to-phoneme conversion rules for Polish in the Python programming language [49]. The implementation includes 975 conversion rules covering all 35 letters and special characters [48] of Polish orthography. Although many words have multiple correct pronunciations, the implemented rules produce only one (the most basic) pronunciation for any given word. The list of implemented grapheme-to-phoneme conversion rules for Polish is presented in Table 9, where: P is a set of special characters excluding $\{\texttt{\#}, \texttt{@}, \texttt{-}, \texttt{/}\}$, “#” is a pause character at the end of a word, “@” is a pause character at the beginning of a word, “/” is a delimiter character and “-” is a dash character.
The list presented in Table 9 shows that the number of grapheme-to-phoneme conversion rules implemented is 975. The number of cells (potential letter-context combinations) in grapheme-to-phoneme conversion tables is much larger, here 5162. This means that some of them (i.e., 4187 cells) are empty and have no function in the grapheme-to-phoneme conversion process. Empty cells represent letter-context combinations not allowed in Polish.

3.2. Conversion Algorithm

The grapheme-to-phoneme conversion rules are the foundation of the grapheme-to-phoneme conversion algorithm implemented here [48]. The conversion algorithm defines how the function $F(\alpha) = \beta$ works and determines the method for obtaining sequences of phonemic letters $\beta_k$ on the basis of orthographic sequences $\alpha_k$ and their contexts. The block diagram of the grapheme-to-phoneme conversion algorithm for a single orthographic letter is presented in Figure 1.
The grapheme-to-phoneme conversion algorithm, using the general form of grapheme-to-phoneme conversion rules shown in Table 2, is shown here in terms of 12 steps:
(1) Read an input orthographically represented word $\alpha$, where $\alpha = \alpha_1 \alpha_2 \ldots \alpha_k \ldots \alpha_n$ and $\alpha_1, \ldots, \alpha_n \in X$;
(2) At the beginning of the orthographically represented word $\alpha$, let $k = 1$;
(3) Read a single letter $\alpha_k$ of the input word $\alpha$;
(4) Check the context of the letter $\alpha_k$ in the word $\alpha$:
  • Read the string of letters $\gamma = \alpha_1 \ldots \alpha_{k-1}$ preceding the orthographic letter $\alpha_k$;
  • Read the string of letters $\delta = \alpha_{k+1} \alpha_{k+2} \ldots \alpha_n$ following the orthographic letter $\alpha_k$;
(5) Select the appropriate grapheme-to-phoneme conversion rule table for the orthographic letter $\alpha_k$;
(6) Select the appropriate j-th column of the table for the orthographic letter $\alpha_k$, where $\delta \in \Delta_j$;
(7) Select the appropriate i-th row of the table for the orthographic letter $\alpha_k$, where $\gamma \in \Gamma_i$;
(8) Select the appropriate grapheme-to-phoneme conversion rule for the orthographic letter $\alpha_k$, as defined by the cell in the j-th column and i-th row of the table;
(9) Apply the selected grapheme-to-phoneme conversion rule for the orthographic letter $\alpha_k$, as defined by the cell in the j-th column and i-th row of the table:
  • The cell with the table coordinates $[i, j]$ stores the phonemic letter $\beta_{i,j}$ corresponding to the orthographic letter $\alpha_k$ with the following context $\delta$ and preceding context $\gamma$;
  • The grapheme-to-phoneme conversion rule for the orthographic letter $\alpha_k$ is implemented by the assignment $\alpha_k \to \beta_k = \beta_{i,j}$;
(10) Increment $k = k + 1$ and read the next orthographic letter $\alpha_k$ of the input word $\alpha$;
(11) Go to step number 3, unless $k > n$, where $n$ is the length of the input word $\alpha$;
(12) Output the result of the grapheme-to-phoneme conversion of the orthographic word $\alpha = \alpha_1 \ldots \alpha_n$, where $\alpha_1, \ldots, \alpha_n \in X$, which is a new phonemic representation of the word, $\beta = \beta_1 \ldots \beta_n$, where $\beta_1, \ldots, \beta_n \in Y$.
These 12 steps summarize the automatic grapheme-to-phoneme conversion algorithm. This algorithm can also be presented as “pseudocode”, as shown in the list called Algorithm 1.
Algorithm 1. The algorithm for grapheme-to-phoneme conversion (implementation of function F(α) = β).
Require: α = α_1 … α_n and α_1, …, α_n ∈ X
n ← length(α)
k ← 1
while (k ≤ n) do
    γ ← α_1 … α_{k−1}
    δ ← α_{k+1} α_{k+2} … α_n
    table ← tables[α_k]    {% selection of appropriate table}
    R ← rows(table)        {% number of table rows}
    C ← cols(table)        {% number of table columns}
    for all (2 ≤ i ≤ R) do
        if (γ ∈ Γ_i) then
            I ← i          {% selection of appropriate table row}
        end if
    end for
    for all (2 ≤ j ≤ C) do
        if (δ ∈ Δ_j) then
            J ← j          {% selection of appropriate table column}
        end if
    end for
    β_k ← table[I, J]      {% reading phoneme letter from the table}
    k ← k + 1
end while
β ← β_1 … β_{k−1}          {% string of phonemes}
return β
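Algorithm 1 could be sketched in Python roughly as follows. This is not the TransFon source code. The table layout (row predicates, column predicates and a matrix of cells) and the names `TABLES`, `first_match` and `convert_word` are assumptions made for illustration. Only the table for “a” follows the text (Table 3: “a” always maps to [a]); the tables for “o” and “d” are deliberately simplified toy rules, not the published rule tables. The sketch also picks the first matching row and column, which gives the same result as Algorithm 1 whenever the contexts of a table are mutually exclusive.

```python
PAUSE = "#"  # pause character marking the end of a word, as in the text


def always(_context):
    """Context predicate that accepts every context."""
    return True


def at_word_end(following):
    """True when the letter is the last one before the pause character."""
    return following.startswith(PAUSE)


# Toy rule tables: rows are preceding contexts (Gamma_i), columns are
# following contexts (Delta_j), cells hold beta_{i,j}; None is an empty cell.
TABLES = {
    "a": {"rows": [always], "cols": [always], "cells": [["a"]]},
    "o": {"rows": [always], "cols": [always], "cells": [["o"]]},   # toy rule
    "d": {                                                         # toy rule
        "rows": [always],
        "cols": [at_word_end, lambda following: not at_word_end(following)],
        "cells": [["t", "d"]],
    },
}


def first_match(predicates, context):
    """Index of the first predicate accepting the context, or None."""
    for index, accepts in enumerate(predicates):
        if accepts(context):
            return index
    return None


def convert_word(word, tables=TABLES):
    """Convert one orthographic word letter by letter using the rule tables."""
    alpha = word + PAUSE
    beta = []
    for k, letter in enumerate(word):
        gamma, delta = alpha[:k], alpha[k + 1:]      # preceding / following context
        table = tables.get(letter)
        if table is None:
            raise LookupError(f"no rule table for letter {letter!r}")
        i = first_match(table["rows"], gamma)
        j = first_match(table["cols"], delta)
        if i is None or j is None or table["cells"][i][j] is None:
            raise LookupError(f"no rule for {letter!r} in {word!r}")
        beta.append(table["cells"][i][j])
    return "".join(beta)


print(convert_word("oda"), convert_word("od"))
```

Raising an error on a missing rule keeps gaps in the tables visible instead of silently producing a wrong transcription.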

4. Results

This grapheme-to-phoneme conversion algorithm for Polish was implemented in the Python programming language as an independent application called TransFon [48]. The developed application allows the grapheme-to-phoneme conversion of any orthographic text file in Polish.
Evaluation of automatic grapheme-to-phoneme conversion implementations is crucial. It is necessary to determine the application’s degree of success (accuracy). The evaluating and testing procedure for this automatic grapheme-to-phoneme conversion implementation consisted of the following elements:
  • Performing grapheme-to-phoneme conversion of an orthographic text corpus file containing the most frequently used words in Polish (those with different orthographic representations), obtained from resources in the National Corpus of Polish [50];
  • Validation and verification of the conversion results for those words using the Polish language dictionary that specifies the correct pronunciation of words in Polish;
  • Registering cases of incorrect conversion, errors, and other problems encountered;
  • Attempts to solve the problems.
This grapheme-to-phoneme conversion application was implemented in such a way that the conversion algorithm stopped when a grapheme-to-phoneme conversion problem occurred (e.g., when there was no rule allowing for a correct phonetic transcription). This made it easier to improve and develop the conversion application. In addition, any doubts about correct pronunciations were resolved with help of the aforementioned dictionary.
In order to evaluate the implemented grapheme-to-phoneme conversion system, the word error rate (WER), and phoneme error rate (PER) were used, as the methods usually used to evaluate the performance of automatic speech recognition systems. The WER value for this grapheme-to-phoneme conversion system was calculated as the ratio of the number of incorrectly converted words to the total number of words converted from the language corpus. The PER parameter was similarly defined and calculated for phonemes.
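The two measures can be sketched in a few lines of Python. WER follows the definition given above: the share of words whose transcription differs from the reference. The text does not spell out how phoneme errors are counted, so the sketch assumes PER is the total edit distance (substitutions, insertions and deletions) between reference and converted phoneme sequences, divided by the number of reference phonemes. All function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of phoneme symbols."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                               # deletion
                current[j - 1] + 1,                            # insertion
                previous[j - 1] + (ref_symbol != hyp_symbol),  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, hypotheses):
    """Incorrectly converted words divided by the total number of words."""
    wrong = sum(ref != hyp for ref, hyp in zip(references, hypotheses))
    return wrong / len(references)


def phoneme_error_rate(references, hypotheses):
    """Phoneme edit errors divided by the number of reference phonemes (assumed definition)."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in zip(references, hypotheses))
    return errors / sum(len(ref) for ref in references)


# Each word is a list of phoneme symbols; the second list contains one wrong phoneme.
refs = [["o", "t"], ["j", "a", "k"], ["t", "o"]]
hyps = [["o", "d"], ["j", "a", "k"], ["t", "o"]]
print(word_error_rate(refs, hyps), phoneme_error_rate(refs, hyps))
```

In this toy example one of three words and one of seven phonemes are wrong, so WER is 1/3 and PER is 1/7.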
The causes of problems and errors in the automatic grapheme-to-phoneme conversion operation were as follows:
  • Errors in the implementation of the algorithm or conversion rules;
  • Missing conversion rules in tables (rules not included in the tables) for some orthographic letters in contexts;
  • Problems with conversion of foreign words, acronyms and words that are not in the Polish language dictionary.
These problems were solved in the following ways:
  • Implementation errors in the conversion algorithm and rule tables were corrected by modifying the application source code;
  • The problem of missing conversion rules in tables was solved by adding rules to the tables. It should be noted that the added grapheme-to-phoneme conversion rules work together with the rules implemented earlier and known from the literature [36,37]. In order to complete the rule tables, new rules were added for selected letters, e.g., “i”, “n”, “d”, “z”, “ż”, “ć”, “f”, “s” in some contexts, in particular:
    - adding or correcting conversion rules for the letter “i” for correct conversion of words: “unii”, “będzie”, “sobie”, “razie”, “diabeł”, and similar contexts;
    - adding or correcting conversion rules for the letter “n” for correct conversion of words: “branży” and similar contexts;
    - adding or correcting conversion rules for the letter “d” for correct conversion when the letter “d” occurs at the end of a word, e.g.: “od”, “pod”, “przed”, “nad”, “miliard”, “wyjazd”, “Witold”, “grand”, “rajd”, “hołd”, “prawd”;
    - adding or correcting conversion rules for the letter “z” for correct conversion of words: “trzeci”, “zamierza”, “poradzę”, “bezzwrotny”, and similar contexts;
    - adding or correcting conversion rules for the letter “ż” for correct conversion of words: “tożsamość”, “manadżer”, and similar contexts;
    - adding or correcting conversion rules for the letter “ć” for correct conversion of words: “ćwiczenia”, “dziećmi”, “zadośćuczynić”, “ćwiartki”, and similar contexts;
    - adding or correcting conversion rules for the letter “f” for correct conversion of words: “Afganistan”, “Hoffman”, and similar contexts;
    - adding or correcting conversion rules for the letter “s” for correct conversion of words: “przeprosin”, “siną”, “Helsinki”, and similar contexts;
  • The problem of converting foreign words and acronyms was solved by using the developed dictionary in which the rules for converting foreign words and acronyms were defined. Thus, the rule-based grapheme-phoneme conversion was supplemented by the dictionary method;
  • Examples of grapheme-to-phoneme conversion rule-table additions implemented by the author are shown in Table 10, Table 11, Table 12, Table 13, Table 14, Table 15, Table 16 and Table 17.
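The combination of the rule-based method with an exception dictionary, as described in the list above, might look like the following sketch. The function name `make_converter`, the lower-casing of the lookup key, and the dictionary entry are assumptions for illustration; the entry is not taken from the TransFon dictionary, and the identity function merely stands in for the rule-based converter.

```python
def make_converter(rule_based, exceptions):
    """Build a converter that consults the exception dictionary before the rules."""
    def convert(word):
        key = word.lower()
        if key in exceptions:
            return exceptions[key]      # dictionary method for foreign words and acronyms
        return rule_based(key)          # rule-based method for everything else
    return convert


# Illustrative entry only; a real dictionary would hold verified transcriptions.
exceptions = {"weekend": "wikent"}

# Stand-in for the rule-based converter (returns its input unchanged).
convert = make_converter(lambda word: word, exceptions)

print(convert("Weekend"), convert("dom"))
```

Looking up the dictionary first means that an exception always overrides the rules, which matches the role of an exception list.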
A number of improvements have made it possible to increase the effectiveness of the grapheme-to-phoneme conversion process.
The results obtained were compared to the results available in the literature [51,52,53,54,55]. Table 18 presents the WER and PER values of the developed grapheme-to-phoneme conversion system before improvements. Table 19 presents the WER and PER values of the developed grapheme-to-phoneme conversion system after improvements. The summary of evaluation results for the system developed, before and after improvements, is presented in Table 20. The results presented in Table 20 indicate that conversion performance improved significantly after the improvements were implemented.
As the final result of improvements to the G2P implementation, the software application called TransFon was developed, which allows automatic grapheme-to-phoneme conversion of any orthographic text file in Polish [48]. Numerous tests of this software application on a large orthographic text corpus [56], in a variety of operating environments, confirmed the accuracy of the application’s operation.
To demonstrate the operation of the TransFon application, selected works of Polish literature were subjected to automatic grapheme-to-phoneme conversion. A sample fragment of Adam Mickiewicz’s epic poem “Pan Tadeusz” and a fragment of the output file obtained after automatic grapheme-to-phoneme conversion are presented with the printouts attached to this paper. An example fragment of an input text file containing an orthographic text from the epic poem “Pan Tadeusz” by Adam Mickiewicz is shown in Listing 1 [57].
Listing 1. A sample fragment of the input text file containing orthographic text from Adam Mickiewicz’s epic poem “Pan Tadeusz”.
  • Litwo! Ojczyzno moja! ty jesteś jak zdrowie.
  • Ile cię trzeba cenić, ten tylko się dowie,
  • Kto cię stracił. Dziś piękność twą w całej ozdobie
  • Widzę i opisuję, bo tęsknię po tobie.
  •          
  • Panno Święta, co jasnej bronisz Częstochowy
  • I w Ostrej świecisz Bramie! Ty, co gród zamkowy
  • Nowogródzki ochraniasz z jego wiernym ludem!
  • Jak mnie dziecko do zdrowia powróciłaś cudem
  • (Gdy od płaczącej matki pod Twoją opiekę
  • Ofiarowany, martwą podniosłem powiekę
  • I zaraz mogłem pieszo do Twych świątyń progu
  • Iść za wrócone życie podziękować Bogu),
  • Tak nas powrócisz cudem na Ojczyzny łono.
  • Tymczasem przenoś moję duszę utęsknioną
  • Do tych pagórków leśnych, do tych łąk zielonych,
  • Szeroko nad błękitnym Niemnem rozciągnionych;
  • Do tych pól malowanych zbożem rozmaitem,
  • Wyzłacanych pszenicą, posrebrzanych żytem;
  • Gdzie bursztynowy świerzop, gryka jak śnieg biała,
  • Gdzie panieńskim rumieńcem dzięcielina pała,
  • A wszystko przepasane, jakby wstęgą, miedzą
  • Zieloną, na niej z rzadka ciche grusze siedzą.
An example fragment of an output file containing phonemic transcriptions of Adam Mickiewicz’s epic poem “Pan Tadeusz” is shown in Listing 2.
Listing 2. A fragment of the output file containing phonemic notation as a result of automatic grapheme-to-phoneme conversion of Adam Mickiewicz’s epic poem “Pan Tadeusz”.
  • [litvo] [ojtSIzno] [moja] [tI] [jestes’] [jak] [zdrovje]
  • [ile] [ts’e] [tSeba] [tsen’its’] [ten] [tIlko] [s’e] [dovje]
  • [kto] [ts’e] [strats’iw] [dz’is’] [pjenknos’ts’] [tvoe~] [f] [tsawej] [ozdobje]
  • [vidze] [i] [opisuje] [bo] [te~skn’e] [po] [tobje]
  •          
  • [panno] [s’vjenta] [tso] [jasnej] [bron’iS] [tSe~stoxovI]
  • [i] [f] [ostrej] [s’vjets’iS] [bramje] [tI] [tso] [grut] [zamkovI]
  • [novogrutsk’i] [oxran’aS] [s] [jego] [vjernIm] [ludem]
  • [jak] [mn’e] [dz’etsko] [do] [zdrovja] [povruts’iwas’] [tsudem]
  • [gdI] [ot] [pwatSontsej] [matk’i] [pot] [tvojoe~] [opjeke]
  • [ofjarovanI] [martvoe~] [podn’oswem] [povjeke]
  • [i] [zaras] [mogwem] [pjeSo] [do] [tvIx] [s’vjontIn’] [progu]
  • [is’ts’] [za] [vrutsone] [ZIts’e] [podz’enkovats’] [bogu]
  • [tak] [nas] [povruts’iS] [tsudem] [na] [ojtSIznI] [wono]
  • [tImtSasem] [pSenos’] [moje] [duSe] [ute~skn’onoe~]
  • [do] [tIx] [pagurkuf] [les’nIx] [do] [tIx] [wonk] [z’elonIx]
  • [Seroko] [nat] [bwenk’itnIm] [n’emnem] [rosts’ongn’onIx]
  • [do] [tIx] [pul] [malovanIx] [zboZem] [rozmaitem]
  • [vIzwatsanIx] [pSen’itsoe~] [posrebZanIx] [ZItem]
  • [gdz’e] [burStInovI] [s’vjeZop] [grIka] [jak] [s’n’ek] [bjawa]
  • [gdz’e] [pan’en’sk’im] [rumjen’tsem] [dz’en’ts’elina] [pawa]
  • [a] [fSIstko] [pSepasane] [jagbI] [fstengoe~] [mjedzoe~]
  • [z’elonoe~] [na] [n’ej] [s] [Zatka] [ts’ixe] [gruSe] [s’edzoe~]
One measure of the performance of an automatic grapheme-to-phoneme conversion algorithm is word processing speed (WPS). The WPS value for the TransFon application was estimated by measuring the time required to convert the test text, Adam Mickiewicz’s epic poem “Pan Tadeusz”, containing 10,972 lines of text and 70,291 words [57]. Automatic grapheme-to-phoneme conversion of the whole text file took about 38 s. It follows that the TransFon application can process text files at an average speed of about 1849 words per second, which is about 110,940 words per minute:
$$WPS = \frac{70{,}291\ \text{words}}{38\ \text{seconds}} \approx 1849\ \frac{\text{words}}{\text{second}} = 110{,}940\ \frac{\text{words}}{\text{minute}}$$
The measurement of grapheme-to-phoneme conversion speed was performed on a PC workstation with CPU Intel Core i7 4770K @ 3.50 GHz. The computational power of this CPU is around 130,000 MIPS.
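A measurement of this kind can be reproduced with a small timing helper. The sketch below is an assumption about how such a measurement might be set up, not a description of how the reported figure was obtained; `words_per_second` is a hypothetical name, and `str.upper` only stands in for a real converter. The last lines repeat the reported arithmetic with the figures from the text.

```python
import time


def words_per_second(convert, words):
    """Time the conversion of a list of words and return the average speed."""
    start = time.perf_counter()
    for word in words:
        convert(word)
    elapsed = time.perf_counter() - start
    return len(words) / elapsed if elapsed > 0 else float("inf")


# Stand-in converter; a real measurement would call the G2P converter instead.
print(words_per_second(str.upper, ["litwo", "ojczyzno", "moja"] * 1000))

# Reported figures: 70,291 words converted in about 38 s.
reported_wps = 70291 // 38          # whole words per second
reported_wpm = reported_wps * 60    # words per minute
print(reported_wps, reported_wpm)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.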
Another application of the TransFon software is assistance in processing language corpora. Development of statistical methods for speech recognition requires access to large language corpora [58]. One of the largest corpora available for the Polish language is the National Corpus of Polish (NCP) [56]. The NCP corpus is available to the scientific community, offers great flexibility, and has great scientific value. This corpus provides the scientific community, particularly linguists and computer scientists interested in natural language processing, with materials that show the contemporary state of the Polish language and meet all the scientific requirements. Word-frequency lists obtained using the National Corpus of Polish seem particularly valuable. Automatic grapheme-to-phoneme conversion allows creating large phonemic language corpora from orthographic language corpora. A phonemic language corpus for Polish was developed by the author using automatic grapheme-to-phoneme conversion of an orthographic language corpus, in order to be able to perform statistical phonological analysis of the Polish language, and to develop phoneme-based statistical language models for Polish to improve automatic speech recognition [17,18,19].
A sample fragment of the frequency list for the phonemic corpus of Polish developed is presented in Listing 3. The phonetic frequency list file contains 19,43,458 Polish words written orthographically, their phonetic transcriptions in the SAMPA phonemic alphabet, and additionally the number of each word occurring in the NCP source corpus. The total number of word-tokens in the NCP corpus is 230,300,300. This describes a plain text version of the NCP corpus file containing 230 million words in Polish. It should also be noted that, the standard SAMPA transcriptions for Polish include several sequences of phonetic transcription labels that may cause ambiguity unless separated by spaces or other characters. To avoid this problem, the individual phonemes were separated by square brackets.
Listing 3. A sample fragment of the frequency list file of the phonemic language corpus developed for Polish.
  • 7692997 w [f]
  • 5333210 i [i]
  • 4235003 na [na]
  • 4158902 z [s]
  • 3981525 się [s’e]
  • 3601719 nie [n’e]
  • 2904114 do [do]
  • 2205896 że [Ze]
  • 2171877 to [to]
  • 1731304 o [o]
  • 1728527 jest [jest]
  • 1425793 a [a]
  • 1003027 jak [jak]
  • 983395 po [po]
  • 912660 od [ot]
  • 877522 ale [ale]
  • 847373 za [za]
  • 775006 przez [pSes]
  • 754024 co [tso]
  • 663771 dla [dla]
  • 645573 czy [tSI]
  • 610035 tym [tIm]
  • 607673 już [juS]
  • 544429 są [soe~]
  • 544343 tak [tak]
  • 534509 tylko [tIlko]
  • 500801 ma [ma]
  • 475172 może [moZe]
  • 451225 tego [tego]
  • 445705 ze [ze]
  • 426201 jego [jego]
  • 413855 oraz [oras]
  • 392445 bo [bo]
  • 390741 które [kture]
  • 386863 będzie [ben’dz’e]
  • 376600 ich [ix]
  • 367282 tej [tej]
  • 359528 było [bIwo]
  • 358750 też [teS]
  • 357026 który [kturI]
  • 349641 jeszcze [jeStSe]
  • 348423 był [bIw]
  • 346507 być [bIts’]
  • 345596 jednak [jednak]
  • 342989 przy [pSI]
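A file in the format of Listing 3 can be read with a small parser. The sketch below assumes that each line holds an occurrence count, an orthographic word, and a transcription in square brackets, as in the sample; the names `Entry`, `LINE_RE` and `parse_line` are hypothetical, and the optional bullet is tolerated only because the listing above shows one.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    count: int          # occurrences of the word in the source corpus
    word: str           # orthographic form
    transcription: str  # SAMPA transcription, without the brackets


# count, orthographic word, transcription in square brackets;
# a leading bullet is optional (assumed to be a rendering artefact).
LINE_RE = re.compile(r"^\s*(?:•\s*)?(\d+)\s+(\S+)\s+\[(.+)\]\s*$")


def parse_line(line: str) -> Optional[Entry]:
    """Parse one frequency-list line; return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return Entry(int(match.group(1)), match.group(2), match.group(3))


sample = ["7692997 w [f]", "5333210 i [i]", "4235003 na [na]"]
entries = [e for e in map(parse_line, sample) if e is not None]
total = sum(e.count for e in entries)
for e in entries:
    # Relative frequency within this small sample only.
    print(e.word, e.transcription, round(e.count / total, 3))
```

Returning `None` for unparsable lines, rather than raising, lets a caller skip headers or damaged lines in a file of nearly two million entries without aborting.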
Making the developed rule-based implementation available could be very useful to the community, both for linguistic research and for more in-depth performance comparisons between the current rule-based solution and others. It is therefore planned to make the TransFon application available once the necessary corrections to the implemented algorithm have been made and the conditions and license under which the application will be offered to other users have been determined.

5. Statistical Approach

Various grapheme-to-phoneme conversion methods can be used, but three types are most important and most often used: rule-based methods, dictionary-based methods, and statistical methods [51].
Rule-based grapheme-to-phoneme conversion methods may be complemented by dictionary-based and statistical methods. The rule-based grapheme-to-phoneme conversion system discussed in this paper was complemented by the dictionary-based method. Statistical methods can also improve grapheme-to-phoneme conversion for Polish in the future. For this purpose, the author performed statistical analysis of the grapheme-to-phoneme conversion rules in Polish.
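The way a dictionary can complement conversion rules may be sketched as follows. This is a toy illustration, not the TransFon algorithm: the rule table is context-free, whereas the rules discussed in this paper depend on context, and the choice of words placed in the `EXCEPTIONS` dictionary is purely illustrative (their transcriptions are taken from Listing 3). All names are hypothetical.

```python
# Hypothetical exception dictionary: whole words that the toy rules below
# would transcribe incorrectly (transcriptions copied from Listing 3).
EXCEPTIONS = {"w": "f", "z": "s", "od": "ot", "przez": "pSes"}

# Toy context-free rules (grapheme -> SAMPA). Real rules are context-dependent.
RULES = {
    "sz": "S", "cz": "tS", "ch": "x",
    "c": "ts", "w": "v", "ł": "w", "ó": "u", "y": "I", "ż": "Z",
}
MAX_GRAPHEME_LEN = max(len(g) for g in RULES)


def g2p(word: str) -> str:
    """Dictionary lookup first, then greedy longest-match rule application."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phonemes = []
    i = 0
    while i < len(word):
        for n in range(MAX_GRAPHEME_LEN, 0, -1):  # try digraphs before letters
            chunk = word[i:i + n]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            phonemes.append(word[i])  # no rule: copy the letter unchanged
            i += 1
    return "".join(phonemes)


for w in ["czy", "co", "które", "było", "jeszcze", "ich", "przez"]:
    print(w, g2p(w))
```

For these seven words the toy output agrees with Listing 3, but it fails wherever context matters: it gives "juZ" for "już", while Listing 3 has "juS". Cases of this kind are why context-dependent rules are needed and why a dictionary alone does not scale.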
For better analysis of the obtained results, a special naming scheme for conversion rules was adopted. The naming conventions of the grapheme-to-phoneme conversion rules were as follows:
$$R_i = L\_R\_C\_P$$
where $R_i$ is the name of the rule, $L$ is the orthographic letter to which the rule applies, $R$ is the row number in the grapheme-to-phoneme conversion table, $C$ is the column number in the grapheme-to-phoneme conversion table, and $P$ is the phonemic letter to which the rule applies. A sample list of grapheme-to-phoneme conversion rules for the letter “ą” using these naming conventions is presented in Table 21.
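The naming scheme above maps naturally onto a small record type. The following sketch assumes that the four fields are joined by underscores exactly as in the formula; the names `RuleName`, `format_rule` and `parse_rule` are hypothetical, and the row, column and phoneme values in the example are invented for illustration and are not taken from Table 21.

```python
from typing import NamedTuple


class RuleName(NamedTuple):
    letter: str   # L: orthographic letter
    row: int      # R: row in the conversion table
    column: int   # C: column in the conversion table
    phoneme: str  # P: phonemic letter


def format_rule(rule: RuleName) -> str:
    """Build the rule name L_R_C_P."""
    return f"{rule.letter}_{rule.row}_{rule.column}_{rule.phoneme}"


def parse_rule(name: str) -> RuleName:
    """Split a rule name back into its four fields.

    Splitting at most three times keeps any underscore inside the
    phoneme label intact (an assumption about possible labels).
    """
    letter, row, column, phoneme = name.split("_", 3)
    return RuleName(letter, int(row), int(column), phoneme)


# Invented example values, for illustration only.
name = format_rule(RuleName("ą", 2, 1, "on"))
print(name, parse_rule(name))
```

Encoding the table position in the name means that a frequency count keyed by rule name can be traced back to a specific cell of the conversion table without any extra lookup structure.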
The frequency of each $R_i$ grapheme-to-phoneme conversion rule was calculated during automatic grapheme-to-phoneme conversion of an orthographic text corpus file in Polish containing 1,843,069,533 letters, obtained from the National Corpus of Polish. A sample list of the most frequently used grapheme-to-phoneme conversion rules in Polish is presented in Table 22.
The frequencies of grapheme-to-phoneme conversion rules used in Polish are presented in Figure 2.
The values presented in Table 22 depend on the frequencies of letters in Polish orthography, which were also calculated for the orthographic language corpus used. The frequencies of the letters of Polish orthography are presented in Table 23 and Figure 3.
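Ranked frequency lists of this kind, whether of letters or of rule names, can be computed with a counter. The sketch below is an assumption about one reasonable way to do it, not the procedure used for Tables 22 and 23; the function names are hypothetical, and treating every alphabetic character as a letter is a simplification.

```python
from collections import Counter


def letters(text: str):
    """Yield the alphabetic characters of `text`, lower-cased."""
    return (ch for ch in text.lower() if ch.isalpha())


def ranked_frequencies(items):
    """Return (rank, item, relative frequency), most frequent first."""
    counts = Counter(items)
    total = sum(counts.values())
    return [
        (rank, item, count / total)
        for rank, (item, count) in enumerate(counts.most_common(), start=1)
    ]


# Works for any iterable of hashable items, e.g. letters or rule names.
for rank, item, freq in ranked_frequencies(letters("Ala ma kota")):
    print(rank, item, round(freq, 3))
```

Items with equal counts keep their order of first appearance, so ranks among ties are arbitrary; for a corpus of about 1.8 billion letters this rarely matters.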
The frequencies of words occurring in any language are well described by Zipf’s law [59]:
$$Z_r = \frac{a}{r^b}$$
where $Z_r$ is the frequency of the word of rank $r$, words are ranked from most frequent ($r = 1$) to least frequent ($r = n$), and $a$ and $b$ are parameters estimated from the statistical data. One usually finds that $b$ is close to 1 [59]. Zipf’s law is not restricted to language [60]. However, Zipf’s law does not describe well the distributions of orthographic letters and phonemes in representations of words. The examination of such frequencies for 95 languages, as found in the literature [60], shows that phonemic and orthographic letter frequencies are best described by an equation first developed by Yule, which also describes the distribution of DNA codons [61]. The frequencies of letters in a language’s orthography are well described by Yule’s equation [60]:
$$Y_r = \frac{a}{r^b} \cdot c^r$$
where $Y_r$ is the frequency of the letter of rank $r$, frequencies are ranked from highest ($r = 1$) to lowest ($r = n$), and $a$, $b$, and $c$ are parameters estimated from the statistical data. The fit of Yule’s equation to the ranked frequency distribution of Polish orthographic letters is presented in Figure 4.
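Both equations are easy to evaluate, and Yule’s equation becomes linear in its parameters after taking logarithms: ln Y_r = ln a − b ln r + r ln c. The sketch below uses this to estimate a, b and c by ordinary least squares in log space. This is an assumption made for illustration: the source does not say which fitting procedure was used, and a log-space fit weights small frequencies more heavily than a direct nonlinear fit would. All function names are hypothetical.

```python
import math


def zipf(r, a, b):
    """Zipf's law: a / r**b."""
    return a / r**b


def yule(r, a, b, c):
    """Yule's equation: (a / r**b) * c**r."""
    return a / r**b * c**r


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    n = 3
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, n):
            factor = aug[k][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[k][j] -= factor * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_yule(freqs):
    """Fit a, b, c to positive frequencies sorted in descending order.

    Solves the normal equations of ln Y = ln a - b*ln r + r*ln c.
    """
    rows = [(1.0, -math.log(r), float(r)) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(3)]
    ln_a, b, ln_c = solve3(xtx, xty)
    return math.exp(ln_a), b, math.exp(ln_c)


# Synthetic check: data generated from known parameters should be recovered.
data = [yule(r, 0.1, 0.3, 0.95) for r in range(1, 33)]
print([round(p, 4) for p in fit_yule(data)])
```

Setting c = 1 reduces Yule’s equation to Zipf’s law, so the fitted value of c shows how far a distribution departs from a pure power law.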
The goodness of fit of Yule’s equation, presented in Formula (34), to the ranked frequency distribution of Polish letters was measured by the coefficient of determination $R^2$:
$$R^2 = 0.97509$$
Additionally, the root mean square error (RMSE) was calculated for this case:
$$\mathrm{RMSE} = 2.9071 \cdot 10^{3}$$
The $R^2$ value indicates how well statistical data fit a statistical model. Here, the $R^2$ value is equal to 0.97509, which indicates that Yule’s equation fits the obtained statistical data on orthographic letter frequencies in Polish very well. An analogous regularity was observed for the frequency distribution of the grapheme-to-phoneme conversion rules for Polish presented in Table 22. Figure 5 presents the fit of Yule’s equation to the ranked frequency distribution of grapheme-to-phoneme conversion rules used for Polish.
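The two goodness-of-fit measures used here can be computed as follows. The sketch uses the standard textbook definitions (one minus the ratio of residual to total sum of squares, and the square root of the mean squared residual); the source does not state which variant was used, so this is an assumption, and the function names are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error of the predictions."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Tiny made-up example, not data from this paper.
observed = [1.0, 2.0, 3.0]
predicted = [1.1, 1.9, 3.2]
print(round(r_squared(observed, predicted), 4), round(rmse(observed, predicted), 4))
```

$R^2$ is dimensionless, whereas RMSE is expressed in the units of the fitted frequencies, so an RMSE value can only be interpreted together with the scale on which the frequencies are reported.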
The summary of evaluation results for the fit of Yule’s equation to the ranked frequency distribution of orthographic letters (1) and grapheme-to-phoneme conversion rules in Polish (2) is presented in Table 24.
The values of $R^2$ presented in Table 24 indicate that Yule’s equation fits the obtained statistical data for the frequencies of orthographic letters and the grapheme-to-phoneme conversion rules in Polish. On this basis, it can be concluded that the data obtained from statistical analysis of grapheme-to-phoneme conversion rules used in Polish, based on an orthographic language corpus, are reliable. The results presented, from the statistical analysis of grapheme-to-phoneme conversion rules in Polish, represent the basis for a statistical approach to Polish G2P in a future implementation.

6. Conclusions

The results of the grapheme-to-phoneme conversion research presented in this paper were compared to other results published in the literature [3,27,39,41,42,43,45,51,53,55,62,63,64,65,66,67,68,69,70,71,72,73]. On the basis of this comparison, the following conclusions can be drawn:
  • Automatic conversion of graphemes into phonemes in orthographic texts is not only a technical issue, consisting in developing appropriate algorithms for converting graphemes into phonemes, but also a serious linguistic problem. Only specialists in linguistics and phonetics of a given language are able to formulate appropriate rules for converting graphemes into phonemes for speech [51];
  • An additional complication is that automatic conversion of graphemes to phonemes is a language-specific problem with different spelling and pronunciation conventions within the same language [55,68,69,70];
  • Effective solutions for automatic grapheme-to-phoneme conversion in one language may not help solve the same problems for a different language. There is no single linguistic and technical problem of automatic grapheme-to-phoneme conversion to be solved, but rather many different problems of varying difficulty that must be solved separately for each language [51];
  • Automatic grapheme-to-phoneme conversion is widely used not only in speech synthesis, but also in speech recognition [3,53];
  • A separate, but very important problem is the evaluation of grapheme-to-phoneme conversion processes [53,71]. Evaluation and validation of grapheme-to-phoneme conversion implementations is a laborious and time-consuming process. All problems registered for the G2P implementation discussed in this paper were positively resolved;
  • The G2P implementation developed for this research is not the only one for Polish [27,39,41,43,45]; however, only one of the others is available for free use [41];
  • For comparison, the author analysed this only freely available application for the Polish language, named Transcriber [41]. The application was implemented in the C++ programming language and uses a dictionary of 5018 words and 767 defined conversion rules. By contrast, the software presented in this paper was implemented in the Python programming language with 975 conversion rules, and its dictionary is very limited and plays only a supporting role. TransFon therefore implements 208 more transcription rules, which is over 27% more. Transcriber failed to compile because its source code did not include the libraries that the programmer had used to create it. This made it impossible to evaluate the correctness of the application and seriously hindered the comparison with the software created by the author of this paper. However, analysis of the source code shows that Transcriber is also rule-based, and that its author tried to refine and improve its performance by adding new words (exceptions) to the dictionary. The author of the TransFon application, on the other hand, tried to add and supplement transcription rules, in a manner similar to that known from the literature. This is evidenced by the dictionary sizes used in the two applications;
  • The G2P system presented here could be used for Polish corpus development;
  • The G2P implementation presented here did not exploit any similar pre-existing tools [48];
  • It is worth noting that the solutions presented here for the development of language and speech corpora in Polish are not the only ones and publications on this subject are available [72,73];
  • Of particular interest are the results presented in publications by Grażyna Demenko et al. [39,62,63,64,65,66,67].
This paper presents a rule-based method for grapheme-to-phoneme conversion and an implementation for Polish. The author’s main original achievements presented in this paper are as follows:
  • Implementation in the Python programming language of the grapheme-to-phoneme conversion rules for the Polish language known from the linguistic literature [36,37,40,44];
  • Development of an algorithm for automatic conversion of graphemes into phonemes for the Polish language and its implementation in the Python programming language with numerous improvements;
  • Development of software called TransFon, which enables automatic conversion of graphemes into phonemes for any orthographic text file in the Polish language;
  • Application of the developed methods to create phoneme-based language corpora using the automatic conversion of graphemes to phonemes;
  • Statistical analysis of the occurrence frequency of particular grapheme-to-phoneme conversion rules in Polish;
  • Comparison of the results obtained with those published in the literature and discussion.
It should be noted that the research presented in this article used the basic principles and fundamental grapheme-to-phoneme conversion rules developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish [36,37]. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm and software application. The application allows the user to convert any text in Polish orthography to the corresponding strings of phonemes and to create large phonemic language corpora from orthographic language corpora.
The system for rule-based grapheme-to-phoneme conversion implemented here is complemented by dictionary-based methods, and was used to obtain statistics for the use of grapheme-to-phoneme conversion rules in Polish, potentially enabling the improvement of grapheme-to-phoneme conversion for Polish in the future.
The grapheme-to-phoneme conversion system developed and its ability to create phonemic language corpora for Polish open up further opportunities for research on improving automatic speech recognition in Polish. The plan for further research towards achieving this goal, using the phonemic language corpus developed, includes:
  • Performing a better and more detailed statistical analysis of the Polish language based on the phonemic language corpus developed [17,19];
  • Developing more efficient word-based and phoneme-based statistical language models for speech recognition applications in Polish [18,19];
  • Application of deep learning methods to language modelling and speech recognition [20,21].
The main problem in the development of phoneme-based statistical language models for Polish is the difficulty in obtaining sufficiently large phonemic language corpora. The phonemic language corpus development method presented in this paper, based on automatic grapheme-to-phoneme conversion, can significantly remedy this problem.

Funding

This work was supported by the Polish Ministry of Science and Higher Education funding for statutory activities at the Silesian University of Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Hirschberg, J.; Manning, C.D. Advances in natural language processing. Science 2015, 349, 261–266.
  2. Lee, F. Automatic grapheme-to-phone translation of english. J. Acoust. Soc. Am. 1967, 41, 1594A.
  3. Bagshaw, P. Phonemic transcription by analogy in text-to-speech synthesis: Novel word pronunciation and lexicon compression. Comput. Speech Lang. 1998, 12, 119–142.
  4. Kawaguchi, Y.; Takagaki, T.; Tomimori, N.; Tsuruga, Y. Corpus-Based Perspectives in Linguistics. In Usage-Based Linguistic Informatics; John Benjamins Publishing Company: Amsterdam, The Netherlands, 2007.
  5. Kłosowski, P. Speech Processing Application Based on Phonetics and Phonology of the Polish Language. In Communications in Computer and Information Science, Proceedings of the 17th International Conference Computer Networks, Ustron, Poland, 15–19 June 2010; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks; Springer: Berlin, Germany, 2010; Volume 79, pp. 236–244.
  6. Kłosowski, P. Improving speech processing based on phonetics and phonology of Polish language. Prz. Elektrotech. 2013, 89, 303–307.
  7. Izydorczyk, J.; Kłosowski, P. Acoustic properties of Polish vowels. Bull. Polish Acad. Sci. Tech. Sci. 1999, 47, 29–37.
  8. Izydorczyk, J.; Kłosowski, P. Base acoustic properties of Polish speech. In Proceedings of the International Conference Programable Devices and Systems PDS2001 IFAC Workshop (IFAC 2001), Gliwice, Poland, 22–23 November 2001; pp. 61–66.
  9. Kłosowski, P.; Dustor, A.; Izydorczyk, J.; Kotas, J.; Slimok, J. Speech Recognition Based on Open Source Speech Processing Software. In Communications in Computer and Information Science, Proceedings of the 21st International Science Conference on Computer Networks (CN), Brunow, Poland, 23–27 June 2014; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2014; Volume 431, pp. 308–317.
  10. Dustor, A.; Kłosowski, P. Biometric Voice Identification Based on Fuzzy Kernel Classifier. In Communications in Computer and Information Science, Proceedings of the 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, 17–21 June 2013; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2013; Volume 370, pp. 456–465.
  11. Dustor, A.; Kłosowski, P.; Izydorczyk, J. Influence of Feature Dimensionality and Model Complexity on Speaker Verification Performance. In Communications in Computer and Information Science, Proceedings of the 21st International Science Conference on Computer Networks (CN), Brunow, Poland, 23–27 June 2014; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2014; Volume 431, pp. 177–186.
  12. Dustor, A.; Kłosowski, P.; Izydorczyk, J. Speaker recognition system with good generalization properties. In Proceedings of the 2014 International Conference on Multimedia Computing and Systems (ICMCS), Marrakech, Morocco, 14–16 April 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 206–210.
  13. Dustor, A.; Kłosowski, P.; Izydorczyk, J.; Kopanski, R. Influence of Corpus Size on Speaker Verification. In Communications in Computer and Information Science, Proceedings of the 22nd International Conference on Computer Networks (CN), Brunow, Poland, 16–19 June 2015; Gaj, P., Kwiecien, A., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2015; Volume 522, pp. 242–249.
  14. Kłosowski, P.; Dustor, A.; Izydorczyk, J. Speaker verification performance evaluation based on open source speech processing software and timit speech corpus. In Communications in Computer and Information Science, Proceedings of the 22nd International Conference on Computer Networks (CN), Brunow, Poland, 16–19 June 2015; Gaj, P., Kwiecien, A., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2015; Volume 522, pp. 400–409.
  15. Kłosowski, P.; Dustor, A. Automatic Speech Segmentation for Automatic Speech Translation. In Communications in Computer and Information Science, Proceedings of the 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, 17–21 June 2013; Kwiecien, A., Gaj, P., Stera, P., Eds.; Computer Networks, CN; Springer: Berlin, Germany, 2013; Volume 370, pp. 466–475.
  16. Bellegarda, J.R.; Monz, C. State of the art in statistical methods for language and speech processing. Comput. Speech Lang. 2016, 35, 163–184.
  17. Kłosowski, P. Statistical analysis of Polish language corpus for speech recognition application. In Proceedings of the 20th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznań, Poland, 21–23 September 2016; pp. 304–309.
  18. Kłosowski, P. Polish language modelling for speech recognition application. In Proceedings of the 21th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznan, Poland, 20–22 September 2017; pp. 313–318.
  19. Kłosowski, P. Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling. EURASIP J. Audio Speech Music Process. 2017, 2017, 5.
  20. Kłosowski, P. Deep learning for natural language processing and language modelling. In Proceedings of the 22th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznan, Poland, 19–21 September 2018; pp. 223–228.
  21. Kłosowski, P. Polish language modelling based on deep learning methods and techniques. In Proceedings of the 23th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznan, Poland, 18–20 September 2019; pp. 223–228.
  22. Adda-Decker, M. Corpus for automatic speech recognition. Rev. Fr. Linguist. Appl. 2007, 12, 71–84.
  23. Drgas, S.; Dabrowski, A. Speaker recognition based on multilevel speech signal analysis on Polish corpus. Multimed. Tools Appl. 2015, 74, 4195–4211.
  24. Furui, S. Recent progress in corpus-based spontaneous speech recognition. IEICE Trans. Inf. Syst. 2005, 88, 366–375.
  25. Lecouteux, B.; Linares, G.; Oger, S. Integrating imperfect transcripts into speech recognition systems for building high-quality corpora. Comput. Speech Lang. 2012, 26, 67–89.
  26. Coulmas, F. The Blackwell’s Encyclopedia of Writing Systems; Blackwells: Oxford, UK, 1996.
  27. Przybysz, P.; Kasprzak, W. The generation of letter-to-sound rules for grapheme-to-phoneme conversion. In Proceedings of the 2013 6th International Conference on Human System Interactions (HSI), Sopot, Poland, 6–8 June 2013; pp. 292–297.
  28. International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet; A Regents Publication; Cambridge University Press: Cambridge, UK, 1999.
  29. Sussex, R.; Cubberley, P. The Slavic Languages. In Cambridge Language Surveys; Cambridge University Press: Cambridge, UK, 2006.
  30. Wells, J. SAMPA computer readable phonetic alphabet. In Handbook of Standards and Resources for Spoken Language Systems, Vol. Part IV, Section B; Gibbon, D., Moore, R., Winski, R., Eds.; Mouton de Gruyter: Berlin, Germany; New York, NY, USA, 1997.
  31. Kučera, H. Mechanical phonemic transcription and phoneme frequency count of czech. Int. J. Slav. Lingguistic Phon. 1963, 6, 36–50.
  32. Bhimani, B.; Dolby, J. Acoustic phonetic transcription of written English. In Annual Report: Automatic Indexing and Abstracting; AIP Publishing: Palo Alto, CA, USA, 1966.
  33. Pratt, B.; Silva, G. Phontrns: A Procedure which Uses a Computer for Transcribing French Text info Phonetic Symbols; Monash University: Melbourne, Australia, 1967.
  34. Ungerhruer, G.; Kästner, W. Untersuchungen zur Transformation Deutcher Schirifttexte in Entsprechende Phonemtexte mit Hilfe Elektronischer Rechenmaschinen; Forschungsbericht; Institut für Phonetic und Kommunikationsforshung der Universität Bonn: Bonn, Germany, 1966.
  35. Doroszewski, W. Speech and writing (in Polish: Mowa a pismo). Porad. Jęz. 1969, 4, 181–188.
  36. Steffen-Batóg, M. The problem of automatic phonemic transcription of written Polish. Biul. Fonogr. 1973, XIV, 75–86.
  37. Steffen-Batóg, M. Automatic Phonemic Transcription of Polish Texts (In Polish: Automatyzacja Transkrypcji Fonematycznej Tekstów Polskich); Wydawnictwo Naukowe PWN: Warszawa, Poland, 1975.
  38. Warmus, M. Software implementation for ODRA 1204 of automatic phonemic transctiption of polish texts (in Polish: Program na maszynę ODRA 1204 dla automatycznej transkrypcji fonematycznej tekstów języka polskiego). In Zastosowanie Maszyn Matematycznych do Badań nad Językiem Naturalnym; Bolc, L., Ed.; Wydawnictwo Uniwersytetu Warszawskiego: Warszawa, Poland, 1973.
  39. Demenko, G.; Wypych, M.; Baranowska, E. Implementation of grapheme-to-phoneme rules and extended SAMPA alphabet in Polish text-to-speech synthesis. Speech Lang. Technol. 2003, 7, 79–97.
  40. Jassem, W. A phonemic transcription and syllable division rule engine. In Onomastica-Copernicus Research Colloquium; University of Edinburg: Edinburgh, UK, 1996.
  41. Koržinek, D.; Brocki, Ł.; Marasek, K. Polish Grapheme-to-Phoneme Tool and Service, CLARIN-PL Digital Repository (2016). Available online: https://clarin-pl.eu/dspace/handle/11321/295 (accessed on 10 January 2022).
  42. Koržinek, D.; Marasek, K.; Brocki, Ł.; Wołk, K. Polish read speech corpus for speech tools and services. In CLARIN Common Language Resources and Technology Infrastructure, Proceedings of the Selected Papers from the CLARIN Annual Conference 2016, Aix-en-Provence, France, 26–28 October 2016; Number 136; Linköping University Electronic Press, Linköpings Universitet: Linköpings, Sweden, 2017; pp. 54–62.
  43. Skurzok, D.; Ziółko, B.; Ziółko, M. Ortfon2—Tool for orthographic to phonetic transcription. In Proceedings of the 7th Language & Technology Conference, Poznań, Poland, 27–29 November 2015.
  44. Steffen-Batóg, M.; Nowakowski, P. An algorithm for phonetic transcription of orthographic texts in Polish. In Studia Phonetica Posnaniensia; Steffen-Batóg, M., Awedyk, W., Eds.; Wydawnictwo Naukowe UAM: Poznań, Poland, 1993; Volume 3.
  45. Wypych, M. Implementation of phonenic transcription alghorithm (in Polish: Implementacja algorytmu transkrypcji fonematycznej). In Speech and Language Technology; Polskie Towarzystwo Fonetyczne: Poznań, Poland, 1999; Volume 3.
  46. Razavi, M.; Rasipuram, R.; Doss, M.M. Acoustic data-driven grapheme-to-phoneme conversion in the probabilistic lexical modeling framework. Speech Commun. 2016, 82, 1–21.
  47. Kaplan, R.M.; Kay, M. Regular models of phonological rule systems. Comput. Linguist. 1994, 20, 331–378.
  48. Kłosowski, P. Algorithm and implementation of automatic phonemic transcription for Polish. In Proceedings of the 20th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, Poznań, Poland, 21–23 September 2016; pp. 298–303.
  49. Python Software Foundation: About Python (2014). Available online: https://www.python.org/about/ (accessed on 10 January 2022).
  50. Przepiórkowski, A.; Bańko, M.; Górski, R.L.; Lewandowska-Tomaszczyk, B. The National Corpus of Polish (in Polish: Narodowy Korpus Języka Polskiego); Wydawnictwo Naukowe PWN: Warszawa, Poland, 2012.
  51. Auzina, I.; Pinnis, M.; Dargis, R. Comparison of Rule-based and Statistical Methods for Grapheme to Phoneme Modelling. In Frontiers in Artificial Intelligence and Applications, Proceedings of the Human Language Technologies—The Baltic Perspective, Baltic HLT 2014, Kaunas, Lithuania, 26–27 September 2014; Utka, A., Grigonyte, G., Kapociute Dzikiene, J., Vaicenoniene, J., Eds.; Vytautas Magnus University ViaConventus: Vilnius, Lithuania, 2014; Volume 268, pp. 57–60.
  52. Decadt, B.; Duchateau, J.; Daelemans, W.; Wambacq, P. Phoneme-to-grapheme conversion for out-of-vocabulary words in large vocabulary speech recognition. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU ’01), Madonna di Campiglio, Italy, 9–13 December 2001; pp. 413–416.
  53. Jouvet, D.; Fohr, D.; Illina, I. Evaluating grapheme-to-phoneme converters in automatic speech recognition context. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4821–4824.
  54. Kheang, S.; Katsurada, K.; Iribe, Y.; Nitta, T. Novel two-stage model for grapheme-to-phoneme conversion using new grapheme generation rules. In Proceedings of the 2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA), Bandung, Indonesia, 20–21 August 2014; pp. 97–102.
  55. Schlippe, T.; Ochs, S.; Schultz, T. Grapheme-to-phoneme model generation for indo-european languages. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4801–4804.
  56. Przepiórkowski, A.; Górski, R.L.; Lewandowska-Tomaszczyk, B.; Aziński, M. Towards the national corpus of Polish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 28–30 May 2008; European Language Resources Association (ELRA): Paris, France, 2008.
  57. Mickiewicz, A. Pan Tadeusz, Czyli, Ostatni Zajazd na Litwie: Historja Szlachecka z r. 1811 i 1812, We Dwunastu Księgach, Wierszem; Wydawnictwo Zakładu Narodowego Im. Ossolińskich: Warszawa, Poland, 1834; Available online: https://wolnelektury.pl/katalog/lektura/pan-tadeusz.html (accessed on 10 January 2022).
  58. Ney, H. Corpus-based statistical methods in speech and language processing. In Text, Speech and Language Technology, Proceedings of the 2nd European Summer School on Language and Speech Communication, Utrecht, The Netherlands, 1994; Corpus-Based Methods in Language and Speech Processing; Young, S., Bloothooft, G., Eds.; Kluwer Academic Publishers: London, UK, 1997; Volume 2, pp. 4–26.
  59. Zipf, G.K. Human behavior and the principle of least effort. J. Clin. Psychol. 1950, 6, 306.
  60. Tambovtsev, Y.; Martindale, C. Phoneme Frequencies Follow a Yule Distribution. SKASE J. Theor. Linguist. 2008, 4, 1–11.
  61. Yule, G.U. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philos. Trans. R. Soc. Lond. B Biol. Sci. 1925, 213, 21–87.
  62. Cylwik, N.; Wagner, A.; Demenko, G. The euronounce corpus of non-native polish for asr-based pronunciation tutoring system. In Proceedings of the 2nd ISCA Workshop of Speech and Language Technology in Education, Warwickshire, UK, 3–5 September 2009.
  63. Demenko, G. Korpusowe Badania Języka Mówionego; Akademicka Oficyna Wydawnicza EXIT: Warszawa, Poland, 2015; ISBN 9788378370437.
  64. Demenko, G.; Bachan, J.; Wagner, A.; Wyroślak, P. Speech corpus creation for automatic analysis of phonetic convergence. In Studientexte zur Sprachkommunikation, Proceedings of 27th Conference on Electronic Speech Signal Processing (ESSV), Leipzig, Germany, 2–4 March 2016; Oliver, J., Ed.; Hochschule für Telekommunikation Leipzig (HfTL): Leipzig, Germany, 2016; pp. 183–190.
  65. Demenko, G.; Grocholewski, S.; Klessa, K.; Rau, Z. Polish language resources for speech technology: Jurisdic lvcsr corpora. In Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 4th Language & Technology Conference, Poznań, Poland, 6–8 November 2009; Zygmunt, V., Ed.; Adam Mickiewicz University: Poznań, Poland, 2009; pp. 165–169.
  66. Demenko, G.; Klessa, K.; Szymański, M.; Breuer, S.; Hess, W. Polish unit selection speech synthesis with boss: Extensions and speech corpora. Int. J. Speech Technol. 2010, 13, 85–99.
  67. Demenko, G.; Szymański, M.; Cecko, R.; Lange, M.; Klessa, K.; Owsianny, M. Development of large vocabulary continuous speech recognition using phonetically structured speech corpus. In Proceedings of the 17th International Congress of Phonetic Sciences (ICPhS XVII), Hong Kong, China, 17–21 August 2011; pp. 568–571.
  68. Kosaner, O.; Birant, C.C.; Aktas, O. Improving Turkish language training materials: Grapheme-to-phoneme conversion for adding phonemic transcription into dictionary entries and course books. In Procedia Social and Behavioral Sciences, Proceedings of the 13th International Educational Technology Conference, Lisbon, Portugal, 30 October–1 November 2014; Isman, A., Siraj, S., Kiyici, M., Eds.; Volume 103, pp. 473–484.
  69. Lee, J.; Kim, B.; Lee, G.G. Hybrid Approach to Grapheme to Phoneme Conversion for Korean. In Proceedings of the InterSpeech 2009: 10th Annual Conference of the International Speech Communication Association 2009, Brighton, UK, 6–10 September 2009; Volume 1–5, pp. 1299–1302.
  70. de Jesus Aguiar Pontes, J.; Furui, S. Predicting the phonetic realizations of word-final consonants in context—A challenge for French grapheme-to-phoneme converters. Speech Commun. 2010, 52, 847–862.
  71. Schraagen, M.; Bloothooft, G. A qualitative evaluation of phoneme-to-phoneme technology. In Proceedings of the 12th Annual Conference of the International-Speech-Communication-Association 2011 (Interspeech 2011), Florence, Italy, 27–31 August 2011; Volume 1–5, pp. 2332–2335.
  72. Żelasko, P.; Ziółko, B.; Jadczyk, T.; Skurzok, D. AGH corpus of Polish speech. Lang. Resour. Eval. 2016, 50, 585–601.
  73. Ziółko, B.; Jadczyk, T.; Skurzok, D.; Żelasko, P.; Gałka, J.; Pędzimąż, T.; Gawlik, I.; Pałka, S. SARMATA 2.0 automatic Polish language speech recognition system. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; Interspeech: Dresden, Germany, 2015.
Figure 1. The block diagram of a grapheme-to-phoneme conversion algorithm for a single orthographic letter.
Figure 2. The frequencies of grapheme-to-phoneme conversion rules used in Polish.
Figure 3. The frequencies of letters in Polish orthography.
Figure 4. The fit of Yule’s equation to the ranked frequency distribution of the Polish orthographic letters.
Figure 5. The fit of Yule’s equation to the ranked frequency distribution of grapheme-to-phoneme conversion rules used for Polish.
Table 1. The set of Polish phonemes with examples, written in the SPA, IPA, and SAMPA phonetic alphabets, which corresponds to the set of phonemes used for the purpose of this study.
| No. | SPA | IPA | SAMPA | Example of Occurrence in Polish |
|---|---|---|---|---|
| 1 | [e] | [ɛ] | [e] | serce |
| 2 | [a] | [ɑ] | [a] | baba |
| 3 | [o] | [ɔ] | [o] | oko |
| 4 | [t] | [t] | [t] | trawa |
| 5 | [n] | [n] | [n] | noc |
| 6 | [y] | [ɨ] | [I] | syty |
| 7 | [i̯] | [j] | [j] | jajo |
| 8 | [i] | [i] | [i] | wici |
| 9 | [r] | [r] | [r] | rok |
| 10 | [s] | [s] | [s] | sok |
| 11 | [v] | [v] | [v] | wada |
| 12 | [p] | [p] | [p] | praca |
| 13 | [u] | [u] | [u] | buk |
| 14 | [m] | [m] | [m] | mama |
| 15 | [k] | [k] | [k] | kot |
| 16 | [ń] | [ɲ] | [n’] | koń |
| 17 | [d] | [d] | [d] | dudek |
| 18 | [l] | [l] | [l] | lato |
| 19 | [u̯] | [ɫ] | [w] | łysy |
| 20 | [š] | [ʃ] | [S] | szyszka |
| 21 | [f] | [f] | [f] | fala |
| 22 | [z] | [z] | [z] | koza |
| 23 | [c] | [ʦ͡] | [ts] | cacko |
| 24 | [b] | [b] | [b] | baba |
| 25 | [g] | [g] | [g] | godło |
| 26 | [ś] | [ɕ] | [s’] | siano |
| 27 | [ć] | [ʨ͡] | [ts’] | ciasto |
| 28 | [] | [ʝ] | [x] | higiena |
| 29 | [č] | [ʧ͡] | [tS] | czarny |
| 30 | [ž] | [ʒ] | [Z] | każdy |
| 31 | [] | [] | [e ] | ręka |
| 32 | [ḱ] | [c] | [k’] | kino |
| 33 | [] | [ʥ͡] | [dz’] | dziedzic |
| 34 | [ʒ] | [ʣ͡] | [dz] | nadzy |
| 35 | [ź] | [ʑ] | [z’] | ziarno |
| 36 | [ǵ] | [ɟ] | [g’] | magiczny |
| 37 | [] | [ʤ͡] | [dZ] | droże |
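Table 2. The general form of a grapheme-to-phoneme conversion rule-table [37].
| $\alpha_k$ | $\Delta_1$ | $\cdots$ | $\Delta_j$ | $\cdots$ | $\Delta_{m-1}$ | $\Delta_m$ |
|---|---|---|---|---|---|---|
| $\Gamma_1$ | $\beta_{1,1}$ | $\cdots$ | $\beta_{1,j}$ | $\cdots$ | $\beta_{1,m-1}$ | $\beta_{1,m}$ |
| $\Gamma_2$ | $\beta_{2,1}$ | $\cdots$ | $\beta_{2,j}$ | $\cdots$ | $\beta_{2,m-1}$ | $\beta_{2,m}$ |
| $\vdots$ | $\vdots$ | | $\vdots$ | | $\vdots$ | $\vdots$ |
| $\Gamma_i$ | $\beta_{i,1}$ | $\cdots$ | $\beta_{i,j}$ | $\cdots$ | $\beta_{i,m-1}$ | $\beta_{i,m}$ |
| $\vdots$ | $\vdots$ | | $\vdots$ | | $\vdots$ | $\vdots$ |
| $\Gamma_{n-1}$ | $\beta_{n-1,1}$ | $\cdots$ | $\beta_{n-1,j}$ | $\cdots$ | $\beta_{n-1,m-1}$ | $\beta_{n-1,m}$ |
| $\Gamma_n$ | $\beta_{n,1}$ | $\cdots$ | $\beta_{n,j}$ | $\cdots$ | $\beta_{n,m-1}$ | $\beta_{n,m}$ |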
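As an illustration of how a table of this general form could be consulted, the following Python sketch stores the row contexts Γ_i, the column contexts Δ_j, and the cells β_{i,j}, and returns the first non-empty cell whose row and column both match. It assumes that rows describe the text preceding the letter α_k and columns the text following it; this is an interpretation, not something stated in the table. The `RuleTable` class, the helper predicates, and the toy table for "n" are hypothetical and are not taken from TransFon.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A context is a predicate over the text before (row) or after (column) the letter.
Context = Callable[[str], bool]


def any_context(_: str) -> bool:
    """Stands for the symbol X: matches every context."""
    return True


def followed_by(prefix: str) -> Context:
    """Column context: the text after the letter starts with `prefix`."""
    return lambda right: right.startswith(prefix)


@dataclass
class RuleTable:
    letter: str                       # alpha_k, the letter being converted
    rows: List[Context]               # Gamma_1..Gamma_n (assumed: left contexts)
    cols: List[Context]               # Delta_1..Delta_m (assumed: right contexts)
    cells: List[List[Optional[str]]]  # beta_{i,j}; None marks an empty cell

    def lookup(self, left: str, right: str) -> Optional[Tuple[int, int, str]]:
        """Return (row, column, phonemes) of the first matching non-empty cell."""
        for i, row_matches in enumerate(self.rows, start=1):
            if not row_matches(left):
                continue
            for j, col_matches in enumerate(self.cols, start=1):
                cell = self.cells[i - 1][j - 1]
                if cell is not None and col_matches(right):
                    return i, j, cell
        return None


# Toy table (hypothetical): "n" before "ż" gives "ŋ", otherwise "n".
toy_n = RuleTable(
    letter="n",
    rows=[any_context],
    cols=[followed_by("ż"), any_context],
    cells=[["ŋ", "n"]],
)

print(toy_n.lookup("bra", "ży"))  # context of "n" in "branży"
print(toy_n.lookup("", "oc"))     # context of "n" in "noc"
```

Returning the 1-based row and column together with the phoneme string keeps enough information to count how often each cell of a table is used.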
Table 3. The grapheme-to-phoneme conversion rule-table for the letter “a” in Polish [37].
| a | X |
|---|---|
| X | a |
Table 4. The grapheme-to-phoneme conversion rule-table for letter “ą” in Polish [37].
| ą | $\Delta_1$ | $\Delta_2$ | $\Delta_3$ | $\Delta_4$ | $\Delta_5$ | $\Delta_6$ | $\Delta_7$ | $\Delta_8$ | $\Delta_9$ |
|---|---|---|---|---|---|---|---|---|---|
| X | o | om | on | on | on | on | | | |
Table 5. The grapheme-to-phoneme conversion rule-table for orthographic letter “u” in Polish—part 1 [37].
uX - //
X - {a, e}u
rzeu
ieu
pozau
prau
dnau
unau
enau
#nau
Table 6. The grapheme-to-phoneme conversion rule-table for letter “u” in Polish—part 2 [37].
uX
(X - {e, o, #})za
(X - p){oza, ra}
(X - {d, u, e, o, a, y, #})na
Table 7. The grapheme-to-phoneme conversion rule-table for letter “u” in Polish—part 3 [37].
ukwłczsz{s, m}Ot
(X - {z, i})euu
(X - r)zeuu
(X -{z, r, n})au
#zauuuuuu
ezau
onauu
anau
ynau>u̯>u̯
Table 8. The grapheme-to-phoneme conversion rule-table for letter “u” in Polish—part 4 [37].
u X { t , k , w , ł , c , s , m } c ( X z ) s ( X ( O + z ) ) m ( X O )
(X - {z, i})e)
(X - r)ze
(X -{z, r, n})a
#zauuuu
eza
ona
ana
yna
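Table 9. The list of implemented grapheme-to-phoneme conversion rules for Polish.
| Table Number | Orthographic Letter | No. of Rows | No. of Columns | No. of Cells | No. of Rules |
|---|---|---|---|---|---|
| 1 | z | 34 | 64 | 2079 | 174 |
| 2 | s | 17 | 62 | 976 | 157 |
| 3 | r | 11 | 22 | 210 | 91 |
| 4 | u | 15 | 15 | 196 | 88 |
| 5 | d | 12 | 46 | 495 | 83 |
| 7 | i | 14 | 13 | 156 | 80 |
| 6 | ć | 12 | 19 | 198 | 40 |
| 8 | ż | 13 | 16 | 180 | 36 |
| 9 | c | 11 | 28 | 270 | 36 |
| 10 | t | 6 | 26 | 125 | 31 |
| 11 | k | 4 | 17 | 48 | 18 |
| 12 | f | 6 | 13 | 60 | 16 |
| 13 | w | 2 | 15 | 14 | 14 |
| 14 | ź | 3 | 9 | 16 | 13 |
| 15 | ś | 6 | 10 | 45 | 12 |
| 16 | p | 2 | 12 | 11 | 11 |
| 17 | b | 2 | 11 | 10 | 10 |
| 18 | g | 2 | 10 | 9 | 9 |
| 19 | ę | 2 | 10 | 9 | 9 |
| 20 | ą | 2 | 10 | 9 | 9 |
| 21 | n | 3 | 9 | 16 | 8 |
| 22 | y | 4 | 3 | 6 | 6 |
| 23 | l | 3 | 3 | 4 | 4 |
| 24 | P | 3 | 3 | 4 | 4 |
| 25 | h | 2 | 2 | 1 | 1 |
| 26 | ł | 2 | 2 | 1 | 1 |
| 27 | ń | 2 | 2 | 1 | 1 |
| 28 | m | 2 | 2 | 1 | 1 |
| 29 | j | 2 | 2 | 1 | 1 |
| 30 | ó | 2 | 2 | 1 | 1 |
| 31 | o | 2 | 2 | 1 | 1 |
| 32 | e | 2 | 2 | 1 | 1 |
| 33 | a | 2 | 2 | 1 | 1 |
| 34 | q | 2 | 2 | 1 | 1 |
| 35 | v | 2 | 2 | 1 | 1 |
| 36 | x | 2 | 2 | 1 | 1 |
| 37 | # | 2 | 2 | 1 | 1 |
| 38 | @ | 2 | 2 | 1 | 1 |
| 39 | - | 2 | 2 | 1 | 1 |
| 40 | / | 2 | 2 | 1 | 1 |
| TOTAL | | | | 5162 | 975 |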
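The figures in Table 9 can be cross-checked with a few lines of Python. The sketch below assumes that each rule-table has one header row and one header column, so that the number of cells equals (rows − 1) · (columns − 1). That relation is inferred from the numbers and is not stated in the table. The names `TABLE_9` and `cells_from_shape` are illustrative only.

```python
# (letter, rows, columns, cells, rules), copied from Table 9.
TABLE_9 = [
    ("z", 34, 64, 2079, 174), ("s", 17, 62, 976, 157), ("r", 11, 22, 210, 91),
    ("u", 15, 15, 196, 88), ("d", 12, 46, 495, 83), ("i", 14, 13, 156, 80),
    ("ć", 12, 19, 198, 40), ("ż", 13, 16, 180, 36), ("c", 11, 28, 270, 36),
    ("t", 6, 26, 125, 31), ("k", 4, 17, 48, 18), ("f", 6, 13, 60, 16),
    ("w", 2, 15, 14, 14), ("ź", 3, 9, 16, 13), ("ś", 6, 10, 45, 12),
    ("p", 2, 12, 11, 11), ("b", 2, 11, 10, 10), ("g", 2, 10, 9, 9),
    ("ę", 2, 10, 9, 9), ("ą", 2, 10, 9, 9), ("n", 3, 9, 16, 8),
    ("y", 4, 3, 6, 6), ("l", 3, 3, 4, 4), ("P", 3, 3, 4, 4),
]
# Rows 25-40 of Table 9 are all 2 x 2 tables holding a single rule.
TABLE_9 += [(symbol, 2, 2, 1, 1) for symbol in "hłńmjóoeaqvx#@-/"]


def cells_from_shape(rows: int, cols: int) -> int:
    """Cells left after removing one header row and one header column (assumed)."""
    return (rows - 1) * (cols - 1)


mismatches = [
    letter
    for letter, rows, cols, cells, _ in TABLE_9
    if cells_from_shape(rows, cols) != cells
]
total_cells = sum(cells for _, _, _, cells, _ in TABLE_9)
total_rules = sum(rules for _, _, _, _, rules in TABLE_9)

print(mismatches, total_cells, total_rules)
```

Under this assumption every row is consistent with its stated shape, and the column sums reproduce the TOTAL row (5162 cells, 975 rules).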
Table 10. The grapheme-to-phoneme conversion rule-table addition for the letter “i” in Polish for words: “unii”, “będzie”, “sobie”, “razie”, “diabeł”, and similar contexts.
iA-{i}Aeeab
{c,n}1
dz 1
b j
z 1
d j
Table 11. The grapheme-to-phoneme conversion rule-table addition for the letter “n” in Polish for words: “branży” and similar contexts.
| n | ż |
|---|---|
| X | ŋ |
Table 12. The grapheme-to-phoneme conversion rule-table addition for the letter “d” in Polish for words: “od”, “pod”, “przed”, “nad”, “miliard”, “wyjazd”, “Witold”, “grand”, “rajd”, “hołd”, “prawd”, and similar contexts.
| d | S |
|---|---|
| A | t |
| {r, z, l, n, ł, w} | t |
Table 13. The grapheme-to-phoneme conversion rule-table addition for the letter “z” in Polish for words: “trzeci”, “zamierza”, “poradzę”, “bezzwrotny”, and similar contexts.
zXa{ą,ę}z
tr1
r 1
d 1
e s
Table 14. The grapheme-to-phoneme conversion rule-table addition for the letter “ż” in Polish for words “tożsamość”, “manadżer”, and similar contexts.
żsaer
Až
d 1
Table 15. The grapheme-to-phoneme conversion rule-table addition for the letter “ć” in Polish for words: “ćwiczenia”, “dziećmi”, “zadośćuczynić”, “ćwiartki”, and similar contexts.
ćwicmuwi · A
Xć ć
e ć
ś ć
Table 16. The grapheme-to-phoneme conversion rule-table addition for the letter “f” in Polish for words: “Afganistan”, “Hoffman”, and similar contexts.
fgaf
Xvf
Table 17. The grapheme-to-phoneme conversion rule-table addition for the letter “s” in Polish for words: “przeprosin”, “siną”, “Helsinki”, and similar contexts.
sininąin
Aś
#śś
l s
Table 18. The WER and PER values of the developed grapheme-to-phoneme conversion system before improvements.
| No. | Parameter | Value |
|---|---|---|
| 1 | Number of unique words checked | 1,943,458 |
| 2 | Number of G2P conversion errors for unique words | 33,638 |
| 3 | The WER value for unique words | 1.731% |
| 4 | Number of words in the corpus | 230,300,300 |
| 5 | Number of G2P conversion errors for words in the corpus | 3,707,890 |
| 6 | The WER value for the corpus | 1.610% |
| 7 | Number of checked unique words phonemes | 16,293,828 |
| 8 | Number of G2P conversion errors for phonemes | 34,324 |
| 9 | The PER value for unique words | 0.211% |
| 10 | Number of phonemes in the corpus | 1,263,992,460 |
| 11 | Number of G2P conversion errors for phonemes in the corpus | 3,713,206 |
| 12 | The PER value for the corpus | 0.294% |
Table 19. The WER and PER values of the developed grapheme-to-phoneme conversion system after improvements.
| No. | Parameter | Value |
|---|---|---|
| 1 | Number of unique words checked | 1,943,458 |
| 2 | Number of G2P conversion errors for unique words | 7525 |
| 3 | The WER value for unique words | 0.387% |
| 4 | Number of words in the corpus | 230,300,300 |
| 5 | Number of G2P conversion errors for words in the corpus | 69,802 |
| 6 | The WER value for the corpus | 0.030% |
| 7 | Number of checked unique words phonemes | 16,282,255 |
| 8 | Number of G2P conversion errors for phonemes | 8063 |
| 9 | The PER value for unique words | 0.050% |
| 10 | Number of phonemes in the corpus | 1,263,415,734 |
| 11 | Number of G2P conversion errors for phonemes in the corpus | 73,786 |
| 12 | The PER value for the corpus | 0.006% |
Table 20. The summary evaluation results of the developed grapheme-to-phoneme conversion system, before and after improvements.
| No. | Parameter | Value before Improvements | Value after Improvements |
|---|---|---|---|
| 1 | The WER value for unique words | 1.731% | 0.387% |
| 2 | The WER value for the corpus | 1.610% | 0.030% |
| 3 | The PER value for unique words | 0.211% | 0.050% |
| 4 | The PER value for the corpus | 0.294% | 0.006% |
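The percentages in Tables 18–20 follow from the tabulated counts if WER and PER are taken as the number of conversion errors divided by the number of items checked. The Python sketch below recomputes them under that assumption; the dictionary layout and the function name `error_rate_percent` are illustrative only.

```python
def error_rate_percent(errors: int, total: int) -> float:
    """Error rate in percent: errors divided by items checked (assumed definition)."""
    return 100.0 * errors / total


# (errors, total) pairs copied from Table 18 (before) and Table 19 (after).
COUNTS = {
    "before": {
        "WER unique words": (33_638, 1_943_458),
        "WER corpus": (3_707_890, 230_300_300),
        "PER unique words": (34_324, 16_293_828),
        "PER corpus": (3_713_206, 1_263_992_460),
    },
    "after": {
        "WER unique words": (7_525, 1_943_458),
        "WER corpus": (69_802, 230_300_300),
        "PER unique words": (8_063, 16_282_255),
        "PER corpus": (73_786, 1_263_415_734),
    },
}

# Recompute every rate and round to the three decimals used in the tables.
rates = {
    stage: {name: round(error_rate_percent(e, n), 3) for name, (e, n) in rows.items()}
    for stage, rows in COUNTS.items()
}

for stage, rows in rates.items():
    for name, value in rows.items():
        print(f"{stage:6s} {name:18s} {value:.3f}%")
```

Rounded to three decimal places, the recomputed rates agree with the values listed in Table 20.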
Table 21. A sample list of grapheme-to-phoneme conversion rules for orthographic letter “ą” [37].
| No. | Rule Name $R_i$ | Orthographic Letter L | Row Number R | Column Number C | Phoneme Letters P |
|---|---|---|---|---|---|
| 1 | ą_1_1_o | ą | 1 | 1 | [o] |
| 2 | ą_1_2_om | ą | 1 | 2 | [om] |
| 3 | ą_1_3_on | ą | 1 | 3 | [on] |
| 4 | ą_1_4_on | ą | 1 | 4 | [on] |
| 5 | ą_1_5_on | ą | 1 | 5 | [on] |
| 6 | ą_1_6_on | ą | 1 | 6 | [on] |
| 7 | ą_1_7_on’ | ą | 1 | 7 | [on’] |
| 8 | ą_1_8_o~ | ą | 1 | 8 | [o~] |
| 9 | ą_1_9_o~ | ą | 1 | 9 | [o~] |
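The rule names in Table 21 appear to be built by joining the orthographic letter, the row number, the column number, and the phoneme letters with underscores. The Python sketch below splits a name back into those four fields. The naming pattern is read off the table columns, and the `Rule` type and `parse_rule_name` function are hypothetical helpers rather than part of TransFon.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    letter: str    # L, the orthographic letter
    row: int       # R, row number in the rule-table
    column: int    # C, column number in the rule-table
    phonemes: str  # P, the phoneme letters (may be empty)


def parse_rule_name(name: str) -> Rule:
    """Split 'letter_row_column_phonemes' into its four fields."""
    # maxsplit=3 keeps any later underscore inside the phoneme field.
    letter, row, column, phonemes = name.split("_", 3)
    return Rule(letter, int(row), int(column), phonemes)


for name in ["ą_1_1_o", "ą_1_2_om", "ą_1_7_on’", "ą_1_8_o~"]:
    print(parse_rule_name(name))
```

Some names listed in Table 22, such as `@_1_1_`, end with an underscore; the sketch parses these with an empty phoneme field.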
Table 22. A sample list of the most frequently used grapheme-to-phoneme conversion rules in Polish.
| No. $i$ | Number of Occurr. of 1843069533, $C(R_i)$ | Frequency of Occurr. in %, $f(R_i) \cdot 100$ | Rule Name $R_i$ |
|---|---|---|---|
| 1 | 230609384 | 12.512 | @_1_1_ |
| 2 | 230606397 | 12.512 | #_1_1_ |
| 3 | 120713048 | 6.550 | a_1_1_a |
| 4 | 107031055 | 5.807 | e_1_1_e |
| 5 | 102221066 | 5.546 | o_1_1_o |
| 6 | 52025013 | 2.823 | y_3_2_I |
| 7 | 50492927 | 2.740 | i_12_11_i |
| 8 | 47433333 | 2.574 | r_1_1_r |
| 9 | 44069532 | 2.391 | t_1_1_t |
| 10 | 43352871 | 2.352 | n_1_1_n |
| 11 | 38974957 | 2.115 | m_1_1_m |
| 12 | 38920693 | 2.112 | w_1_1_v |
| 13 | 34811970 | 1.889 | P_1_1_ |
| 14 | 31876002 | 1.730 | j_1_1_j |
| 15 | 30407964 | 1.650 | u_1_1_u |
| 16 | 28149906 | 1.527 | n_1_7_n’ |
| 17 | 26241250 | 1.424 | p_1_1_p |
| 18 | 24158723 | 1.311 | z_2_23_ |
| 19 | 23770872 | 1.290 | ł_1_1_w |
| 20 | 21373108 | 1.160 | k_1_2_k |
| 21 | 21286163 | 1.155 | l_1_1_l |
| 22 | 20689142 | 1.123 | d_1_1_d |
| 23 | 16884069 | 0.916 | s_1_2_s |
| 24 | 16046642 | 0.871 | i_3_2_ |
| 25 | 15310336 | 0.831 | c_1_3_ts |
| 26 | 15178089 | 0.824 | b_1_1_b |
| 27 | 14222227 | 0.772 | h_1_1_x |
| 28 | 13675210 | 0.742 | p_1_4_p |
| 29 | 12853330 | 0.697 | c_1_1_ |
| 30 | 12487271 | 0.678 | z_11_52_ |
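The frequency column of Table 22 can be reproduced from the occurrence counts. The Python sketch below assumes that the frequency of a rule is its count divided by the total of 1,843,069,533 given in the table header, expressed in percent. The function name `frequency_percent` and the selection of three sample rows are illustrative.

```python
TOTAL_RULE_APPLICATIONS = 1_843_069_533  # total stated in the Table 22 header

# Occurrence counts of three sample rules, copied from Table 22.
SAMPLE_COUNTS = {
    "@_1_1_": 230_609_384,
    "a_1_1_a": 120_713_048,
    "z_11_52_": 12_487_271,
}


def frequency_percent(count: int, total: int = TOTAL_RULE_APPLICATIONS) -> float:
    """Relative frequency of a rule in percent (assumed: count / total * 100)."""
    return 100.0 * count / total


frequencies = {
    rule: round(frequency_percent(count), 3) for rule, count in SAMPLE_COUNTS.items()
}

for rule, value in frequencies.items():
    print(f"{rule:10s} {value:.3f}%")
```

For these three rows the recomputed values match the tabulated 12.512%, 6.550%, and 0.678%.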
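Table 23. The frequencies of letters in Polish orthography.
| No. $i$ | Number of Occurr. of 1345943574, $C(c_i)$ | Frequency of Occurr. in %, $f(c_i) \cdot 100$ | Letter $c_i$ |
|---|---|---|---|
| 1 | 120646654 | 8.96245 | a |
| 2 | 111589675 | 8.28964 | i |
| 3 | 106665716 | 7.92385 | e |
| 4 | 102135976 | 7.58735 | o |
| 5 | 76298546 | 5.66797 | z |
| 6 | 75403146 | 5.60146 | n |
| 7 | 61445747 | 4.56461 | w |
| 8 | 61016675 | 4.53273 | r |
| 9 | 57237441 | 4.25198 | s |
| 10 | 53549355 | 3.97801 | c |
| 11 | 53399610 | 3.96688 | t |
| 12 | 52240953 | 3.88081 | y |
| 13 | 45634859 | 3.39007 | k |
| 14 | 44605635 | 3.31361 | d |
| 15 | 41733129 | 3.10022 | p |
| 16 | 38973075 | 2.89518 | m |
| 17 | 31851311 | 2.36613 | j |
| 18 | 31481368 | 2.33865 | u |
| 19 | 28274720 | 2.10044 | l |
| 20 | 23763998 | 1.76535 | ł |
| 21 | 19865437 | 1.47574 | b |
| 22 | 18405166 | 1.36726 | g |
| 23 | 15475239 | 1.14961 | ę |
| 24 | 14222227 | 1.05652 | h |
| 25 | 13906034 | 1.03303 | ą |
| 26 | 12130116 | 0.90111 | ż |
| 27 | 11204643 | 0.83236 | ó |
| 28 | 9280005 | 0.68938 | ś |
| 29 | 6119384 | 0.45459 | ć |
| 30 | 4087022 | 0.30361 | f |
| 31 | 2474196 | 0.18380 | ń |
| 32 | 826516 | 0.06140 | ź |
| 33 | 114138 | 0.00848 | v |
| 34 | 65450 | 0.00486 | x |
| 35 | 11434 | 0.00085 | q |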
Table 24. The evaluation results of the fit of Yule’s equation to the ranked frequency distribution of orthographic letters (1) and grapheme-to-phoneme conversion rules (2) in Polish.
| No. | Yule’s Equation | $R^2$ | RMSE |
|---|---|---|---|
| 1 | $Y_r = \frac{0.066783}{r^{0.08}} \cdot 0.91^{r}$ | 0.97509 | $2.9071 \cdot 10^{-3}$ |
| 2 | $Y_r = \frac{0.12512}{r^{0.458}} \cdot 0.954^{r}$ | 0.95107 | $1.8022 \cdot 10^{-3}$ |
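To show how the fitted curve for the conversion rules can be evaluated, the Python sketch below computes Yule's equation at a few ranks and prints the result next to the relative frequencies listed in Table 22. It assumes the form Y_r = A / r^B · C^r with A = 0.12512, B = 0.458, and C = 0.954 from row 2 of Table 24. The placement of r^B in the denominator is an assumption about the typeset formula, and the function name `yule` is illustrative.

```python
def yule(rank: int, a: float, b: float, c: float) -> float:
    """Yule's equation in the assumed form Y_r = a / rank**b * c**rank."""
    return a / rank**b * c**rank


# Constants for the rule distribution, row 2 of Table 24.
A, B, C = 0.12512, 0.458, 0.954

# Observed relative frequencies (Table 22 percentages divided by 100).
OBSERVED = {1: 0.12512, 3: 0.06550, 10: 0.02352, 30: 0.00678}

for rank, observed in OBSERVED.items():
    fitted = yule(rank, A, B, C)
    print(f"rank {rank:2d}: fitted {fitted:.5f}, observed {observed:.5f}")
```

With these constants the assumed form stays within a few thousandths of the tabulated frequencies at the sampled ranks, which is the order of magnitude of the reported RMSE.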
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Kłosowski, P. A Rule-Based Grapheme-to-Phoneme Conversion System. Appl. Sci. 2022, 12, 2758. https://doi.org/10.3390/app12052758

AMA Style

Kłosowski P. A Rule-Based Grapheme-to-Phoneme Conversion System. Applied Sciences. 2022; 12(5):2758. https://doi.org/10.3390/app12052758

Chicago/Turabian Style

Kłosowski, Piotr. 2022. "A Rule-Based Grapheme-to-Phoneme Conversion System" Applied Sciences 12, no. 5: 2758. https://doi.org/10.3390/app12052758

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.