A Rule-Based Grapheme-to-Phoneme Conversion System

This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. The fundamental grapheme-to-phoneme conversion rules were developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm. The algorithm has been implemented as a software application called TransFon, which allows the user to convert any text in Polish orthography into the corresponding strings of phonemes in phonemic transcription. Using TransFon, a phonemic Polish language corpus was created from an orthographic corpus. The phonemic language corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. The developed phonemic language corpus opens up further opportunities for research to improve automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora. The method presented here enables the creation of such corpora.


Introduction
Natural language processing often requires grapheme-to-phoneme (G2P) conversion of an orthographic text [1]. G2P converts strings of graphemes into corresponding sequences of phonetic transcription characters, directly from orthographic representations, and it is crucial for many applications in various areas of speech and language processing [2]. The most frequent applications of grapheme-to-phoneme conversion occur in text-to-speech systems, which require high-quality phonetic transcriptions to function well [3]. Tools for converting graphemes to phonemes are also used in theoretical and applied linguistics. Such tools are useful in many areas of linguistic research (e.g., phonetics, phonology, dialectology, and language acquisition) for obtaining preliminary phonetic transcriptions of large language corpora [4].
The main goal of research on the conversion of graphemes to phonemes is improving speech recognition for the Polish language [5,6]. In addition, research studies have been conducted on the phonetic properties of Polish phonemes [7,8], speech recognition based on such analyses [9], speaker recognition [10][11][12][13][14] and new applications of speech and language processing (e.g., speech translation) [15]. Particularly good results in speech recognition are achieved through the use of statistical language models [16][17][18][19]. However, the field of language modelling is currently shifting from statistical methods to neural networks and deep learning methods [20,21]. The development of statistical and deep learning methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora [18][19][20][21][22][23][24][25]. The main motivation for undertaking this research on automatic grapheme-to-phoneme conversion and its application was the development of effective methods of creating a phonemic language corpus for Polish, composed of phonemic transcriptions derived from an orthographic language corpus through grapheme-to-phoneme conversion.

Problem Formulation
The process of converting graphemes to phonemes in orthographic text involves converting a string of orthographic characters into a corresponding string of phonetic transcription characters (representing phonemes or allophones) [2]. A 'grapheme' is any of the units of any writing system for any language, a term coined by analogy with the 'phoneme' of a spoken language [26]. Graphemes include alphabetic letters, typographic ligatures, numerical digits, punctuation marks, and other individual symbols of writing systems. Since the orthographic text is the only source of pronunciation information in the process of converting graphemes into phonemes, this process must be based on appropriate formal rules, depicting the correct pronunciation of orthographic strings in a given language [27].
Phonemes are usually written in specially designed alphabets. The most widely used alphabet is the International Phonetic Alphabet (IPA) [28]. For the Polish language, as with other Slavic languages, a special transcription system, called the Slavistic Phonetic Alphabet (SPA), is most frequently used [29]. The other very commonly used phonetic alphabet is the Speech Assessment Methods Phonetic Alphabet (SAMPA) [30]. SAMPA is a machine-readable phonetic alphabet, using 7-bit printable ASCII characters, based on the IPA. Table 1 presents the phonemic inventory of Polish with examples, in the SPA, IPA, and SAMPA phonetic alphabets and corresponds to the set of phonemes used for the purpose of this study.
Automatic grapheme-to-phoneme conversion is not a new problem. The first linguist who noted it, and tried to provide a solution for a particular language (Czech), was H. Kučera [31]. Research on solutions to the automatic grapheme-to-phoneme conversion problem has also been initiated for other languages [32][33][34]. In Poland, the first linguist who wrote about the possibility of phonetic interpretation of text by machines was W. Doroszewski in 1969 [35]. The largest contributions to solving the problem of automatic grapheme-to-phoneme conversion for Polish were the publications of Maria Steffen-Batóg [36,37]. The first implementation of a grapheme-to-phoneme conversion algorithm for Polish, designed for the ODRA 1204 machine, was made in 1971 by M. Warmus [38]. Further attempts to implement automatic grapheme-to-phoneme conversion have also been reported in various publications from that time to the present [39][40][41][42][43][44][45].

Methodology
Systems for converting graphemes to phonemes can be implemented in many ways, often roughly classified as either dictionary-based or knowledge-based (rule-based) strategies, although there are many intermediate solutions. Data-driven (dictionary-based) solutions involve storing as much phonological knowledge as possible in a lexicon, while rule-based systems consist of rules based on inference or proposed by expert linguists. Both dictionary-based and rule-based methods for converting graphemes to phonemes have their advantages and limitations. Searching for a word in the lexicon is relatively computationally inexpensive, whereas most rule-based system algorithms consume significantly more computational resources to produce a phoneme sequence. In addition, the dictionary method requires a large phonetic dictionary and complex morphophonemic rules, while the rule-based method is unable to model full morphophonemic constraints on its own. Very often, a rule-based system will incorporate a dictionary as an exception list. This solution was used in the grapheme-to-phoneme conversion application presented here.

Conversion Rules
The first set of rules for the conversion of graphemes into phonemes for the Polish language was developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to automatic grapheme-to-phoneme conversion of texts in Polish [36,37]. Knowledge-based grapheme-to-phoneme conversion, unlike data-driven approaches, exploits rules derived by humans and/or from linguistic studies [46]. The rules for rule-based grapheme-to-phoneme conversion are typically formulated in the framework of finite state automata [47]. The primary advantage of such rule-based approaches is that they can provide complete coverage. To achieve this objective it is crucial to use the algebra of sets and, for this purpose, it is necessary to define the basic sets of graphemes employed. Examples of these basic sets for the Polish language are presented below [37].
The basic sets are Z, E, and S. Z is the set of alphabetic characters for Polish, E is a set of three graphemes that are non-alphabetic for Polish, and S is a set of special characters [37]:
Z = {a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż} (1)
The union of sets Z ∪ S is denoted by Z + S and, similarly, the set difference Z \ S is denoted by Z − S. Auxiliary subsets O ⊂ S and P ⊂ S are also defined [37]. The character "#" is a pause character between words and the character "/" separates characters within a word. The set X is the collection of alphabetic characters in Polish, including the special characters:
X = {a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż, q, v, x, ., ?, !, ",", :, ;, -, (, ), #, /}
The set X does not contain digits because all numbers are always converted to alphabetic representations. Let Γ and ∆ be arbitrary sets of words in the X alphabet.
The concatenation of the sets Γ and ∆, written as Γ·∆ or Γ∆, is an operation defined as follows:
$$\Gamma \cdot \Delta = \{\alpha\beta : \alpha \in \Gamma,\ \beta \in \Delta\}$$
That is, the set Γ·∆ contains as elements all strings of characters of the form αβ for which α ∈ Γ and β ∈ ∆. For any words α, β, γ, δ we obtain:
$$\{\alpha, \beta\} \cdot \{\gamma, \delta\} = \{\alpha\gamma,\ \alpha\delta,\ \beta\gamma,\ \beta\delta\}$$
Here is an example to illustrate this type of relationship:
{s, c}·A = {s, c}A = {sa, są, se, sę, si, so, só, su, sy, ca, cą, ce, cę, ci, co, có, cu, cy} (11)
where A is the set of vowels in Polish:
A = {a, ą, e, ę, i, o, ó, u, y}
It is also necessary to define the empty word, written as 1, which does not contain any letters of the X alphabet. For any word α, the following equality holds:
$$\alpha \cdot 1 = 1 \cdot \alpha = \alpha$$
The process of automatically converting graphemes to phonemes can be described as a function F defined by the following formula:
$$F(\alpha) = \beta$$
where:
$$\alpha = \alpha_1 \alpha_2 \ldots \alpha_a,\quad \alpha_1, \ldots, \alpha_a \in X, \qquad \beta = \beta_1 \beta_2 \ldots \beta_b,\quad \beta_1, \ldots, \beta_b \in Y$$
and where α is a sequence of a alphabetic characters and β is a sequence of b phonemic characters; b is determined by α, but a need not be equal to b. The set Y is the entire set of phonemic characters for Polish described by the SPA alphabet:
Y = {…, r, l, m, n, ń, N, f, v, s, z, š, ž, ś, ź, X, p, b, t, d, k, g, ḱ, ǵ, c, Z, č, Ž, ć, Ź}
The grapheme-to-phoneme conversion of correctly written orthographic texts in Polish is the transformation of sequences written in the X alphabet into a correct form in the phonemic alphabet Y. The grapheme-to-phoneme conversion function F can be described by a set of formal grapheme-to-phoneme conversion rules defining how each sequence α, representing a word in the X alphabet, can be transformed into a correct sequence β in the phonemic alphabet defined by the set Y. The conversion rules are usually numerous, with varying degrees of complexity. The size and complexity of a set of grapheme-to-phoneme conversion rules depend on the number of letters in the orthographic alphabet and the degree to which each letter can be pronounced differently in various contexts.
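The concatenation of sets of words and the role of the empty word can be checked with a few lines of Python. This is only an illustrative sketch: the function name `concat` and the use of the empty string for the empty word 1 are assumptions made here, not part of the original notation.

```python
def concat(gamma, delta):
    """Concatenation of two sets of words: every string ab with a in gamma, b in delta."""
    return {a + b for a in gamma for b in delta}


# The set A of Polish vowels, read off from example (11).
A = {"a", "ą", "e", "ę", "i", "o", "ó", "u", "y"}

# The empty word (written as 1 in the text) is modelled as the empty string.
EMPTY_WORD = ""

pairs = concat({"s", "c"}, A)
print(sorted(pairs))
```

Because the empty string is neutral for string concatenation, `concat({"sa"}, {EMPTY_WORD})` returns `{"sa"}`, which mirrors the equality stated for the empty word.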
The knowledge contained in the monographs of Maria Steffen-Batóg [36,37] was the basis for developing the grapheme-to-phoneme conversion algorithm for Polish implemented by the author in the Python programming language [48]. According to Maria Steffen-Batóg, all grapheme-to-phoneme conversion rules relating to each orthographic letter can be stored in a matrix called the grapheme-to-phoneme conversion rule-table for the letter in question. The general form of a grapheme-to-phoneme conversion rule-table is shown in Table 2 [37].
Table 2 was constructed with the following assumptions [37]: The letter α_k in the upper left corner of the table indicates that it is the table of grapheme-to-phoneme conversion rules for the orthographic letter α_k. The headings of the columns in the table (∆_1, ∆_2, ..., ∆_j, ..., ∆_{m−1}, ∆_m) represent various ways of phonetically transcribing the letter α_k, depending on the letters following α_k in context. Similarly, the headings of the rows (Γ_1, Γ_2, ..., Γ_i, ..., Γ_{n−1}, Γ_n) determine various ways of phonetically transcribing the letter α_k, depending on the letters preceding α_k in context. At the intersections of the rows and columns, one finds the grapheme-to-phoneme conversions for the letter α_k in each specific orthographic context. For example, the intersection of the i-th row and j-th column indicates the element β_{i,j}, which implies the following conversion rule:
• For each orthographic string s = γ α_k δ, where γ ∈ Γ_i and δ ∈ ∆_j, the grapheme-to-phoneme conversion process is defined as the assignment α_k → β_{i,j}.
The symbol β_{i,j} in a specific conversion rule can represent a single phonemic letter, a sequence of phonemic letters, or the empty word, written as the symbol "1".
It is also worth remembering that certain combinations of orthographic characters in a particular language may be absent from the table. Therefore, there may be some orthographic contexts of the letter α_k for which the phonetic transcription cannot be determined. In such cases, specific fields in the table are left empty. This principle often allows the simplification of grapheme-to-phoneme conversion rule-tables.
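One possible in-memory representation of such a rule-table is sketched below. It is a minimal illustration under stated assumptions, not the TransFon data structure: the names `RULE_TABLES`, `matches_left`, `matches_right` and `lookup` are hypothetical, row headings are interpreted as possible endings of the preceding string, column headings as possible beginnings of the following string, and the table for "d" is a toy example invented to show an empty cell.

```python
X = None  # stands for the context set X: any string of letters matches


def matches_left(gamma, heading):
    """True if the preceding string gamma ends with a member of the row heading."""
    return heading is X or any(gamma.endswith(h) for h in heading)


def matches_right(delta, heading):
    """True if the following string delta starts with a member of the column heading."""
    return heading is X or any(delta.startswith(h) for h in heading)


RULE_TABLES = {
    # Rule-table for "a": the phoneme a in every context.
    "a": {"rows": [X], "cols": [X], "cells": [["a"]]},
    # Hypothetical toy table: "d" before the pause character "#" becomes t;
    # the cell for the second column is deliberately left empty (None).
    "d": {"rows": [X], "cols": [{"#"}, {"/"}], "cells": [["t", None]]},
}


def lookup(letter, gamma, delta):
    """Return the cell beta_{i,j} for a letter in context, or None if no rule applies."""
    table = RULE_TABLES[letter]
    for i, row in enumerate(table["rows"]):
        for j, col in enumerate(table["cols"]):
            if matches_left(gamma, row) and matches_right(delta, col):
                return table["cells"][i][j]
    return None


print(lookup("a", "k", "t#"))
print(lookup("d", "o", "#"))
```

Returning `None` both for an empty cell and for a context that matches no heading keeps the caller simple: in either case there is no rule to apply.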
One of the simplest grapheme-to-phoneme conversion rules for Polish is the rule for the orthographic letter "a" presented in Table 3. According to the grapheme-to-phoneme conversion rule defined in Table 3, each orthographic letter "a", in all orthographic contexts (before or after any sequence of letters), should be converted into the phoneme [a] [37].
Table 3 (row heading: preceding context; column heading: following context):
a | X
X | a
An example of a more complex table, containing nine grapheme-to-phoneme conversion rules, is the rule-table for the letter "ą" presented in Table 4 [37].
The most complex are the tables of rules for converting graphemes to phonemes in Polish for the following orthographic letters: "z"-174 rules, "s"-157 rules, "r"-91 rules, "u"-88 rules, and "d"-rules.
As presented in the literature by Steffen-Batóg, the grapheme-to-phoneme conversion rules are clear and universal [37], but they are correct only for Polish. For any other language, it is necessary to define that language's own grapheme-to-phoneme conversion rules.
On the basis of Steffen-Batóg's work [36,37], the author of this paper implemented the grapheme-to-phoneme conversion rules for Polish in the Python programming language [49]. The implementation includes 975 conversion rules covering all 35 letters and special characters [48] of Polish orthography. Although many words have multiple correct pronunciations, the rules implemented here produce only one (the most basic) pronunciation for any given word. The list of implemented grapheme-to-phoneme conversion rules for Polish is presented in Table 9, where: P is a set of special characters excluding {#, @, −, /}, "#" is a pause character at the end of a word, "@" is a pause character at the beginning of a word, "/" is a delimiter character, and "-" is a dash character.
Table 9. The list of implemented grapheme-to-phoneme conversion rules for Polish.
Table 9 shows that 975 grapheme-to-phoneme conversion rules were implemented. The number of cells (potential letter-context combinations) in the grapheme-to-phoneme conversion tables is much larger: 5162. This means that 4187 cells are empty and have no function in the grapheme-to-phoneme conversion process. Empty cells represent letter-context combinations not allowed in Polish.

Conversion Algorithm
The grapheme-to-phoneme conversion rules are the foundation of the grapheme-to-phoneme conversion algorithm implemented here [48]. The conversion algorithm defines how the function F(α) = β works and determines the method for obtaining the phonemic letters β_k on the basis of the orthographic letters α_k and their contexts. The block diagram of the grapheme-to-phoneme conversion algorithm for a single orthographic letter is presented in Figure 1.
The grapheme-to-phoneme conversion algorithm, using the general form of grapheme-to-phoneme conversion rules shown in Table 2, is presented here in 12 steps:
(1) Read an input orthographically represented word α, where α = α_1 α_2 … α_k … α_n and α_1, …, α_n ∈ X;
(2) At the beginning of the orthographically represented word α, let k = 1;
(3) Read a single letter α_k of the input word α;
(4) Check the context of the letter α_k in the word α:
• Read the string of letters γ = α_1 … α_{k−1} preceding the orthographic letter α_k;
• Read the string of letters δ = α_{k+1} α_{k+2} … α_n following the orthographic letter α_k;
(5) Select the appropriate grapheme-to-phoneme conversion rule-table for the orthographic letter α_k;
(6) Select the appropriate j-th column of the table for the orthographic letter α_k, where δ ∈ ∆_j;
(7) Select the appropriate i-th row of the table for the orthographic letter α_k, where γ ∈ Γ_i;
(8) Select the appropriate grapheme-to-phoneme conversion rule for the orthographic letter α_k, as defined by the cell in the j-th column and i-th row of the table;
(9) Apply the selected grapheme-to-phoneme conversion rule for the orthographic letter α_k by the assignment α_k → β_k = β_{i,j};
(10) Increment k = k + 1 and read the next orthographic letter α_k of the input word α;
(11) Go to step 3, unless k > n, where n is the length of the input word α;
(12) Output the result of the grapheme-to-phoneme conversion of the orthographic word α = α_1 … α_n, where α_1, …, α_n ∈ X: the phonemic representation β = β_1 … β_n, where β_1, …, β_n ∈ Y.
These 12 steps summarize the automatic grapheme-to-phoneme conversion algorithm. This algorithm can also be presented as pseudocode, as shown in Algorithm 1.
Algorithm 1. The algorithm for grapheme-to-phoneme conversion (implementation of function F(α) = β).
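The 12 steps can be sketched in Python as follows. This is a minimal illustration and not the TransFon source code: the names `TABLES`, `first_match`, `convert_word` and `NoRuleError` are hypothetical, context sets are modelled as predicate functions, and the rule-tables are toy tables far smaller than the 975 rules described above. Stopping with an error when no rule is found is a design choice made for this sketch.

```python
class NoRuleError(LookupError):
    """Raised when no rule covers a letter in its context (conversion stops)."""


def anything(_context):
    return True           # context set X: every string matches


def at_word_end(delta):
    return delta == ""    # nothing follows the letter


def inside_word(delta):
    return delta != ""    # at least one letter follows


# Toy rule-tables (assumptions): "rows" test the left context gamma,
# "cols" test the right context delta, cells[i][j] is the phonemic output.
TABLES = {
    "a": {"rows": [anything], "cols": [anything], "cells": [["a"]]},
    "o": {"rows": [anything], "cols": [anything], "cells": [["o"]]},
    "k": {"rows": [anything], "cols": [anything], "cells": [["k"]]},
    "d": {"rows": [anything], "cols": [at_word_end, inside_word],
          "cells": [["t", "d"]]},
}


def first_match(predicates, context):
    """Index of the first context set containing `context`, or None."""
    for index, predicate in enumerate(predicates):
        if predicate(context):
            return index
    return None


def convert_word(word):
    phonemes = []
    for k, letter in enumerate(word):              # steps 2-3 and 10-11
        gamma, delta = word[:k], word[k + 1:]      # step 4
        table = TABLES.get(letter)                 # step 5
        if table is None:
            raise NoRuleError(f"no rule-table for {letter!r}")
        j = first_match(table["cols"], delta)      # step 6
        i = first_match(table["rows"], gamma)      # step 7
        cell = None if i is None or j is None else table["cells"][i][j]  # step 8
        if cell is None:
            raise NoRuleError(f"no rule for {letter!r} in {word!r}")
        phonemes.append(cell)                      # step 9
    return phonemes                                # step 12


print(convert_word("kod"))
print(convert_word("oda"))
```

With these toy tables the word-final "d" in "kod" is converted to t, while the word-internal "d" in "oda" stays d.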

Results
This grapheme-to-phoneme conversion algorithm for Polish was implemented in the Python programming language as an independent application called TransFon [48]. The developed application allows the grapheme-to-phoneme conversion of any orthographic text file in Polish.
Evaluation of automatic grapheme-to-phoneme conversion implementations is crucial. It is necessary to determine the application's degree of success (accuracy). The evaluation and testing procedure for this automatic grapheme-to-phoneme conversion implementation consisted of the following elements:
• Performing grapheme-to-phoneme conversion of an orthographic text corpus file containing the most frequently used words in Polish (those with different orthographic representations), obtained from resources in the National Corpus of Polish [50];
• Validation and verification of the conversion results for those words using a Polish language dictionary that specifies the correct pronunciation of words in Polish;
• Registering cases of incorrect conversion, errors, and other problems encountered;
• Attempts to solve the problems.
This grapheme-to-phoneme conversion application was implemented in such a way that the conversion algorithm stopped when a grapheme-to-phoneme conversion problem occurred (e.g., when there was no rule allowing for a correct phonetic transcription). This made it easier to improve and develop the conversion application. In addition, any doubts about correct pronunciations were resolved with the help of the aforementioned dictionary.
In order to evaluate the implemented grapheme-to-phoneme conversion system, the word error rate (WER) and phoneme error rate (PER) were used. These measures are commonly used to evaluate the performance of automatic speech recognition systems. The WER value for this grapheme-to-phoneme conversion system was calculated as the ratio of the number of incorrectly converted words to the total number of words converted from the language corpus. The PER parameter was similarly defined and calculated for phonemes.
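The two measures can be computed as in the sketch below. The WER follows the ratio described above; for the PER, the text only says that it is defined similarly, so counting phoneme errors with the Levenshtein (edit) distance is an assumption made here. The function names and the two-word example are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rates(references, hypotheses):
    """Return (WER, PER) for parallel lists of phoneme sequences, one per word."""
    wrong_words = sum(r != h for r, h in zip(references, hypotheses))
    wer = wrong_words / len(references)
    phoneme_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    per = phoneme_errors / sum(len(r) for r in references)
    return wer, per


# Hypothetical example: the first word has one wrong phoneme, the second is correct.
refs = [["k", "o", "t"], ["o", "d", "a"]]
hyps = [["k", "o", "d"], ["o", "d", "a"]]
wer, per = error_rates(refs, hyps)
print(f"WER = {wer:.3f}, PER = {per:.3f}")
```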
The causes of problems and errors in the automatic grapheme-to-phoneme conversion operation were as follows:
• Errors in the implementation of the algorithm or conversion rules;
• Missing conversion rules in tables (rules not included in the tables) for some orthographic letters in some contexts;
• Problems with the conversion of foreign words, acronyms, and words that are not in the Polish language dictionary.
These problems were solved in the following ways:
• Implementation errors in the conversion algorithm and rule tables were corrected by modifying the application source code;
• The problem of missing conversion rules in tables was solved by adding rules to the tables. The added grapheme-to-phoneme conversion rules work together with the rules implemented earlier and known from the literature [36,37]. In order to complete the rule tables, conversion rules were added or corrected for selected letters (e.g., "i", "n", "d", "z", "ż", "ć", "f", "s") in some contexts, in particular:
- the letter "i", for correct conversion of the words "unii", "będzie", "sobie", "razie", "diabeł", and similar contexts;
- the letter "n", for correct conversion of the word "branży" and similar contexts;
- the letter "d", for correct conversion when the letter "d" occurs at the end of a word, e.g., "od", "pod", "przed", "nad", "miliard", "wyjazd", "Witold", "grand", "rajd", "hołd", "prawd";
- the letter "z", for correct conversion of the words "trzeci", "zamierza", "poradzę", "bezzwrotny", and similar contexts;
- the letter "ż", for correct conversion of the words "tożsamość", "manadżer", and similar contexts;
- the letter "ć", for correct conversion of the words "ćwiczenia", "dziećmi", "zadośćuczynić", "ćwiartki", and similar contexts;
- the letter "f", for correct conversion of the words "Afganistan", "Hoffman", and similar contexts;
- the letter "s", for correct conversion of the words "przeprosin", "siną", "Helsinki", and similar contexts;
• The problem of converting foreign words and acronyms was solved by using a purpose-built dictionary in which the conversions of foreign words and acronyms were defined. Thus, the rule-based grapheme-to-phoneme conversion was supplemented by the dictionary method.
Examples of grapheme-to-phoneme conversion rule-table additions implemented by the author are shown in Tables 10-17.
Table 10. The grapheme-to-phoneme conversion rule-table addition for the letter "i" in Polish for words: "unii", "będzie", "sobie", "razie", "diabeł", and similar contexts.
Table 11. The grapheme-to-phoneme conversion rule-table addition for the letter "n" in Polish for words: "branży" and similar contexts.
n | ż
X | N
Table 12. The grapheme-to-phoneme conversion rule-table addition for the letter "d" in Polish for words: "od", "pod", "przed", "nad", "miliard", "wyjazd", "Witold", "grand", "rajd", "hołd", "prawd", and similar contexts.
d | S
A | t
{r, z, l, n, ł, w} | t
Table 13. The grapheme-to-phoneme conversion rule-table addition for the letter "z" in Polish for words: "trzeci", "zamierza", "poradzę", "bezzwrotny", and similar contexts.
z X a {ą,ę} z tr 1 r 1 d 1 e s
Table 14. The grapheme-to-phoneme conversion rule-table addition for the letter "ż" in Polish for words: "tożsamość", "manadżer", and similar contexts.
z sa er A ž d 1
Table 15. The grapheme-to-phoneme conversion rule-table addition for the letter "ć" in Polish for words: "ćwiczenia", "dziećmi", "zadośćuczynić", "ćwiartki", and similar contexts.
c wic m u wi · A Xćć eć sć
Table 16. The grapheme-to-phoneme conversion rule-table addition for the letter "f" in Polish for words: "Afganistan", "Hoffman", and similar contexts.
The results obtained were compared to the results available in the literature [51][52][53][54][55]. Table 18 presents the WER and PER values of the developed grapheme-to-phoneme conversion system before improvements. Table 19 presents the WER and PER values of the developed grapheme-to-phoneme conversion system after improvements. The summary of evaluation results for the system developed, before and after improvements, is presented in Table 20. The results presented in Table 20 indicate that conversion performance improved significantly after the implementation of the improvements.
The PER value for the corpus was 0.294% before the improvements (Table 18) and 0.006% after the improvements (Table 19); both values are summarized in Table 20.
As the final result of improvements to the G2P implementation, the software application called TransFon was developed, which allows automatic grapheme-to-phoneme conversion of any orthographic text file in Polish [48]. Numerous tests of this software application on a large orthographic text corpus [56], in a variety of operating environments, confirmed the accuracy of the application's operation.
To demonstrate the operation of the TransFon application, selected works of Polish literature were subjected to automatic grapheme-to-phoneme conversion. A sample fragment of Adam Mickiewicz's epic poem "Pan Tadeusz" and a fragment of the output file obtained after automatic grapheme-to-phoneme conversion are presented in the printouts attached to this paper. An example fragment of an input text file containing an orthographic text from the epic poem "Pan Tadeusz" by Adam Mickiewicz is shown in Listing 1 [57].
Listing 1. A sample fragment of the input text file containing orthographic text from Adam Mickiewicz's epic poem "Pan Tadeusz".

An example fragment of an output file containing phonemic transcriptions of Adam Mickiewicz's epic poem "Pan Tadeusz" is shown in Listing 2.

Listing 2. A fragment of the output file containing phonemic notation as a result of automatic grapheme-to-phoneme conversion of Adam Mickiewicz's epic poem "Pan Tadeusz".
One measure of the performance of an algorithm for automatic conversion of graphemes into phonemes is word processing speed (WPS). The WPS value for the TransFon application was estimated by measuring the time required to convert the test text, Adam Mickiewicz's epic poem "Pan Tadeusz", containing 10,972 lines of text and 70,291 words [57]. Automatic grapheme-to-phoneme conversion of the whole text file took about 38 s. From these values it follows that the TransFon application can process text files at an average speed of 1849 words per second, which is 110,940 words per minute:
$$\mathrm{WPS} = \frac{70{,}291\ \text{words}}{38\ \text{s}} \approx 1849\ \text{words/s} = 110{,}940\ \text{words/min}$$
The measurement of grapheme-to-phoneme conversion speed was performed on a PC workstation with an Intel Core i7 4770K CPU @ 3.50 GHz. The computational power of this CPU is around 130,000 MIPS.
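The reported figures can be reproduced, and the same quantity measured for any converter, with a short Python sketch. The helper `words_per_second` and its `convert` argument are hypothetical and stand in for whatever conversion function is being timed; rounding the speed down to a whole number is an assumption that matches the value of 1849 given above.

```python
import time


def words_per_second(convert, lines):
    """Time `convert` over every whitespace-separated word and return words/s."""
    n_words = 0
    start = time.perf_counter()
    for line in lines:
        for word in line.split():
            convert(word)
            n_words += 1
    elapsed = time.perf_counter() - start
    return n_words / elapsed if elapsed > 0 else float("inf")


# Arithmetic behind the reported values: 70,291 words in about 38 s.
reported_wps = 70_291 / 38                 # roughly 1849.8 words per second
reported_wpm = int(reported_wps) * 60      # whole words per second, times 60

print(int(reported_wps), reported_wpm)
```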
The TransFon software can also assist in the processing of language corpora. The development of statistical methods for speech recognition requires access to large language corpora [58]. One of the largest corpora available for the Polish language is the National Corpus of Polish (NCP) [56]. The NCP corpus is available to the scientific community, offers great flexibility, and has great scientific value. This corpus provides the scientific community, particularly linguists and computer scientists interested in natural language processing, with materials that show the contemporary state of the Polish language and meet all scientific requirements. Word-frequency lists obtained using the National Corpus of Polish seem particularly valuable. Automatic grapheme-to-phoneme conversion allows large phonemic language corpora to be created from orthographic language corpora. A phonemic language corpus for Polish was developed by the author using automatic grapheme-to-phoneme conversion of an orthographic language corpus, in order to perform statistical phonological analysis of the Polish language and to develop phoneme-based statistical language models for Polish to improve automatic speech recognition [17][18][19].
A sample fragment of the frequency list for the developed phonemic corpus of Polish is presented in Listing 3. The phonetic frequency list file contains 1,943,458 Polish words written orthographically, their phonetic transcriptions in the SAMPA phonemic alphabet, and the number of occurrences of each word in the NCP source corpus. The total number of word-tokens in the NCP corpus is 230,300,300. This describes a plain text version of the NCP corpus file containing 230 million words in Polish. It should also be noted that the standard SAMPA transcriptions for Polish include several sequences of phonetic transcription labels that may cause ambiguity unless separated by spaces or other characters. To avoid this problem, the individual phonemes were enclosed in square brackets.
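The ambiguity and its bracketed resolution can be illustrated with a short Python sketch. The helper names `bracket` and `unbracket` are hypothetical, and the example relies on the Polish SAMPA labels `t`, `S`, `tS` and `I`; the exact layout of the frequency list file is not reproduced here.

```python
import re


def bracket(phonemes):
    """Join SAMPA labels, enclosing each phoneme in square brackets."""
    return "".join(f"[{p}]" for p in phonemes)


def unbracket(text):
    """Recover the list of phoneme labels from a bracketed transcription."""
    return re.findall(r"\[([^\]]+)\]", text)


# "trzy" (t + S + I) and "czy" (tS + I) collapse to the same string "tSI"
# when the labels are written without separators.
trzy = ["t", "S", "I"]
czy = ["tS", "I"]

print("".join(trzy), "".join(czy))
print(bracket(trzy), bracket(czy))
```

Without separators both words are written as `tSI`; with brackets they remain distinct, and `unbracket` restores the original label sequences.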
Making the developed rule-based implementation available could be very useful to the community, both for linguistic research and for more in-depth performance comparisons between the current rule-based solution and others. Therefore, it is planned to make the TransFon application available after making the necessary corrections to the implemented algorithm and determining the conditions and license under which the application will be available to other users.

Statistical Approach
Various grapheme-to-phoneme conversion methods can be used, but three types are most important and most often used: rule-based methods, dictionary-based methods, and statistical methods [51].
Rule-based grapheme-to-phoneme conversion methods may be complemented by dictionary-based and statistical methods. The rule-based grapheme-to-phoneme conversion system discussed in this paper was complemented by the dictionary-based method. Statistical methods can also improve grapheme-to-phoneme conversion for Polish in the future. For this purpose, the author performed statistical analysis of the grapheme-to-phoneme conversion rules in Polish.
For better analysis of the obtained results, a special naming scheme for conversion rules was adopted. Each rule name R_i is built from the following elements: L is the orthographic letter to which the rule applies, R is the row number in the grapheme-to-phoneme conversion table, C is the column number in the grapheme-to-phoneme conversion table, and P is the phonemic letter produced by the rule. A sample list of grapheme-to-phoneme conversion rules for the letter "ą" using these naming conventions is presented in Table 21.
The frequency of each R i grapheme-to-phoneme conversion rule was calculated during automatic grapheme-to-phoneme conversion of an orthographic text corpus file in Polish containing 1,843,069,533 letters, obtained from the National Corpus of Polish. A sample list of the most frequently used grapheme-to-phoneme conversion rules in Polish is presented in Table 22.
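Counting how often each rule fires during conversion can be sketched with a counter keyed by the rule's components. This is an illustration under stated assumptions: representing a rule as the tuple (letter, row, column, phoneme) is a hypothetical stand-in for the rule names R_i, and the short log of applied rules is invented for the example.

```python
from collections import Counter


def rule_frequencies(applied_rules):
    """Rank rules by how often they were applied; return (rule, count, share)."""
    counts = Counter(applied_rules)
    total = sum(counts.values())
    return [(rule, n, n / total) for rule, n in counts.most_common()]


# Hypothetical log of rules applied while converting a text:
# each entry is (orthographic letter, row, column, phoneme).
log = [("a", 1, 1, "a"), ("d", 1, 1, "t"), ("a", 1, 1, "a")]
ranked = rule_frequencies(log)

for rule, count, share in ranked:
    print(rule, count, f"{share:.1%}")
```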
The frequencies of grapheme-to-phoneme conversion rules used in Polish are presented in Figure 2.
Table 21. A sample list of grapheme-to-phoneme conversion rules for orthographic letter "ą" [37].
No. | Rule Name | Orthographic Letter | Row | Column | Phoneme
The values presented in Table 22 depend on the frequencies of letters in Polish orthography, which were also calculated for the orthographic language corpus used. The frequencies of the letters of Polish orthography are presented in Table 23 and Figure 3.
The frequencies of words occurring in any language are well described by Zipf's law [59]:
$$Z_r = \frac{a}{r^{b}}$$
where Z_r is the frequency of the word of rank r, the rank r orders the words from most frequent (r = 1) to least frequent (r = n), and a and b are parameters estimated from the statistical data. One usually finds that b is close to 1 [59]. Zipf's law is not restricted to language [60]. However, Zipf's law does not describe well the distributions of orthographic letters and phonemes in representations of words. The examination of such frequencies for 95 languages, as found in the literature [60], shows that phonemic and orthographic letter frequencies are best described by an equation first developed by Yule, which also describes the distribution of DNA codons [61].
The frequencies of letters in a language's orthography are well described by Yule's equation [60]:

$$Y_r = \frac{a}{r^{b}}\, c^{r} \qquad (34)$$

where $Y_r$ is the frequency of the letter with rank $r$, frequencies are ranked from highest ($r = 1$) to lowest ($r = n$), and $a$, $b$, and $c$ are parameters estimated from the statistical data. The fit of Yule's equation to the ranked frequency distribution of Polish orthographic letters is presented in Figure 4.

The fit of Yule's equation, presented in Formula (34), to the ranked frequency distribution of Polish letters was measured by the coefficient of determination $R^2$. Additionally, the root mean square error (RMSE) was calculated for this case. The $R^2$ value indicates how well statistical data fit a statistical model. The $R^2$ value here is equal to 0.97509, which indicates that Yule's equation fits the obtained statistical data on orthographic letter frequencies in Polish very well.

An analogous regularity was observed for the frequency distribution of the grapheme-to-phoneme conversion rules for Polish presented in Table 22. Figure 5 presents the fit of Yule's equation to the ranked frequency distribution of grapheme-to-phoneme conversion rules used for Polish.
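The fitting and evaluation steps can be sketched as follows. The example assumes the form $Y_r = a\, r^{-b} c^{r}$ for Yule's equation and the standard definitions $R^2 = 1 - SS_{res}/SS_{tot}$ and $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_r (Y_r - \hat{Y}_r)^2}$. The log-linear least-squares fit is a simplification chosen for this illustration, not necessarily the procedure used in the paper, and the data are synthetic.

```python
import math


def solve(m: list[list[float]], v: list[float]) -> list[float]:
    """Solve m @ x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def yule(r: int, a: float, b: float, c: float) -> float:
    """Yule's equation in the assumed form Y_r = a / r**b * c**r."""
    return a / r ** b * c ** r


def fit_yule(freqs: list[float]) -> tuple[float, float, float]:
    """Estimate (a, b, c) from ln Y_r = ln a - b ln r + r ln c by least squares."""
    rows = [[1.0, math.log(r), float(r)] for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    # Normal equations (X^T X) beta = X^T y with beta = (ln a, -b, ln c).
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(3)]
    ln_a, neg_b, ln_c = solve(xtx, xty)
    return math.exp(ln_a), -neg_b, math.exp(ln_c)


def r_squared(observed: list[float], predicted: list[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed: list[float], predicted: list[float]) -> float:
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Synthetic ranked frequencies (a = 0.1, b = 0.5, c = 0.95), not corpus data.
ranks = range(1, 33)
synthetic = [yule(r, 0.1, 0.5, 0.95) for r in ranks]
a_hat, b_hat, c_hat = fit_yule(synthetic)
fitted = [yule(r, a_hat, b_hat, c_hat) for r in ranks]
fit_r2 = r_squared(synthetic, fitted)
fit_rmse = rmse(synthetic, fitted)
```

Because the synthetic data are generated from the model itself, the fit here is essentially exact; on real letter or rule frequencies, $R^2$ and RMSE quantify how far the observed distribution departs from the fitted curve.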
The summary of evaluation results for the fit of Yule's equation to the ranked frequency distribution of orthographic letters (1) and grapheme-to-phoneme conversion rules in Polish (2) is presented in Table 24.

The values of $R^2$ presented in Table 24 indicate that Yule's equation fits the obtained statistical data for the frequencies of orthographic letters and the grapheme-to-phoneme conversion rules in Polish. On this basis, it can be concluded that the data obtained from statistical analysis of grapheme-to-phoneme conversion rules used in Polish, based on an orthographic language corpus, are reliable.
The results presented, from the statistical analysis of grapheme-to-phoneme conversion rules in Polish, represent the basis for a statistical approach to Polish G2P in a future implementation.

Conclusions
The results of the grapheme-to-phoneme conversion research presented in this paper were compared to other results published in the literature [3,27,39,41–43,45,51,53,55,62–73]. On the basis of this comparison, the following conclusions can be drawn:
• Automatic conversion of graphemes into phonemes in orthographic texts is not only a technical issue of developing appropriate conversion algorithms, but also a serious linguistic problem. Only specialists in the linguistics and phonetics of a given language are able to formulate appropriate rules for converting graphemes into phonemes for speech [51];
• An additional complication is that automatic conversion of graphemes to phonemes is a language-specific problem, with different spelling and pronunciation conventions within the same language [55,68–70];
• Effective solutions for automatic grapheme-to-phoneme conversion in one language may not help solve the same problems for a different language. There is not a single linguistic and technical problem of automatic grapheme-to-phoneme conversion to be solved, but many different problems with different levels of difficulty that should be solved for each language separately [51];
• Automatic grapheme-to-phoneme conversion is widely used not only in speech synthesis, but also in speech recognition [3,53];
• A separate, but very important, problem is the evaluation of grapheme-to-phoneme conversion processes [53,71]. Evaluation and validation of grapheme-to-phoneme conversion implementations is a laborious and time-consuming process. All problems registered for the G2P implementation discussed in this paper were positively resolved;
• The G2P implementation developed for this research is not the only one for Polish [27,39,41,43,45]; however, only one of the others is available for free use [41];
• For comparison, the author analysed the only available application for Polish, named Transcriber [41]. Transcriber was implemented in the C++ programming language and uses a dictionary of 5018 words and 767 defined conversion rules. By contrast, the software presented in this paper was implemented in the Python programming language with 975 conversion rules, and its dictionary is very limited and plays only a supporting role. TransFon therefore implements 208 more transcription rules, which is over 27% more (975 − 767 = 208; 208/767 ≈ 27.1%). Transcriber failed to compile because the libraries used by its programmer were not included with the source code. This made it impossible to evaluate the correctness of the application and seriously hindered the comparison with the software created by the author of this paper. However, analysis of the source code shows that Transcriber is also rule-based, but that its author tried to refine and improve its performance by adding new words (exceptions) to the dictionary. The author of TransFon, on the other hand, added and supplemented transcription rules in a manner similar to that known from the literature. This is evidenced by the dictionary sizes used in the two applications;
• The G2P system presented here could be used for Polish corpus development;
• The G2P implementation presented here did not exploit any similar pre-existing tools [48];
• It is worth noting that the solutions presented here for the development of language and speech corpora in Polish are not the only ones, and publications on this subject are available [72,73];
• Of particular interest are the results presented in publications by Grażyna Demenko et al. [39,62–67].
This paper presents a rule-based method for grapheme-to-phoneme conversion and an implementation for Polish. The author's main original contributions presented in this paper are as follows:
• Implementation, in the Python programming language, of the rules for converting graphemes into phonemes for Polish known from the linguistic literature [36,37,40,44];
• Development of an algorithm for automatic conversion of graphemes into phonemes for Polish and its implementation in the Python programming language with numerous improvements;
• Development of a software application called TransFon, which enables automatic conversion of graphemes into phonemes for any orthographic text file in Polish;
• Application of the developed methods to create phoneme-based language corpora using the automatic conversion of graphemes to phonemes;
• Statistical analysis of the occurrence frequency of particular grapheme-to-phoneme conversion rules in Polish;
• Comparison of the results obtained with those published in the literature, and discussion.
It should be noted that the research presented in this article used the basic principles and fundamental grapheme-to-phoneme conversion rules developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish [36,37]. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm and software application. The application makes it possible to convert any text in Polish orthography to the corresponding strings of phonemes and to create large phonemic language corpora from orthographic language corpora.
The system for rule-based grapheme-to-phoneme conversion implemented here is complemented by dictionary-based methods, and was used to obtain statistics for the use of grapheme-to-phoneme conversion rules in Polish, potentially enabling the improvement of grapheme-to-phoneme conversion for Polish in the future.
The grapheme-to-phoneme conversion system developed and its ability to create phonemic language corpora for Polish open up further opportunities for research on improving automatic speech recognition in Polish. The plan for further research towards achieving this goal, using the phonemic language corpus developed, includes:
• Performing a better and more detailed statistical analysis of the Polish language based on the phonemic language corpus developed [17,19];
• Developing more efficient word-based and phoneme-based statistical language models for speech recognition applications in Polish [18,19];
• Applying deep learning methods to language modelling and speech recognition [20,21].
The main problem in the development of phoneme-based statistical language models for Polish is the difficulty in obtaining sufficiently large phonemic language corpora. The phonemic language corpus development method presented in this paper, based on automatic grapheme-to-phoneme conversion, can significantly remedy this problem.
Funding: This work was supported by the Polish Ministry of Science and Higher Education funding for statutory activities at the Silesian University of Technology.