MAKEDONKA: Applied Deep Learning Model for Text-to-Speech Synthesis in Macedonian Language

This paper presents MAKEDONKA, the first open-source Macedonian-language speech synthesizer based on the Deep Learning approach. The paper provides an overview of the numerous attempts to achieve human-like reproducible speech, which have unfortunately proven unsuccessful due to the invisibility of the work and the lack of integration examples with real software tools. Recent advances in Machine Learning, particularly Deep Learning-based methodologies, provide novel methods for feature engineering that allow for smooth transitions in the synthesized speech, making it sound natural and human-like. This paper presents a methodology for end-to-end speech synthesis based on a fully-convolutional sequence-to-sequence acoustic model with a position-augmented attention mechanism, Deep Voice 3. Our model directly synthesizes Macedonian speech from characters. We created a dataset that contains approximately 20 h of speech from a native Macedonian female speaker, and we use it to train the text-to-speech (TTS) model. The achieved MOS score of 3.93 makes our model appropriate for application in any kind of software that needs a text-to-speech service in the Macedonian language. Our TTS platform is publicly available for use and ready for integration.


Introduction
Text-to-speech (TTS) is a challenging problem that has attracted researchers' attention over the past 30 years. When considering the literature, it can be easily perceived that the level of success is, however, determined by the evolution of methods that surpass the human abilities of analysing and extracting features. Instead, those methods provide a higher-level abstraction of characteristics that fills the gap in producing human-like speech, something that was missing in the early efforts. Generally, a traditional TTS system is comprised of two consecutive main parts: text analysis and speech synthesis. The text analysis part includes text processing, i.e., morphological analysis and rule-based syntax analysis. The speech synthesis part encompasses methods that are able to recreate speech by using a defined set of rules. The main problem with the traditional models for achieving text-to-speech translation is that they rely on one-layer nonlinear transformation units [1]. Those are Hidden Markov Models (HMMs) [2][3][4][5], Maximum Entropy-based methods [6][7][8], and concatenation-based synthesis methods [9][10][11].
It can be concluded that the existing attempts to create a Macedonian TTS system mostly rely on concatenation-based synthesis methods, i.e., the synthesis is based on selecting an appropriate speech unit from a prerecorded and labeled speech corpus and adjusting the prosody of the concatenation unit according to the target context, as presented in Section 2. The quality of this approach is limited by the pitch period (i.e., the starting point) and by the maintenance of smooth transitions [1].
In this paper, the focus is on creating the first Macedonian TTS system by following parametric speech synthesis based on a Deep Learning approach. Parametric speech synthesis has been shown to significantly improve the naturalness of the generated speech [12][13][14]. These approaches include three phases: text analysis, parameter prediction, and speech synthesis. All three steps have both traditional and DNN-based forms. The advantage of the DNN-based methods over the traditional ones is that they not only transform complex linguistic features into acoustic feature parameters, but also model the correlation between frames by using long short-term context information, which improves the quality of the synthesized speech [1]. Compared to the HMM-based approach, in which the linguistic features are first mapped into probability densities, the DNN-based method directly maps the linguistic features to acoustic features.
The text analysis itself can be done at a few different levels: phoneme level, syllable level, word level (part-of-speech (POS) analysis), phrase level, and sentence level.
One of the earliest attempts to use the leverage of DNNs in the field of TTS is presented in [15]. The authors try to eliminate the decision tree clustering weakness (division of the model space considering one feature at a time) in statistical parametric speech synthesis by combining vector-space representations of linguistic context and DNNs.
One of the most famous and widely used models is Tacotron, a seq2seq model [16], and its successor Tacotron 2 [17,18]. Given (text, audio) pairs, Tacotron is able to directly synthesize speech from characters. Tacotron has been an inspiration for some newer approaches that successfully overcome the speed problem while producing speech of close or equal quality. Such are FastSpeech [19] and FastSpeech2 [20], the newest TTS models published at the time of writing. However, both of these non-autoregressive models depend on additional models for duration prediction, which means that a Tacotron 2 model has to be trained to some satisfactory accuracy and then used for the duration prediction needed for training FastSpeech2. Another option is using the Montreal Forced Aligner (MFA) in order to obtain the alignments between the utterances and the phoneme sequences.
Some valuable work done in the DNN-based speech synthesis field is presented in [21][22][23]. The papers present the consecutive progress of DNN-based synthesizers; more precisely, Deep Voice 1 and 2 [21,22] retain the traditional structure of TTS. They aim at separating grapheme-to-phoneme conversion, duration and frequency prediction, and waveform synthesis. Deep Voice 3 [23] is a more complex character-to-spectrogram architecture, and it employs an attention-based sequence-to-sequence model. This approach proved to be the most valuable for the successful creation of the Macedonian TTS system. Details can be found in Section 3.3.
Researchers are concerned not only with the accuracy of the models, but also with the time performance and resource demands of the methods. The newest proposals deal with the quality-training time trade-off. Therefore, in [24], the authors describe a TTS technique that is based on deep convolutional neural networks (CNN), but without any recurrent units, instead stressing the importance of training the attention module. This means that they propose a solution that reduces the need for very powerful computing resources and also reduces the time required to run the experiments, which usually takes several weeks. In the experiment, they performed a comparative analysis of whether the CNN-based algorithm produces acceptable speech quality. They trained the model in 15 h by using an ordinary gaming PC equipped with two GPUs, and the results proved the efficiency of the network in terms of synthesized speech quality.
Some of the newest work is presented in [25], where the authors propose a new solution for parallel wave generation with WaveNet. In comparison to the previous work (Parallel WaveNet [26]), in which end-to-end speech synthesis actually refers to text-to-spectrogram models with a separate waveform synthesizer (vocoder), this solution simplifies the training algorithm by introducing the first text-to-wave fully convolutional neural architecture for speech synthesis, which enables fast end-to-end training from scratch. Details regarding our experience in using this model architecture are discussed in Section 5.
The key contributions of this paper are the following:
• Development of the first open-source TTS model for the Macedonian language.
• Development of the first high-quality speech dataset for the Macedonian language required for training a TTS model, which consists of 10,433 short audio clips of a single female speaker.
• Retraining an architecture for fully end-to-end TTS synthesis, from text records to inference, for the Macedonian language.
• Development of a speech recorder tool for recording and matching the audio clips to the corresponding text transcriptions, suitable for easier creation of any language dataset needed for TTS model training.
• Establishment of guidelines for other researchers via a discussion of the experience of using state-of-the-art Deep Learning network architectures for TTS modelling.
The rest of the paper is organized as follows. Section 2 presents the prominent work done in relation to creating a Macedonian TTS system over the past 30 years. Section 3 describes the created dataset, the data preprocessing, as well as the Deep Learning approaches experimented with in the paper. Section 4 provides the results from the testing.
Section 5 presents some valuable lessons learnt while trying to achieve the desirable human-like Macedonian TTS by exploring many TTS model implementations. We believe that these insights might be very useful for other researchers, as they are not usually found in related papers. Section 6 presents the overall conclusions derived from the development of the TTS system. In addition, the future directions for the possibilities to integrate the Macedonian TTS system in almost any kind of e-service in North Macedonia are also presented in Section 6.

Related Work
This section encompasses all of the previous prominent work done in relation to the effort to create a human-like text-to-speech synthesizer for the Macedonian language. The first research on the subject dates back to 1996. The authors of [27] present an experimental setup for real-time TTS conversion using the classical Neural Network approach, which was popular at that time. The results appeared promising and ready to be integrated into a system aimed at supporting visually impaired people, an ongoing project since 1993; however, we did not find any paper that later describes the finished system. The next year, in 1997, the authors of [28] presented another effort to create a speech synthesizer based on time-domain syllable concatenation.
In [29], the researchers present their experimental TTS system for the Serbian, Croatian, and Macedonian languages. They characterize Serbian and Croatian as tonal languages, meaning that they have high-low pitch patterns that are permanently associated with words, whereas Macedonian is a pitch-accented language with antepenultimate stress on most words, excluding clitics, words of foreign origin, as well as some other word groups. However, they provided a uniform dictionary-based strategy for lexical stress assignment for all three languages. The lexical stress assignment algorithm used led to low error rates for Serbian; however, since they did not use a separate speech database for Macedonian, the speech quality decreased, although it is reported to be still acceptable due to the fundamental similarity between the phonetic inventories of the two languages.
Gerazov has done most of the valuable work in this field. In [30], the first steps towards building a model for Macedonian intonation are presented. The purpose of the models is to significantly improve the intonation generation module of their TTS system in the development phase. Further intonation analysis has been undertaken in [31], where the focus of the analysis is set on the average intonation curves related to intonation phrases connected with punctuation: declaration starts, intermediates and ends, questions, and exclamations. The results led to important exceptions to the existing rules and found new consistent patterns. In continuation of the research, the dynamics and variance of the intonation across different speakers were investigated in [32]. In [33], the authors present details of the recording and segmentation process used in the creation of the unit inventory of the system in progress, "Speak Macedonian". The system itself uses a mixed-rank inventory featuring two sets of units: phones (34 basic phones, out of which 28 have a unique letter in the alphabet) and quasi-diphones (a variation of the classic diphones; 707 unique diphones were extracted). Quasi-diphones differ in that they encompass both the transition between the phones and the two phones themselves. The first paper presenting Macedonian diphone characteristics is [34].
Being in the final step of the development of their TTS system, the authors in [35] focus on Macedonian prosody as one of the most significant elements of the spoken language. Prosody comprises intonation, dynamics, and rhythm. Intonation, as understood from the papers presented, is essential for synthesizing speech with high-quality prosody. The authors consider two important steps in generating intonation patterns, namely, intonation prediction and pitch curve generation. Intonation prediction is the task of assigning intonation labels to the input text in accordance with the chosen intonation model. Pitch curve generation is closely related to the chosen prosody model (there are many available in the literature [36][37][38]). Therefore, instead of using the available models, the authors created a prosody model themselves and present a pitch curve generation algorithm. Later, their research focused on emotion recognition in Macedonian speech [39].
Another attempt to create a Macedonian TTS system used concatenative speech synthesizers [40]. Those systems are considered to be simpler, since they do not rely on phonetic transitions and co-articulation, or any other rules defined by linguists. Instead, concatenative speech synthesizers require well-defined recordings of speech in order to extract convenient speech segments. However, their solution mostly relies on diphones, and they include certain disyllables that appear very often in the Macedonian language itself. An interesting experimental synthesizer is also presented in [41], where the authors follow a cross-language development of speech recognition and synthesis applications for the Macedonian language by using a bootstrapping approach from a German acoustic model. To map between the source and target languages, the authors used both knowledge-based and data-driven approaches.
A new separate acoustic model was trained for HMM-based synthesis. Even though the authors faced low quality of the available recordings and sub-optimal phoneme mapping, the qualitative analysis showed that HMM synthesis produces satisfactory and, to some extent, intelligible synthetic speech.
In [42], the authors try to generate a speech synthesis module for the Macedonian language based on a Russian language model. They achieved phone mapping between the Russian and Macedonian languages, which showed good performance for some phones, but less promising results for phones that do not exist in the Russian language.
A machine learning approach to morphological analysis and synthesis of Macedonian nouns is presented in [43]. Orwell's "1984" has been used for extracting a whole set of Macedonian nouns, which are then used for training a model and testing by 10-fold cross-validation. The results have been compared with 275,000 Macedonian noun forms and the accuracy achieved was above 90%.

Dataset Creation
Creating an appropriate and suitable dataset is essential for any intelligent system, and so it is in this case. In order to create the dataset, we followed the guidelines of the LJ Speech Dataset [44], which is considered to be a golden standard for testing the accuracy of DNN-based synthesizers for the English language. The dataset was created by a female speaker using a professional microphone that eliminates most of the background noise. To facilitate the recording process, we created a software tool that automatically serves the speaker with the consecutive samples from the text corpus and enables on-click creation of audio files. The software interface is presented in Figure 1. As an input, the speaker needs to provide a .csv file that contains the whole text corpus in the format (wav_path|text_sample). Subsequently, the software automatically reads it line by line and the speaker generates .wav files by clicking on the "start" button. In order to be more user-friendly and less demanding of speaker interaction, as soon as the recording process starts, the button becomes red and its label changes to "stop". One more click on the button, or simply pressing the "enter" key on the keyboard, saves the audio file in .wav format. Consequently, a new text sample is displayed to be read. In the top right corner, the speaker is able to manually switch among the text samples if some sample needs to be recorded all over again.
For this system, the text corpus was created by reading the Macedonian translation of Mark Twain's "Huckleberry Finn", which produced 15 h of recorded material that corresponded to a total of 7686 recorded .wav files. Subsequently, the corpus was extended to 10,433 records by carefully choosing specific sentences that cover range of complex words.
The translated sentence that is displayed in Figure 1 is the first sentence from the book whose original form is "You don't know about me, without you have read a book by the name of The Adventures of Tom Sawyer, but that ain't no matter.".
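The (wav_path|text_sample) metadata format described above can be parsed with a few lines of Python; the file paths and sentences below are hypothetical placeholders for illustration, not actual entries from our corpus:

```python
import csv
import io

# Hypothetical two-line excerpt in the (wav_path|text_sample) format;
# real entries pair each recorded clip with its transcription.
metadata = "wavs/0001.wav|Прв пример.\nwavs/0002.wav|Втор пример.\n"

def load_metadata(f):
    # Each line holds an audio clip path and its text, separated by "|".
    reader = csv.reader(f, delimiter="|")
    return [(path, text) for path, text in reader]

pairs = load_metadata(io.StringIO(metadata))
```

The same flat structure is what the recorder tool consumes, which keeps the corpus portable across TTS training frameworks.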

Corpus Preprocessing
Before the dataset is fed into the deep-learning models, several preprocessing steps are applied in order to improve the quality of the audio and text files.
The editing process of the textual files required additional changes in the original texts in order to facilitate the alignment between audio files and texts. First of all, the abbreviations were replaced with their full written forms. Subsequently, typos were fixed to prevent confusing the deep-learning model, making sure that a high-quality text-sound corpus is achieved. Hereupon, numerical values (such as ordinal numbers, years, and quantitative values) were replaced with their textual representations. These transformations were also applied in the inference phase, because the models would otherwise not be aware of the language rules used in the Macedonian language.
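These normalization steps can be sketched in a few lines; the replacement tables below are tiny hypothetical examples (the actual dictionaries cover far more abbreviations and number forms):

```python
import re

# Hypothetical replacement tables for illustration only;
# the real dictionaries cover far more cases.
ABBREVIATIONS = {"итн.": "и така натаму"}
DIGITS = {"1": "еден", "2": "два", "3": "три"}

def normalize(text):
    # Expand abbreviations to their full written forms.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Replace digits with their textual representations.
    text = re.sub(r"\d", lambda m: DIGITS.get(m.group(), m.group()), text)
    # Collapse any extra whitespace left over from the edits.
    return re.sub(r"\s+", " ", text).strip()
```

The same function is applied to user input at inference time, so the model never sees symbols outside its training vocabulary.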
The audio files require additional preprocessing in order to remove noise that may appear and hinder the training process. Therefore, the silence at the beginning and at the end of the audio samples in the dataset was trimmed. Trimming proved to facilitate the alignment between the text utterances and the audio samples, which decreases the time requirements for training. Next, long sentences were split into smaller ones in order to lower the model's processing requirements, allowing only recorded sentences up to 10 s long. Consequently, this made it possible to use a larger batch size when using a GPU with lower performance.
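The silence-trimming step can be sketched with a simple energy-based heuristic (in practice a library routine such as librosa's trimming can be used; the threshold and frame length below are illustrative values, not the exact ones from our pipeline):

```python
import numpy as np

def trim_silence(wav, sr, threshold=0.01, frame_ms=10):
    # Split the signal into short frames and drop leading/trailing
    # frames whose RMS energy falls below the silence threshold.
    frame = max(1, int(sr * frame_ms / 1000))
    n = len(wav) // frame
    rms = np.sqrt(np.mean(wav[: n * frame].reshape(n, frame) ** 2, axis=1))
    voiced = np.where(rms > threshold)[0]
    if voiced.size == 0:
        return wav[:0]  # the whole clip is silence
    return wav[voiced[0] * frame : (voiced[-1] + 1) * frame]
```

Trimmed clips start and end near actual speech, which is what makes the text-audio alignment easier to learn.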

Deep Learning Approach
Deep Voice 3 was chosen as the most appropriate for the creation of the Macedonian-language TTS system, since it outperforms other models in terms of the trade-off between speed and generated speech quality according to our experimental results (more details are found in Section 5). It is able to synthesize more than 10M sentences per day [23] by using the leverage of the GPU. Moreover, its sound-to-phoneme mapping ability is the most suitable for the Macedonian language, which is consistent and phonemic in practice, and follows the principle of one grapheme per phoneme. This one-to-one correspondence is described by the principle, "write as you speak and read as it is written" [45].
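Because the orthography is phonemic, the model can consume raw characters directly; a minimal sketch of such a character-to-ID front end follows (the symbol inventory here is our illustrative assumption, not the exact set used in the released model):

```python
# Macedonian Cyrillic alphabet (31 letters) plus a few punctuation marks;
# an illustrative symbol inventory, not the exact one used by the model.
SYMBOLS = list("абвгдѓежзѕијклљмнњопрстќуфхцчџш") + [" ", ",", ".", "?", "!"]
CHAR_TO_ID = {c: i for i, c in enumerate(SYMBOLS)}

def text_to_sequence(text):
    # One grapheme maps to one phoneme, so no separate
    # grapheme-to-phoneme dictionary is needed.
    return [CHAR_TO_ID[c] for c in text.lower() if c in CHAR_TO_ID]
```

This is a key simplification relative to English TTS, where a pronunciation lexicon or learned G2P model is usually required.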
Deep Voice 3 is a fully convolutional architecture for speech synthesis. Its character-to-spectrogram architecture enables fully parallel computation, and the training is much faster than with RNN architectures. The quality of the architecture has been evaluated on many datasets that encompass 820 h of audio data from 2484 speakers [46]. The architecture generates monotonic attention behavior, which avoids error modes that are common in sequence-to-sequence models.
Deep Voice 3 architecture converts the characters, phonemes, stresses, and other textual features into a variety of vocoder parameters that are used as inputs into audio waveform synthesis models. Mel-band spectrograms, linear-scale log magnitude spectrograms, fundamental frequency, spectral envelope, and aperiodicity parameters are all vocoder parameters.
Mainly, the architecture consists of an encoder, a decoder, and a converter layer. The encoder layer transforms the previously defined textual features into internally learnt feature representations. Those features are in a (key, value) form, and they are fed into the attention-based decoder. The decoder uses its convolutional attention mechanism to transform those features into a low-dimensional audio representation, i.e., mel-scale log magnitude spectrograms that correspond to the output audio. The hidden states of the decoder are fed into the third, converter layer, which is capable of predicting the acoustic features for waveform synthesis. Figure 2 presents the detailed architecture of the model and the methodology workflow.
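The position-augmented attention at the core of the decoder can be sketched in NumPy: sinusoidal position embeddings are added to the keys and queries before a scaled dot-product attention, which biases the alignment towards monotonic behavior. The dimensions and the position rate below are illustrative, not the hyperparameters of the trained model:

```python
import numpy as np

def positional_encoding(length, dim, rate=1.0):
    # Sinusoidal position embeddings; the position index is scaled
    # by a rate before applying sin/cos, as in Deep Voice 3.
    pos = np.arange(length)[:, None] * rate
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def attention(queries, keys, values):
    # Scaled dot-product attention with a softmax over text positions,
    # producing one context vector per decoder (mel frame) query.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

T_text, T_mel, d = 12, 20, 16  # toy sequence lengths and feature size
keys = np.random.randn(T_text, d) + positional_encoding(T_text, d)
values = np.random.randn(T_text, d)
queries = np.random.randn(T_mel, d) + positional_encoding(T_mel, d, rate=T_text / T_mel)
context, weights = attention(queries, keys, values)
```

Matching the position rates of the text and mel sides is what encourages the near-diagonal (monotonic) alignments observed during training.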
In Figure 2, the whole workflow is depicted via four separate stages. The first is Dataset creation, showing the sources, the process of recording the sentences, as well as the production of files in a suitable format recognized by the TTS model. The following stage is Text and audio preprocessing, and it is responsible for assuring high-quality input to the TTS model. It comprises four preprocessing steps that refer to the audio files and five preprocessing steps that refer to the corresponding text files. The audio files underwent noise removal, in order to remove the background noise resulting from the technical equipment used for recording; then silence trimming, to equalize the beginning and end of the records; amplification, to strengthen the spoken speech; and segmentation of long audio into shorter parts of a maximum of 10 s (which was found to be the case in the LJ Speech dataset [44]). The corresponding text files were improved in such a way that all typos were fixed; the unrecognized symbols and unnecessary white spaces were removed; the numbers, ordinals, and units, as well as the abbreviations, were converted into fully written words; and, eventually, the long sentences were split to match the audio files segmented in the previous step.

Results
To achieve single-speaker synthesis, approximately 20 h of a high-quality Macedonian speech audio dataset recorded at a sample rate of 22.5 kHz was used. The training was performed by using an NVIDIA Tesla P100 with 16 GB of RAM. The total training time with a batch size of 16 was 21 days and 16 h. The training was completed after 620K steps. Figure 3 presents how the attention changes during the training process. As observed from the figure, the model started to produce intelligible, understandable, and partially human-like speech after 50K steps. Hereafter, the model improved itself by losing the "robotic" component in the synthesized speech and achieved completely human-like speech by the end of the training.
At each checkpoint, the model was evaluated on seven different sentences carefully chosen to be specific in order to test the abilities of the model. By the term specific, we mean that the examples cover special cases in the Macedonian language, such as: long words; compound words; a comma somewhere in the sentence, to check whether the model makes the appropriate pause when synthesizing the speech; sentences ending with a full stop, question mark, and exclamation mark, to check whether the model is able to change the intonation in the synthesized speech; and tongue twisters, including tongue twisters containing words with multiple adjacent consonants, such as the word "Shtrk". The audio files of the synthesized speech across the checkpoints are available on GitHub https://f-data.github.io/TTS/. Figure 4 presents the metrics describing the performance of the training models. The loss function is a metric that refers to the accuracy of the prediction. The main objective is to minimize the model errors, i.e., to minimize the loss function; thus, the Deep Learning algorithm iterates as many times as needed for the loss curve to become as flat as possible. In our case, the loss function behaves in the desired manner: it gradually decreases, converging to a value of 0.1731 after 162.2K steps, in four days and 11 h of training.
The learning rate plays a vital role in minimizing the loss function. It dictates the speed at which we want our model to learn. This value must be set properly: if it is set too high, for example, the training becomes unstable and the model cannot learn anything, so the results will be poor. The initial learning rate was set to 0.0005, as shown in Table 1. After four days and 5 h of training, or 151.2K steps, it decreased to a value of 0.000081335.
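As an illustration, a simple exponential decay schedule with a per-step factor chosen to roughly reproduce the reported drop from 0.0005 to about 0.00008 over 151.2K steps can be sketched as follows; the exact schedule applied by the training framework may differ:

```python
def exponential_lr(step, initial_lr=5e-4, decay=0.999988):
    # Per-step multiplicative decay of the learning rate; the decay
    # factor here is back-fitted from the reported values, not taken
    # from the actual training configuration.
    return initial_lr * decay ** step
```

The slow per-step factor means the rate shrinks by roughly a factor of six over the first 150K steps, letting early training move quickly and late training fine-tune.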
The gradient norm presented in the same figure is the L2 norm of the gradients of the last layer of the Deep Learning network. It is an indicator showing whether the weights of the Deep Learning network are properly updated. If its value is too small, it might indicate a vanishing gradient. This problem affects the upper layers of the Deep Learning network, making it really hard for the network to learn and tune the parameters. On the contrary, if its value is too high, it may indicate the exploding gradient phenomenon. In such a case, the model is unstable and not able to learn from data, since the accumulation of large error gradients during the training process results in very large updates of the Deep Learning model weights.
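Monitoring this quantity, and bounding it with gradient clipping as a guard against the exploding case, can be sketched as follows; the maximum norm here is an illustrative hyperparameter:

```python
import numpy as np

def global_grad_norm(grads):
    # L2 norm computed over all gradient tensors of a layer.
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale the gradients when their global norm exceeds max_norm,
    # a common guard against the exploding gradient phenomenon.
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]
```

Logging the norm at every checkpoint is what produces the curve shown in the figure; clipping keeps a single bad batch from destabilizing the weights.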
The last performance metric refers to the ability of the model to predict the mel-spectrograms. The L1 norm of the metric is decreasing across the iteration steps, reaching 0.03304 after 172.8 K steps, or four days and 18 h of training.
The model and the samples can be found on GitHub https://github.com/f-data/TTS-Makedonka. The quality of the model has been assessed on the basis of reliable and valid listening tests that measure overall TTS model performance via the Mean Opinion Score (MOS) [47,48]. Table 2 presents the obtained MOS values for the ground truth and the selected TTS model, which are 4.6234 ± 0.2739 and 3.9285 ± 0.1210, respectively. Fifteen distinct listeners performed the assessment on 40 original (ground truth) and 40 synthesized audio files. According to [47,48], the obtained value indicates good-quality audio, with no additional effort needed to understand the words, distinguishable sounds, non-annoying pronunciation, preferred speed, and a pleasant voice.
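The reported values are means over listener ratings with a confidence half-width; such an aggregate can be computed as follows (a normal-approximation interval is assumed here, and the exact interval construction in [47,48] may differ):

```python
import statistics

def mos(scores, z=1.96):
    # Mean opinion score (ratings on a 1-5 scale) together with a
    # normal-approximation 95% confidence half-width.
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width
```

Each listener rates each clip once, and all ratings are pooled before aggregation.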

Discussion
During the process of selecting the most appropriate model for the creation of the Macedonian TTS system, many state-of-the-art models were experimentally evaluated besides Deep Voice 3, such as: multiple implementations of Tacotron 2 (both TensorFlow [49] and PyTorch [50]), an implementation of Tacotron 2 for the Russian language [51], FastSpeech [49], FastSpeech2 [49], ClariNet [25], and also audio synthesizers such as MelGAN [49] and WaveGlow [52]. MelGAN and WaveGlow both showed very good performance, even after 100K iterations. Additionally, MelGAN and WaveGlow models pretrained on English could be successfully used as audio synthesizers for the Macedonian language without any difficulties.
Tacotron 2 is the most famous and promising model for producing human-like speech. However, the training process takes days before intelligible speech is synthesized. We also considered a distilled version used to create a TTS system for the Russian language. Following the principles of transfer learning, we tried to fine-tune the Russian TTS model; however, the experiments were not as successful as expected.
FastSpeech and FastSpeech2 are much faster than Tacotron 2; however, they are not completely independent. Actually, they rely on models for duration prediction, which could be either a Tacotron 2 model or MFA. Training TTS systems from scratch for a language other than English therefore requires a lot of preparation time before FastSpeech and FastSpeech2 can be used.
ClariNet's architecture is an extension of Deep Voice 3 and, therefore, it was taken into consideration for training our TTS model. However, ClariNet requires a pre-trained WaveNet model, meaning that we would need to train a reliable WaveNet model from scratch for the Macedonian language and then proceed to train a ClariNet model. We were unable to accomplish this within a reasonable time due to our resource limitations and, thus, we chose to work with Deep Voice 3.

Conclusions and Future Work
This paper presents a successful effort to train a human-like TTS model for the Macedonian language. After many attempts to implement an efficient and humanoid TTS system for the Macedonian language over the last 30 years, we are the first to build one and to publish the model, which is available for use as a module in any kind of software that needs its service.
The methodology presented in the paper relies on a previously confirmed Deep Learning-based methodology, Deep Voice 3. We built software for managing new recordings and created an approximately 20 h-long training corpus from scratch in order to achieve a high-quality model. The dataset has been preprocessed by following the example of the LJ Speech dataset [44], which is considered to be the golden standard for training English-language TTS systems. The deep neural network has been adjusted according to the needs of the Macedonian language.
Intelligible speech has been synthesized after 56 K steps of training, and acceptable quality has been achieved, even after 100 K steps of training. However, the model has been improved in the later steps, and the robotic-like components in the synthesized speech have been almost removed after 200 K steps of training. The quality of the generated audio files has been assessed while using the MOS metric, which is commonly used to assess the TTS systems.
Many intelligent systems may benefit from such a TTS system, such as the recommendation systems developed for social media networks [53][54][55], by establishing an interaction with the users and, thus, improving their experience through human-like communication.
In future work, we will extend the dataset with new sentences that cover as many different fields as possible, including many new words with different pronunciation complexity. We will also work on improving the pronunciation of words with irregular accents in the Macedonian language. The word stress in the Macedonian language is antepenultimate and dynamic, which means that it falls on the third from last syllable in words with three or more syllables, and on the first or only syllable in other words. However, this rule is sometimes disregarded when the word is among the more recent ones in the language or it is from a foreign language.
The training dataset, the synthesized samples at different checkpoints, the source code, and the trained model are publicly available for further improvement on GitHub https://github.com/f-data/TTS-Makedonka.