Tonal Contour Generation for Isarn Speech Synthesis Using Deep Learning and Sampling-Based F 0 Representation

Abstract: The modeling of fundamental frequency (F 0) in speech synthesis is a critical factor affecting the intelligibility and naturalness of synthesized speech. In this paper, we focus on improving the modeling of F 0 for Isarn speech synthesis. We propose an F 0 model based on a recurrent neural network (RNN). Values of F 0 sampled at the syllable level of continuous Isarn speech are combined with their dynamic features to represent the supra-segmental properties of the F 0 contour. Different architectures of deep RNNs and different combinations of linguistic features are analyzed to find the conditions for the best performance. To assess the proposed method, we compared it with several RNN-based baselines. The results of objective and subjective tests indicate that the proposed model significantly outperformed both the baseline RNN model that predicts values of F 0 at the frame level and the baseline RNN model that represents the F 0 contours of syllables using the discrete cosine transform.


Introduction
The fundamental frequency (F 0 ) contour plays an important role in speech synthesis systems. The F 0 contour controls the intelligibility and naturalness of synthetic speech. In the speech synthesis of a tonal language, tone is correlated with F 0 . Tone correctness is crucial because words with different tones convey different meanings even if their phonemes are similar [1]. Thus, it is necessary to generate an appropriate tone contour for tonal languages.
Several studies have proposed speech synthesis for tonal languages [2][3][4], and a few have developed speech synthesis for the Isarn language [5], a dialect of Thai. Isarn is classified as a low-resource language. In our previous work [5], hidden Markov model (HMM)-based speech synthesis was proposed for Isarn, which can generate synthetic speech with an acceptable level of naturalness. However, the generation of inappropriate F 0 contours still degrades the naturalness of synthetic speech, because considering the values of F 0 frame by frame is insufficient to model the suprasegmental features of the F 0 contour [6,7].
In the context of tonal languages, several studies have attempted to improve the modeling of F 0 for HMM-based speech synthesis. For example, a simple tone-separated tree structure with contextual tonal information was proposed to improve the correctness of the tone of HMM-based speech synthesis in Thai [8]. Multi-layer F 0 models for HMM-based speech synthesis have been proposed as well [6,7,9] by using models of F 0 to represent its patterns for different prosodic layers. These proposals can improve tone correctness and the naturalness of synthetic speech. However, performance is limited by decision tree-based clustering [10].
In the last few years, several types of neural network-based methods of speech synthesis have been proposed to overcome the limitations of HMM-based speech synthesis. Deep neural networks (DNNs) [10] have been used in place of decision tree-based clustering. However, the DNN ignores the sequential nature of speech because it assumes that each frame is sampled independently. To capture long-term dependencies, the recurrent neural network (RNN) with long short-term memory (LSTM) has also been applied as an acoustic model [11][12][13].
End-to-end speech synthesis using advanced neural network architectures has also been proposed, such as in Tacotron [14] and Deep Voice [15]. These techniques directly convert raw text into speech waveforms and generate speech that sounds more natural than both parametric and concatenative approaches. However, they require a large set of speech and text pairs for training, which is time consuming and expensive to collect [16]. These models are thus challenging to apply to low-resource languages. This study therefore focuses on a parametric speech synthesis approach.
As described in [17,18], the shape of the F 0 contour of a syllable can deviate due to several factors, such as tone co-articulation, stress, and declination effects. Thus, the contour should be modeled at the syllable level rather than the frame level. Several models are available for representing the F 0 contour, such as the tilt model [19,20], Fujisaki's model [21], pitch target approximation models [22], and the discrete cosine transform (DCT) [23,24]. These models substantially improve performance in terms of the generation of F 0. Nevertheless, some F 0 representation models require manual annotation for correction (e.g., the Fujisaki and pitch target approximation models) [25,26].
Although the RNN can capture suprasegmental characteristics of the F 0 contour, its internal connection structure considers F 0 only at the frame or segment level. In tonal languages in particular, the tone contour and its deviations are also characterized at higher levels (e.g., the syllable and word) [27][28][29].
In this paper, we propose a tone modeling technique based on the RNN for speech synthesis in Isarn. We first propose a sampling-based method of F 0 representation that can capture rich information concerning the tone contours within a syllable for Isarn. Syllable-level features are then modeled by the RNN to learn the suprasegmental characteristics of F 0 at higher prosodic levels. We also explore several architectures of RNN models and training strategies to generate tone contours. In terms of the linguistic level used for modeling and the representation of F 0, we compare the proposed model with a frame-based model and other established F 0 transforms, such as the DCT, using objective and perceptual evaluations.
The remainder of the paper is organized as follows: Section 2 briefly introduces the Isarn language and its tone, challenges posed by tone modeling, and past work in speech synthesis for Isarn. Section 3 describes the proposed F 0 model, and Section 4 describes the experiments and results. Our conclusions and recommendations for future research are presented in the final section.

Isarn Language and Tone
Isarn is a tonal language spoken in Northeastern Thailand. Locals speak Isarn in different dialects depending on the region. Some Isarn dialects have five tones but others have six [30]. In this research, we focus on the dialect spoken in the central Northeast of Thailand (covering the Khon Kaen, Kalasin, Udonthani, and Mahasarakham provinces). There are six tones in the central Isarn dialect: mid (M), low (L), mid-falling (MF), high-falling (HF), high (H), and rising (R). The following examples show words consisting of the same phonemes but different tones: M ค่า /kʰaː/ ("cost"), L ฆ่า /kʰàː/ ("kill"), MF ค้า /kʰâː/ ("trade"), HF คา /kʰa᷇ː/ ("stick"), H ข่า /kʰáː/ ("galangal"), and R ขา /kʰǎː/ ("leg"). Figure 1 shows the typical pattern of each tone, analyzed from the speech of a native male speaker.
In the past, Isarn was written using the Isarn Dharma script. The difficulty of representing Isarn with this script is that its written style lacks tone markers to identify the tones [31]. Nowadays, people mostly use Thai script, the official script of Thailand, to write the Isarn language. Although Thai script has tone markers, there is no formal standard for written Isarn [32]; the written form depends on personal style. Words with the same graphemes may be pronounced differently, with different meanings, depending on the surrounding words. This leads to ambiguity in pronunciation and text processing [33]. For example, the Isarn word "ยาม" can be pronounced /jaːm/ ("visit"), as in the sentence "ข้อยไปยามเจ้าอยู่บ้าน" /kʰɔj paj jaːm câw jú bâːn/ ("I visit you at home"), or /ɲa᷇ːm/ ("time"), as in the sentence "ยามได๋เจ้าสิไปโรงเรียน" /ɲa᷇ːm dǎj câw sì paj lo᷇ːŋ liːa᷇n/ ("What time will you go to school?").
In this study, the conversion of text into a linguistic specification is achieved using the front-end module [33] and corrected manually using audio recordings for reference.

Challenges of Tone Modeling
In tonal languages, tone behavior in continuous speech is complex even though there is only a finite number of tones. Several studies have examined deviations in tone in such languages as Thai and Mandarin [18,[34][35][36], but Isarn has not yet been considered in this way. Guided by these studies, we examined F 0 contours extracted from Isarn speech to identify factors affecting deviations in tone contours in continuous speech. According to past studies, many factors cause deviations in tonal contours in continuous speech, such as tone co-articulation, stress, and declination. Figure 2 shows a comparison of the F 0 contour of a sentence in Isarn pronounced in isolation and continuously. As shown, the F 0 contour of continuous speech deviates due to these factors.
Tone co-articulation is a phenomenon whereby the shape of the F 0 contour of a given syllable is affected by the F 0 contours of adjacent syllables, because the articulatory organs cannot respond rapidly enough to preserve the shape of the F 0 contours of the uttered syllables. For example, in Figure 2b, the F 0 contour of the syllable /ma᷇n/ is assimilated into that of the syllable /ma᷇ː/; note that the shapes of their F 0 contours differ from the patterns of their pronunciations in isolation, as shown in Figure 2a. The F 0 contours of stressed syllables differ from those of unstressed syllables and more closely approximate a stable pattern, as in the F 0 contours of the syllable /ma᷇n/ (unstressed) and the syllable /dǝ᷇ː/ (stressed). The declination effect refers to the downward trend of the F 0 level to conform to the intonation pattern of a larger prosodic unit, such as a phrase or a sentence.

Past Work on Isarn Speech Synthesis
HMM-based speech synthesis for Isarn was developed in [5]. The waveforms of the speech units were not used directly; instead, linguistic and acoustic features were extracted from a speech corpus. In the training stage, the speech waveform is converted into acoustic features, including F 0 and spectral features. The components of the acoustic parameters depend on the speech vocoder. Transcriptions are converted into linguistic features by a text analysis module. The acoustic models are then trained using the extracted acoustic and linguistic features. A duration model is also trained to determine phone duration. In the synthesis stage, linguistic contextual features are extracted from the input text and fed to the duration model. A sequence of speech parameters is then generated using the trained acoustic model and the information obtained concerning the phone duration. Finally, the speech waveform is synthesized through the speech vocoder.
In HMM-based speech synthesis for Isarn, the tonal syllable is modeled by using two or three contextual phone models, including the initial phone model, vowel phone model, and final phone model (optional). The six tones in Isarn are represented in terms of tone-context features.
However, HMM-based speech synthesis for Isarn still generates unnatural speech related to the generation of F 0. Therefore, we examined past studies to identify a method to improve the performance of Isarn speech synthesis. We found that RNN-based speech synthesis had achieved the best performance for other languages [11,12,37,38]. We also preliminarily implemented RNN-based speech synthesis for Isarn and observed that it often generated unnatural speech due to inappropriate F 0 contours. Therefore, we attempted to improve the F 0 model for Isarn speech synthesis.
Proposed F 0 Model
At the linguistic level, the proposed model considers temporal dependencies not across frames but across syllables. In tonal languages, the tone is indicated by the F 0 contour at the syllable level [1], and deviations in the F 0 contour also occur across syllables. From the perspective of feature representation, we propose a sampling-based approach that represents the F 0 contour of a syllable using sampled F 0 values and their dynamic features. We expect the sampling-based method to provide rich information for modeling the suprasegmental features of the F 0 contour. The model consists of two steps. First, the F 0 contours of the syllables are represented by the sampling-based method. Second, linguistic features are mapped onto the extracted parameters using the RNN. The development of the proposed model can be divided into two parts, F 0 contour modeling and synthesis, as shown in Figure 3. Details of each part are described below.

F 0 Contour Modeling
We construct the F 0 model based on the RNN. This part consists of three processes: Linguistic feature extraction, sampling-based F 0 representation, and model training. The details of each process are provided below.

Linguistic Feature Extraction
In this section, we formally describe the representation of input features for modeling the F 0 contour. We extracted linguistic features by using a question set and context-dependent labels of the HMM-based Isarn speech synthesis [5]. The linguistic features considered are based on multi-prosodic layers, such as syllable, word, intermediate phrase, intonation phrase, and utterance as listed in Table 1. Linguistic features can be divided into four parts: Tone, phone identity, position, and features of duration.
Tonal features are important for predicting the F 0 contours of syllables in Isarn speech synthesis, as investigated in our previous work [5]. Tonal features can help infer the rough curve of a tone's contour. However, other contextual features are needed to model the F 0 contour well, because tonal features alone cannot deal with complex variations in the F 0 contour.
Features of phone identity are combined to represent a syllable. Each syllable is represented by a one-hot vector encoding the phone identities and phone categories of the initial consonant, vowel, and final consonant. In Isarn, the final consonant is optional. We add a flag to indicate that a given syllable has no final consonant.
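The encoding above can be sketched as follows. The phone inventories here are illustrative placeholders for this sketch, not the full Isarn phone set used by the paper's front-end.

```python
# Sketch of syllable-level phone-identity encoding.
# INITIALS/VOWELS/FINALS are toy inventories, not the real Isarn phone set.
INITIALS = ["k", "kh", "m", "d", "h"]
VOWELS = ["a:", "i:", "@:"]
FINALS = ["n", "j", "w"]  # the final consonant is optional in Isarn


def one_hot(symbol, inventory):
    """One-hot encode a symbol; an absent symbol yields an all-zero vector."""
    vec = [0] * len(inventory)
    if symbol is not None:
        vec[inventory.index(symbol)] = 1
    return vec


def encode_syllable(initial, vowel, final=None):
    """Concatenate one-hot vectors plus a flag marking whether a final exists."""
    has_final_flag = [0 if final is None else 1]
    return (one_hot(initial, INITIALS) + one_hot(vowel, VOWELS)
            + one_hot(final, FINALS) + has_final_flag)


vec = encode_syllable("kh", "a:")  # open syllable: no final consonant
assert len(vec) == len(INITIALS) + len(VOWELS) + len(FINALS) + 1
assert vec[-1] == 0                # flag indicates the absent final
```

In the actual system, analogous one-hot blocks for phone categories would be appended in the same way.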
Positional features relate to the number and positions of syllables in the higher prosodic layers. When examining examples of utterances in the Isarn speech corpus, we found that the F 0 contour is also related to stress information; however, stress annotation is time consuming and expensive. Stressed syllables typically have a long duration and appear at the end of an utterance or intonation phrase [34]. We thus included positional features that consider the position of the given syllable at a higher prosodic level, as well as duration-related features. The effect of each feature set is investigated in Section 4.4.2.

Sampling-Based F 0 Representation
The authors of [39] used the Fujisaki model to generate F 0 contours without any modification, but this model does not perform as well as rule-based and frame-based approaches, which suggests that the parameters of the Fujisaki model are complex and challenging to predict. Ronanki et al. [40] used DCT coefficients to represent the template of an F 0 contour, but this approach still requires frame-level features to generate a smooth F 0 contour.
Instead of converting values of F 0 into parameters of other domains (e.g., the Fujisaki model and DCT), we propose a sampling-based approach to represent the F 0 contour. The main idea of the sampling-based F 0 model is to represent the F 0 contour within syllables by sampling it at an appropriate number of points. Dynamic features are included to provide temporal information concerning the values of F 0 in the given syllable and adjacent syllables. The use of dynamic features ensures that the sampling-based method can produce a smooth output F 0 contour. The output of this process is used as features for training the RNN. Details of the proposed method of F 0 representation are described as follows.
Before modeling a given F 0 contour, interpolation and smoothing must be performed. The interpolation process fills in artificial values of F 0 between its observed values where unvoiced speech segments or short pauses occur. Unusual values of F 0 (e.g., error points and micro-prosody) from regions of unvoiced speech are eliminated automatically using the phoneme regions described in the label files. Unvoiced speech segments are interpolated using piecewise cubic interpolation [41] and smoothed by a median filter. Note that we manipulate F 0 in the logarithmic domain.
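A minimal NumPy sketch of this preprocessing follows. For brevity, linear interpolation (np.interp) stands in for the piecewise cubic interpolation of [41], and the median filter is written out by hand; the contour values are toy log-F 0 numbers.

```python
import numpy as np


def interpolate_unvoiced(logf0):
    """Fill unvoiced frames (marked 0) by interpolating over voiced frames.
    The paper uses piecewise cubic interpolation; linear np.interp is a
    simplified stand-in here."""
    f0 = np.asarray(logf0, dtype=float)
    voiced = f0 > 0
    t = np.arange(len(f0))
    return np.interp(t, t[voiced], f0[voiced])


def median_smooth(x, width=5):
    """Simple odd-width median filter with edge replication."""
    pad = width // 2
    xp = np.pad(x, pad, mode="edge")
    return np.array([np.median(xp[i:i + width]) for i in range(len(x))])


contour = np.array([5.0, 5.1, 0.0, 0.0, 5.4, 9.9, 5.5])  # 0 = unvoiced frame
filled = interpolate_unvoiced(contour)    # gaps replaced by interpolated values
smoothed = median_smooth(filled)          # outlier (9.9) suppressed
assert np.all(filled > 0)
```

In practice the outlier removal would additionally consult the phoneme regions in the label files, as described above.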
Then, the values of F 0 in each syllable are sampled. The F 0 contour of an utterance, F = [f_1, f_2, ..., f_T], is represented by concatenating the sequences of sampled points of its N syllables. The values of F 0 in each syllable are sampled as

c_{(i-1)K + k} = f_{b_i + \lfloor (k-1) D_i / (K-1) \rfloor}, \quad k = 1, ..., K, \; i = 1, ..., N,

where f denotes the smoothed F 0 contour, b_i and D_i are the starting frame position and duration of syllable i, respectively, and K is the number of sampling points per syllable. The output vector C contains the KN sampled values of F 0.
To improve the performance of the model, additional features are used, because the sampled values of F 0 alone do not guarantee the continuity of the F 0 contour between adjacent syllable points. Thus, we use dynamic features computed from the sequence of sampled F 0 values. We expect dynamic features within and across syllables to improve the continuity of the generated contour:

O = [C^\top, (W_1 C)^\top, (W_2 C)^\top]^\top,

where C is the vector of the sampled log F 0 sequence and W_n is the window matrix for calculating the n-th dynamic feature described in [42]. Syllable-level features Y = [y_1, ..., y_N] are prepared by reshaping the sampled F 0 vector O such that each y_i collects the 3K static and dynamic values belonging to syllable i. In our representation method, an appropriate number of sampling points K is required, because an inappropriate K can eliminate detail or create unnecessary values of F 0. We tuned the model by setting K to approximately the mean duration of all syllables in the Isarn speech corpus. We also explore the effect of the number of sampling points in Section 4.4.3.
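The sampling and dynamic-feature steps can be sketched as follows. This is a simplified stand-in: np.gradient plays the role of the window matrices W_n of [42], and the syllable boundaries, durations, and K are toy values.

```python
import numpy as np


def sample_syllable_f0(f0, starts, durs, K=10):
    """Sample K evenly spaced log-F0 values from each syllable,
    where b_i is the start frame and D_i the duration in frames."""
    samples = []
    for b, D in zip(starts, durs):
        idx = b + np.round(np.linspace(0, D - 1, K)).astype(int)
        samples.append(f0[idx])
    return np.concatenate(samples)            # vector C, length K*N


def add_deltas(c):
    """Append first/second-order dynamics (np.gradient as a simple
    stand-in for the delta window matrices W_1, W_2)."""
    d1 = np.gradient(c)
    d2 = np.gradient(d1)
    return np.stack([c, d1, d2], axis=1)      # vector O, shape (K*N, 3)


f0 = np.linspace(5.0, 5.5, 60)                # toy 60-frame utterance, 2 syllables
C = sample_syllable_f0(f0, starts=[0, 30], durs=[30, 30], K=10)
O = add_deltas(C)
Y = O.reshape(2, -1)                          # per-syllable vectors, dimension 3K
assert Y.shape == (2, 30)
```

Each row of Y is one syllable's 3K-dimensional target vector for the RNN.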

RNN Training
We use the RNN to map input features to output features. The RNN was proposed to overcome the limitation of the feedforward neural network (FFNN), which ignores temporal information in sequential data. In the RNN, information from previous time steps is used as input to the next time step. Given a sequence of input feature vectors, [x_1, ..., x_T], and a sequence of output feature vectors, [y_1, ..., y_T], the RNN computes the hidden state vectors and output vectors as

h_t = \sigma_h(W_{xh} x_t + W_{hh} h_{t-1} + b_h),
y_t = \sigma_y(W_{hy} h_t + b_y),

where x_t, y_t, and h_t are the input vector, output vector, and hidden state vector, respectively, at time t.
\sigma_h is the activation function of the hidden layer, \sigma_y is the activation function of the output layer, W_{xh} denotes the weight matrix between the input and hidden layers, W_{hh} is the weight matrix between consecutive hidden states, W_{hy} is the weight matrix between the hidden and output layers, and b_h and b_y are the bias vectors of the hidden layer and the output layer, respectively. The conventional RNN can use only information from the past. The bi-directional RNN was developed to use information from both past and future inputs, and it outperforms the unidirectional RNN in many tasks, such as the front end of text-to-speech systems [43], speech recognition [44], speech synthesis [11], and machine translation [45,46]. The bi-directional RNN processes the input sequence forward and backward to capture past and future information, respectively, in each layer. The two hidden states are then concatenated to produce the output:

\overrightarrow{h}_t = \sigma_h(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}),
\overleftarrow{h}_t = \sigma_h(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}),
y_t = \sigma_y(W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y),

where \overrightarrow{h} and \overleftarrow{h} denote the forward and backward hidden state vector sequences, respectively. In practice, the performance of the RNN is limited when modeling long-term dependencies in sequential features, owing to the vanishing gradient problem. We use recently proposed recurrent units, such as the gated recurrent unit (GRU) [45] and the long short-term memory (LSTM) unit [47], to alleviate this problem. Based on past work [48], we employed the bidirectional LSTM (BLSTM).
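The bi-directional recurrence can be illustrated with a plain NumPy forward pass. For brevity, this toy version shares the weight matrices between the forward and backward directions and uses a plain tanh unit, whereas a real BLSTM learns separate per-direction weights and gated LSTM cells; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out, T = 4, 3, 2, 5
Wxh = rng.standard_normal((D_h, D_in)) * 0.1   # input-to-hidden weights
Whh = rng.standard_normal((D_h, D_h)) * 0.1    # hidden-to-hidden weights
Why = rng.standard_normal((D_out, 2 * D_h)) * 0.1  # concat(fwd, bwd) -> output
bh, by = np.zeros(D_h), np.zeros(D_out)


def run_direction(X, reverse=False):
    """One pass of h_t = tanh(Wxh x_t + Whh h_{t-1} + b_h),
    run backward in time when reverse=True."""
    h, out = np.zeros(D_h), []
    steps = reversed(range(T)) if reverse else range(T)
    for t in steps:
        h = np.tanh(Wxh @ X[t] + Whh @ h + bh)
        out.append(h)
    return np.array(out[::-1] if reverse else out)  # restore time order


X = rng.standard_normal((T, D_in))
H = np.concatenate([run_direction(X), run_direction(X, reverse=True)], axis=1)
Y = H @ Why.T + by    # linear output layer on concatenated hidden states
assert Y.shape == (T, D_out)
```

Replacing the tanh update with gated LSTM cells yields the BLSTM used in the paper.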
To train the model, the input features are a sequence of linguistic features. Each input feature vector contains 153 dimensions extracted from the linguistic features in Table 1. The output feature vector is the 3K-dimensional vector of sampled F 0 values with their dynamic features, where K is the number of sampling points per syllable. In theory, contextual information is modeled by the internal connections of the recurrent model structure; we therefore omit explicit contextual features of neighboring units and include only tonal context. The output is a sequence of F 0 vectors and their dynamic features sampled from the original F 0 contour. Both the input and output features are normalized to zero mean and unit variance.
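The feature normalization step amounts to standard z-scoring, with statistics computed on the training set and reused at synthesis time; a minimal sketch:

```python
import numpy as np


def zscore(X, mean=None, std=None):
    """Normalize features to zero mean and unit variance.
    Pass stored training-set statistics (mean, std) at synthesis time."""
    mean = X.mean(axis=0) if mean is None else mean
    std = X.std(axis=0) if std is None else std
    return (X - mean) / np.where(std > 0, std, 1.0), mean, std


X = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])  # toy feature matrix
Xn, mu, sd = zscore(X)
assert np.allclose(Xn.mean(axis=0), 0)
assert np.allclose(Xn.std(axis=0), 1)
```

Constant features (std = 0) are left unscaled to avoid division by zero.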
The hyper-parameters (i.e., the number of hidden layers, number of hidden units, and learning rate) of all models were tuned to achieve close to optimal results on the development set. The model weights were optimized using the Adam-based back-propagation algorithm [49]. To avoid over-fitting, we applied an early stopping criterion that stops training when the validation loss has not decreased for 10 consecutive epochs. The maximum number of epochs was set to 150. All models in this work were implemented using the Keras framework with TensorFlow as the back end [50,51].
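The stopping criterion amounts to the following logic (in Keras this is the EarlyStopping callback; the loop below is an equivalent plain-Python sketch over a precomputed validation-loss sequence):

```python
def train_with_early_stopping(val_losses, patience=10, max_epochs=150):
    """Return (stopping epoch, best validation loss): stop once the loss
    has not improved for `patience` consecutive epochs, capped at max_epochs."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, wait = loss, 0        # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best      # stopped early
    return min(len(val_losses), max_epochs), best


# Toy loss curve: improves for 5 epochs, then plateaus, so training
# stops at epoch 15 (5 + patience).
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.45] * 30
stopped_at, best = train_with_early_stopping(losses)
assert stopped_at == 15 and best == 0.4
```

In the real system, each entry of the loss sequence would come from one training epoch on the development set.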

Synthesis of F 0 Contour
To generate the F 0 contour, the input text is converted into linguistic features using a text analysis module, and these features are used to predict the durations of the phonemes using the RNN-based duration model. Duration features are included in the linguistic features, and the input features are fed to the trained RNN-based F 0 model to obtain the sequence of output feature vectors Ŷ = [ŷ_1, ŷ_2, ..., ŷ_N]. Then, the predicted output features comprising the sampled F 0 values and dynamic features are transformed into a feature sequence Ô = [ô_1, ô_2, ..., ô_NK], and the F 0 contour is generated using a parameter generation algorithm [52] with global variance computed on log F 0. Finally, the smoothed F 0 vector is scaled to the duration of each syllable. Note that the proposed model generates a continuous F 0 contour; voiced/unvoiced flags are obtained from the baseline RNN-based acoustic model described in Section 4.5.
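The final step, scaling a syllable's K predicted samples to its frame-level duration, can be sketched as a simple interpolation. This is only an illustrative stand-in: the real system produces the smoothed trajectory with the parameter generation algorithm of [52].

```python
import numpy as np


def expand_to_frames(sampled, dur_frames):
    """Stretch K predicted log-F0 samples over a syllable's frames by
    linear interpolation (stand-in for the parameter generation step)."""
    K = len(sampled)
    src = np.linspace(0, dur_frames - 1, K)   # sample positions within syllable
    return np.interp(np.arange(dur_frames), src, sampled)


syl = np.array([5.0, 5.2, 5.1])               # K = 3 predicted log-F0 samples
contour = expand_to_frames(syl, dur_frames=9) # predicted syllable lasts 9 frames
assert len(contour) == 9
```

The per-syllable contours are then concatenated, and unvoiced frames are masked out using the voiced/unvoiced flags.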

Experiments and Results
In this section, we describe the speech corpus, the method of feature extraction, and the construction of the proposed model and of the conventional frame-based and DCT-based F 0 models, the last two of which were employed as baselines. The performance of the proposed and baseline models was measured using objective and subjective evaluations.

Speech Corpus and Feature Extraction
The Isarn speech corpus contains 4700 utterances by one male native Isarn speaker (five hours of speech) [5]. The corpus was carefully read using text gathered from many sources, such as news articles, web pages, and web boards. Statistical information on the utterances in the corpus is given in Table 2, and the total number of syllables for each tone is shown in Table 3. We divided the speech data into three subsets: 3980 utterances for training, 300 for validation, and 420 for testing. We used a sampling rate of 32,000 Hz instead of the 16,000 Hz used in other studies because it does not degrade the quality of the synthetic speech and is equivalent to using higher sampling rates [53]. The acoustic features were extracted using the WORLD vocoder [54] with a 5 ms frame shift and consisted of three parts: Mel-cepstral coefficients, band aperiodicity, and log F 0.

Optimization and Evaluation Metrics
To evaluate and optimize the performance of the model, the root mean-squared error (RMSE) and correlation (CORR) were used. These metrics consider only the frames in which both the F 0 extracted from natural speech and its predicted value are voiced. Hence, they were modified as

\mathrm{RMSE} = \sqrt{\frac{1}{|V|} \sum_{t \in V} (f_t - \hat{f}_t)^2},
\mathrm{CORR} = \frac{\sum_{t \in V} (f_t - \mu_f)(\hat{f}_t - \mu_{\hat{f}})}{\sqrt{\sum_{t \in V} (f_t - \mu_f)^2 \sum_{t \in V} (\hat{f}_t - \mu_{\hat{f}})^2}},

where V is the set of time indices at which both the F 0 extracted from natural speech and its predicted value are voiced, f_t and \hat{f}_t denote the extracted and predicted F 0, respectively, and \mu_f and \mu_{\hat{f}} are the mean values of the extracted and predicted F 0 over V. The average values of RMSE and CORR on the test set were used as objective metrics; a lower RMSE and a higher CORR indicate better prediction performance.
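The modified metrics can be implemented directly; in the sketch below the arrays are toy values and the combined voiced mask plays the role of the set V:

```python
import numpy as np


def voiced_rmse_corr(f_ref, f_pred, voiced_ref, voiced_pred):
    """RMSE and correlation over frames voiced in BOTH the reference
    and the prediction (the set V in the text)."""
    V = voiced_ref & voiced_pred
    r, p = f_ref[V], f_pred[V]
    rmse = np.sqrt(np.mean((r - p) ** 2))
    corr = np.corrcoef(r, p)[0, 1]   # Pearson correlation over V
    return rmse, corr


ref = np.array([100.0, 110.0, 0.0, 120.0, 130.0])    # 0 marks an unvoiced frame
pred = np.array([102.0, 108.0, 95.0, 118.0, 131.0])
rmse, corr = voiced_rmse_corr(ref, pred, ref > 0,
                              np.ones(5, dtype=bool))
assert rmse < 3.0
```

Note that the unvoiced reference frame (the third) is excluded even though the prediction there is voiced.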

Baseline Systems
As reported in recent work [11,12,38], RNN-based models significantly outperform HMM-based and DNN-based models. Therefore, we used RNN-based models as baselines. In terms of F 0 representation, the DCT-based model was used as a baseline because of the advantages reported in [23]. Details of the implementation of the baselines are described below.

Frame-Based Model
The frame-based model was trained as in [11,12] to generate only F 0, because the performance of the F 0 model degrades when F 0 and the spectral parameters are modeled simultaneously [55]. The input features consisted of the same set employed in the model for generating speech, whereas the output features consisted of log F 0 with dynamic features and a voiced/unvoiced flag. Similar to [11], we included all silence frames to maintain the continuity of the F 0 contour within a sentence. The hyper-parameters were tuned as in the proposed model. The network structure with the best performance consisted of three feedforward layers with 256 nodes per layer, where the top two hidden layers had a BLSTM structure, each combining 128 forward units with 128 backward units. The model was trained with a learning rate of 0.0001 and a mini-batch size of 64.

DCT-Based Model
We also trained a model that uses DCT coefficients to represent the F 0 contour within the syllable. This model and the proposed model were trained using the same input features, but their output features differed. Based on past studies [23,24], we used 10 DCT coefficients, C = [c_0, c_1, ..., c_9], to represent the F 0 contour of each syllable, where c_0 represents the mean F 0 over the syllable and the other coefficients represent the curve of F 0 within the syllable. The network structure with the best performance consisted of three feedforward layers with 512 nodes per layer, where the top two hidden layers had a BLSTM structure, each combining 128 forward units with 128 backward units. As with the frame-based model, the model was trained with a learning rate of 0.0001 and a mini-batch size of 64.
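The DCT parameterization can be sketched as follows. The contour is a toy example; with the orthonormal DCT-II used in this sketch, c_0 equals the syllable mean up to a sqrt(N) scale factor, matching the role of c_0 described above.

```python
import numpy as np


def dct_coeffs(x, n_coeffs=10):
    """First n_coeffs orthonormal DCT-II coefficients of a syllable's
    log-F0 contour; c0 encodes the level, the rest encode the shape."""
    N = len(x)
    n = np.arange(N)
    # DCT-II basis rows: cos(pi * k * (2n + 1) / (2N)) for k = 0..n_coeffs-1
    C = np.array([np.cos(np.pi * k * (2 * n + 1) / (2 * N))
                  for k in range(n_coeffs)])
    scale = np.sqrt(np.where(np.arange(n_coeffs) == 0, 1.0, 2.0) / N)
    return scale * (C @ x)


contour = 5.0 + 0.1 * np.sin(np.linspace(0, np.pi, 40))  # toy syllable contour
c = dct_coeffs(contour)
assert len(c) == 10
```

Truncating to 10 coefficients discards fine detail, which is one reason this baseline needs frame-level smoothing to produce the final contour.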

Proposed Model Construction
In this section, we describe the construction of the proposed model and analyze several factors to achieve the best performance. These factors include architectures for the training model, linguistic features to feed as input features, and the number of sampling points per syllable used as output features.

Analysis of Model Architectures
We explored the many-to-many LSTM model with three different architectures: stacked bi-directional LSTM (SBLSTM), feedforward followed by bi-directional LSTM (FF-BLSTM), and bi-directional LSTM followed by feedforward (BLSTM-FF), as shown in Figure 4. The SBLSTM structure follows previous work [12,56] and employs only BLSTM hidden layers. The FF-BLSTM and BLSTM-FF are hybrids of feedforward and BLSTM layers: in the FF-BLSTM, lower feedforward hidden layers are cascaded with upper BLSTM hidden layers, as adopted from work on text-to-speech systems [11], whereas the BLSTM-FF uses BLSTM lower layers and feedforward upper layers. All models used linear activation in the output layer and were trained with a learning rate of 0.0001 and a mini-batch size of 64. The RMSE and CORR of the best model in each network architecture are shown in Table 4. Note that the voiced/unvoiced flags were obtained from the baseline RNN-based speech synthesizer, the implementation of which is described in Section 4.5. As shown in Table 4, the numbers of model parameters differ considerably because we varied the number of hidden layers and units and selected the best performer for each architecture. Table 4 shows that the SBLSTM architecture delivered the poorest performance, while the FF-BLSTM and BLSTM-FF architectures performed similarly. This indicates that hybridizing feedforward and LSTM layers can improve model performance. We selected the FF-BLSTM for the next evaluation. Table 4. Objective results of the best model in each network architecture (± denotes standard deviation and RMSE denotes root mean-squared error).

Analysis of Linguistic Features
We investigated the influence of each linguistic feature set on predictive performance by training F 0 models with several combinations of linguistic feature sets. The best-performing RNN architecture (FF-BLSTM) with the same hyper-parameters was used. The RMSE and CORR of the models trained with different combinations of feature sets are shown in Table 5. The results show that the tonal feature set substantially improved prediction performance (PH_TN was better than PH), while the combination of all features (PH_TN_PS_DU) achieved the best performance. This indicates that combining the phone, tonal, positional, and duration feature sets improves prediction performance.

Analysis of the Number of Sampling Points
We examined the effect of the number of sampling points per syllable (K) on prediction performance. Based on the distribution of syllable durations in Figure 5, we hypothesized that the appropriate value of K would be close to the mean syllable duration. To test this, we trained the model while varying the value of K, using the best FF-BLSTM model with the same hyper-parameters. The RMSE and CORR values are shown in Table 6. The model trained with K = 45 gave the best performance in terms of RMSE, while the CORR values of all models were similar.

Speech Generation
To measure the perceptual performance of the models, synthetic speech must be generated. We employed the FF-BLSTM model to predict spectral features based on [11]. The model was trained using input linguistic features adopted from the question set for training the HMM-based speech synthesis model for Isarn [5]. The input feature vector consisted of 489-dimensional linguistic features: 472 dimensions of categorical linguistic contexts (e.g., phoneme identities, tone of the syllable), 14 dimensions of numerical linguistic contexts (e.g., position of the current syllable in the current word, number of syllables in the current word), and 3 dimensions of frame-level features.
The frame-level input features were considered, including forward/backward positions of the given frame in the given phone and phone duration. The output feature vector comprised 196-dimensional acoustic features containing the 60-dimensional Mel-cepstral coefficients, 4-dimensional band aperiodicities, and log F 0 with their dynamic features and voiced/unvoiced flags. Similar to [11], we included all silence frames for training to preserve the continuity of acoustic features within a sentence. The input and output features were normalized to have zero mean and unit variance. We used the network structure of three feedforward layers, with 512 nodes per layer, where the top two hidden layers had a BLSTM structure. Each combined 128 forward and 128 backward nodes. This model was used to generate the speech parameters for all experiments.
In the synthesis stage, the sequence of input feature vectors was fed to the trained RNN-based acoustic model to produce the speech parameters. These speech parameters were then smoothed using a parameter generation algorithm [52]. Following this, the output speech parameters were enhanced using a post-filtering algorithm [57] to improve the naturalness of the synthetic speech. Finally, the speech waveform was generated through the speech vocoder.

Objective Evaluation
The objective evaluation measured the distortion between the original and generated F 0 contours. However, we noticed that the RMSE and CORR values of the models were only slightly different. In this case, a perceptual evaluation is required to further measure the performance of the models, because objective results do not always correlate well with listener perception [58].

Table 7. Objective results of proposed model compared with baseline models.
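The two objective metrics can be computed as follows; restricting the comparison to frames voiced in both contours (marked here by F 0 > 0) is an assumption about the exact frame selection, as the text does not specify it.

```python
import math

# Sketch of the objective metrics: root mean squared error (RMSE) and
# Pearson correlation (CORR) between reference and generated F0,
# computed only over frames voiced in both contours.

def f0_metrics(ref, gen):
    pairs = [(r, g) for r, g in zip(ref, gen) if r > 0 and g > 0]
    n = len(pairs)
    rs = [r for r, _ in pairs]
    gs = [g for _, g in pairs]
    rmse = math.sqrt(sum((r - g) ** 2 for r, g in pairs) / n)
    mr, mg = sum(rs) / n, sum(gs) / n
    cov = sum((r - mr) * (g - mg) for r, g in pairs)
    corr = cov / math.sqrt(sum((r - mr) ** 2 for r in rs) *
                           sum((g - mg) ** 2 for g in gs))
    return rmse, corr

ref = [120.0, 125.0, 130.0, 0.0, 140.0]  # 0.0 marks unvoiced frames
gen = [118.0, 126.0, 131.0, 0.0, 138.0]
rmse, corr = f0_metrics(ref, gen)
```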

Subjective Evaluation
Typically, the objective evaluation is useful for training the model but does not reflect the perception of the listener [58]. Thus, we also conducted subjective preference tests. As these tests were intended to assess the generation of F 0 by the models, the spectral parameters were generated using the same model for all systems, whereas F 0 was generated using the different models. To force the listeners to concentrate on the generation of F 0, the phone durations were obtained from the transcription files, because perceived prosody also depends on the performance of the duration model.
A total of 30 native speakers participated in each test. All listeners were fluent speakers of the central Isarn dialect. In each test, the subjects were asked to listen to 20 pairs of utterances (samples of synthetic speech from the three F 0 models are available at https://isarn-samp-f0.github.io) randomly selected from the test set, and to determine which item in each pair sounded more natural, or to choose a "no preference" option if they found the two utterances very similar. The order of the speech samples in each pair was swapped. The listeners were allowed to play back the recordings as many times as they wanted before assigning a score. A t-test was used to determine whether the differences between the compared systems were significant (p < 0.01).
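The significance test can be sketched as a paired t-test on per-listener preference rates; the paper only states that a t-test was used, so this formulation, and the preference rates below, are illustrative assumptions rather than the paper's data.

```python
import math

# Sketch of a paired t-test on per-listener preference rates for
# system A vs. system B. The rates are made-up illustrative numbers.

def paired_t(a, b):
    """Return the paired t statistic (df = len(a) - 1)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

pref_a = [0.60, 0.55, 0.70, 0.65, 0.50, 0.75]  # fraction preferring A
pref_b = [0.30, 0.35, 0.20, 0.25, 0.40, 0.15]  # fraction preferring B
t = paired_t(pref_a, pref_b)
```

The resulting t statistic would then be compared against the critical value for the chosen significance level (here, p < 0.01) with n - 1 degrees of freedom.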
Three preference tests were conducted: frame-based model vs. DCT-based model, SAMP-based model vs. frame-based model, and DCT-based model vs. SAMP-based model. Figure 6 shows the preference scores of the system pairs. The preference score of the frame-based model was clearly lower than those of the DCT-based and SAMP-based models, which were trained using syllable-level features, even though the DCT-based model had performed worse in the objective evaluation. This indicates that using syllable-level features is effective for learning the complex variations in tonal contours. However, the difference between the frame-based and DCT-based models was not significant (p = 0.3067). Considering the representation of F 0, the SAMP-based model was significantly better than the DCT-based model (p = 0.005). This indicates that the sampling-based method provides a better representation of the F 0 contour for the Isarn speech corpus. To demonstrate the effect of using the proposed syllable-level features, Figure 7 compares the reference F 0 contour and the F 0 contours generated by the three systems for the sentences "ฮ่วยมาเบิ่ง มาเบิ่ง สิอ่านให้เงินปลอมอีหลีบ่" /huːâj ma᷇ː bәŋ ma᷇ː bәŋ sì ɂáːn hàj ŋә᷇n pↄːm ɂíːlǐː bↄː/ ("Hey, come to see, I am checking for counterfeit money." in English translation) and "ยามควายเจ้าออกลูก เจ้าอย่าลืมถ้าเอาน้องควายนำเด้อ ข้อยมักกินต้มน้องควาย เจ้าอย่าลืมถ้าเอาน้องควายนำเด้อ" /ja᷇ːm kʰuːa᷇j câw ɂↄːk lûːk câw jáː lɯ᷇ːm tʰàː ɂaw nↄːŋ kʰuːa᷇j na᷇m dә᷇ː kʰↄj mak kin tôm nↄːŋ kʰuːa᷇j || ja᷇ːm kʰuːa᷇j câw ɂↄːk lûːk câw jáː lɯ᷇ːm tʰàː ɂaw nↄːŋ kʰuːa᷇j na᷇m dә᷇ː/ ("When your buffalo gives birth, you don't forget to get its placenta." in English translation). As shown in Figure 7a, the F 0 contour generated by the proposed model was more appropriate than those of the baseline systems at both the syllable level (e.g., frames 300 to 350) and the utterance level (e.g., frames 75 to 275). Figure 7b shows inappropriate F 0 contours generated by the DCT-based method (e.g., frames 250 to 300). This likely occurred because the DCT-based method generated sub-par values for some coefficients, causing the overall F 0 contour to deviate.
Figure 7. Comparison of F 0 contours generated using frame-based, DCT-based, and SAMP-based models: (a) the sentence meaning "Hey, come to see, I am checking for counterfeit money." in English; (b) the sentence meaning "When your buffalo gives birth, you don't forget to get its placenta." in English.

Conclusions and Future Work
In this paper, we proposed an RNN-based F 0 model for Isarn speech synthesis. The model generates F 0 contours at the syllable level instead of the frame level. The F 0 contour within a syllable was represented by sampled values of F 0 and their dynamic features. To achieve the best performance, we investigated several model architectures: SBLSTM, FF-BLSTM, and BLSTM-FF. Based on the objective test, the hybrid of feedforward and BLSTM layers delivered the best performance. We compared the optimized model with the frame-based and DCT-based models in terms of the representation of F 0. The objective results of the proposed method and the baselines differed only slightly. However, the subjective tests showed that the proposed model significantly outperformed the baseline systems. This suggests that modeling F 0 at the syllable level using the proposed sampling-based representation is effective for learning the complex variations in tonal contours. In future work, we will focus on the generation of phoneme durations to further improve the naturalness of synthesized speech.