Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis

Abstract: An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods that exploit the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at the phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes to insert hidden states into phone sequences to handle cases in which pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, the hidden states absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options: moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA achieves better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA matched the manually labeled phrase boundaries quite well.


Introduction
Statistical parametric speech synthesis (SPSS) [1] is currently a mainstream approach to speech synthesis. It consists of three main components: text analysis, acoustic modeling, and waveform reconstruction. Text analysis [2] extracts linguistic features, such as phone transcriptions and prosodic structures, from input texts. Acoustic modeling aims to represent the mapping between linguistic and acoustic features using statistical models [3]. Vocoders [4,5] are used to reconstruct speech waveforms from the predicted acoustic features at synthesis time. Recently, neural-network-based sequence-to-sequence (Seq2Seq) acoustic models such as Tacotron [6] and Tacotron2 [7], and neural vocoders such as WaveNet [5], have been proposed and have improved the naturalness of SPSS significantly.
The attention mechanism imitates the human brain. For example, human vision can quickly scan an image to find the target area that needs focus, and then devote more attention to this area to obtain more detailed information about the target. Seq2Seq acoustic models also use an attention mechanism to bridge the encoder and decoder, so that the decoder attends to different parts of the input text when generating the acoustic features of each frame. One issue with the original Tacotron is that its attention mechanism is not robust enough, which may lead to errors in the predicted acoustic features, such as repeating, skipping, and attention collapse. Repeating occurs when too much attention stays on a certain input, resulting in a stuttering effect in the synthetic speech. Skipping occurs when too little attention stays on a certain input, resulting in missing words in the synthetic speech. Attention collapse occurs when the attention mechanism fails to determine which part of the text to attend to, resulting in unintelligible synthetic speech. One approach to alleviating this issue is to modify the attention mechanism to exploit the monotonic property of the alignment between phone sequences and acoustic feature sequences. Several improved attention techniques, such as forward attention [8] and stepwise monotonic attention (SMA) [9], have been proposed. In SMA, alignment paths are constrained to be strictly stepwise and monotonic, which means that the attention at each decoding step can only stay at the same phone or move forward to the next phone, without moving backward or skipping. This strategy effectively improved the robustness of Seq2Seq speech synthesis.
On the other hand, phrase boundaries [10] are important prosodic labels for speech synthesis and they are usually indicated by pauses in continuous speech. Figure 1 shows the waveform and the aligned transcription for an example sentence, whose English translation is "The weak schools with low funding, poor conditions and low salaries are unable to retain talents. Some middle schools even need external teachers for Chinese and mathematics classes". We can see that each phrase boundary (<pb>) corresponds to a short pause (sp) in this sentence. However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion [11] may not contain the pauses at the phrase boundaries in utterances, considering the costs of labeling phrase boundaries at the training stage and predicting them at synthesis time, especially for low-resource languages. The lack of explicit pause positions challenges the assumption of strictly stepwise alignment in SMA and may degrade the quality of synthetic speech when phrase boundaries are not available.
Therefore, this paper proposes to insert hidden states into phone sequences to handle cases in which pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states, as the standard SMA cannot handle them well. In this method, hidden states are employed to absorb the pause segments in utterances in an unsupervised way. In comparison with SMA, the attention of SSMA at each decoding frame has three options for the next frame: moving forward to the next phone, staying at the same phone, or jumping to the hidden state. Experimental results show that SSMA outperformed SMA in both objective and subjective evaluations when phrase boundaries were not given. Furthermore, the F1 score between the pause positions derived from the alignment paths of SSMA and the manually labeled phrase boundaries was 56.22%, which demonstrates the ability of SSMA to learn phrase boundaries without supervision.
The paper is organized as follows. Section 2 briefly reviews the existing stepwise monotonic attention mechanism. Section 3 introduces our proposed method. Sections 4 and 5 present the experimental results and conclusions, respectively.

Related Work
The Tacotron [6] model unified acoustic modeling and duration modeling in a single model, and adopted a simple additive attention mechanism [12] to calculate the attention weights from the query and keys. Tacotron2 [7] then used a location-sensitive attention mechanism [13], but alignment errors still occurred, especially for out-of-domain texts. These alignment errors led to robustness problems in the predicted acoustic features, such as repeating, skipping, and failing to stop. To alleviate these problems, a number of methods have been proposed, including non-autoregressive acoustic models with explicit phone duration modeling [14,15] and improved attention mechanisms, such as forward attention [8], stepwise monotonic attention (SMA) [9] and location-relative attention [16].
Among them, SMA [9] applied strictly stepwise and monotonic constraints to alignment paths. Its mechanism is described in Algorithm 1, where $W$, $V$, $U$, $v$ and $b$ are trainable weights, $G$ denotes the weights of the convolution kernels, $\sigma$ is the sigmoid function, $\theta$ is the Heaviside step function, and $\gamma$ is a trainable weight that controls the strength of the Gaussian noise.

Algorithm 1: Stepwise Monotonic Attention (SMA).
Input: query vector $q_t$, key vectors …

At the first decoding step, the attention weights are manually set to one for the first phone and zero for the remaining phones, since the phone sequence is monotonically aligned with the acoustic feature sequence. Starting from the second decoding step, the attention weights are calculated recursively. SMA calculates the energy value $e_{t,n}$ based on the query value $q_t$ in the decoder, the key values $K = \{k_1, k_2, \ldots, k_N\}$ in the encoder outputs and the location features $F$. Then the "selection probability" $p_{t,n}$ is computed from the energy value by a sigmoid function. To sharpen the output probability, SMA adds Gaussian noise to the energy values before feeding them into the sigmoid function. In order to achieve a monotonic and non-skipping attention mechanism, the attention weights $a_t$ are calculated from the previous attention weights $a_{t-1}$ and the "selection probability" $p_t$, as shown in Algorithm 1. SMA adopts probability-based soft alignment at the training stage, and can choose between soft and hard alignments at the synthesis stage.
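The recursion described above can be illustrated with a minimal sketch. It assumes the recursion usually given for SMA, in which each phone keeps a fraction $p_{t,n}$ of its previous weight and passes the remainder to the next phone; the function names (`sigmoid`, `sma_step`) and the use of plain Python lists are illustrative choices, and the energies are taken as given rather than computed from $q_t$, $K$ and $F$.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sma_step(prev_weights, energies, noise_scale=0.0, hard=False, rng=random):
    """One SMA update (illustrative sketch, not the reference implementation).

    prev_weights: attention weights a_{t-1} over the N phones.
    energies:     energy values e_{t,n}, one per phone (assumed precomputed).
    """
    if hard:
        # Hard mode (assumption): Heaviside step on the energy, so each phone
        # either keeps all of its weight or passes all of it on.
        stay = [1.0 if e > 0 else 0.0 for e in energies]
    else:
        # Soft mode: sigmoid of the noisy energy gives the selection probability.
        stay = [sigmoid(e + noise_scale * rng.gauss(0.0, 1.0)) for e in energies]

    new_weights = []
    for n in range(len(prev_weights)):
        kept = prev_weights[n] * stay[n]
        # Weight arriving from the previous phone; nothing arrives at phone 0.
        moved_in = prev_weights[n - 1] * (1.0 - stay[n - 1]) if n > 0 else 0.0
        new_weights.append(kept + moved_in)
    # Weight leaving the last phone is simply dropped in this sketch.
    return new_weights


# Start on the first phone, as in the first decoding step described above.
weights = [1.0, 0.0, 0.0]
weights = sma_step(weights, [0.0, 0.0, 0.0])
```

Because weight can only stay or move one position to the right, no update can move attention backward or skip a phone, which is the stepwise monotonic property.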

Proposed Methods
We propose to insert a hidden state between every two adjacent phones. To deal with these inserted hidden states, we propose a semi-stepwise monotonic attention (SSMA), as shown in Figure 2. The SSMA mechanism is a modification of the SMA mechanism that can handle phone sequences in which pause labels are missing due to the lack of explicit phrase boundaries. These hidden states are expected to absorb the decoding frames that correspond to pauses. We name it "semi-stepwise" because these hidden states can be skipped if there is no pause between two adjacent phones, and thus the alignment paths are not strictly stepwise.

At the $t$-th decoding step, $a_{t,n}$ denotes the attention weight of the $n$-th phone, and $l_{t,n}$ denotes the attention weight of the hidden state after the $n$-th phone. We have $\sum_n (a_{t,n} + l_{t,n}) = 1$. As shown in Figure 2, the attention at the $n$-th phone of the $(t-1)$-th decoding step has three choices for deriving the attention of the $t$-th decoding step: staying at the current phone, jumping forward to the next phone, or jumping to the hidden state after the current phone. Let $p^{c\to c}_{t,n}$, $p^{c\to n}_{t,n}$ and $p^{c\to h}_{t,n}$ denote the probabilities of these three choices, respectively; their sum is 1. Moreover, the attention at a hidden state of the $(t-1)$-th decoding step has two choices for deriving the attention of the $t$-th decoding step: staying at the current hidden state or jumping to the next phone. Let $p^{h\to h}_t$ and $1 - p^{h\to h}_t$ denote the probabilities of these two choices, respectively.
Similar to Algorithm 1 for SMA, the attention weights $a_t = \{a_{t,1}, a_{t,2}, \ldots, a_{t,N}\}$ and $l_t = \{l_{t,1}, l_{t,2}, \ldots, l_{t,N-1}\}$ are calculated in a recursive way. The detailed pseudo-code is shown in Algorithm 2, where $W$, $V$, $v$, $b$, $W_l$, $V_l$, $v_l$, $b_l$, $U$, $V$ and $b$ are trainable weights.
When $t = 1$, the attention weights are manually set to one for the first phone and zero for the remaining phones and hidden states. Since a phone at the $t$-th decoding step has three options for the attention at the next decoding step, a two-level prediction strategy based on location-sensitive attention [13] is adopted to calculate the probabilities of these options. Specifically, we first predict the staying probability $p^{c\to c}_t$ and then predict the proportion $p^{c\to n}_t ./ (p^{c\to n}_t + p^{c\to h}_t)$, where $./$ denotes element-wise division. In order to enhance the query effectiveness, a query matrix, whose $n$-th entry is denoted $q_{t,n}$, is calculated using a deep neural network (DNN). The DNN accepts three inputs, i.e., the query $q_t$, the keys $K$ and the location vectors $F$. The probability $p^{c\to c}_{t,n}$ is calculated by the DNN using $q_{t,n}$ and $k_n$. The proportion $p^{c\to n}_{t,n} / (p^{c\to n}_{t,n} + p^{c\to h}_{t,n})$ is calculated by the DNN using $q_{t,n}$ and $k_{n+1}$. An end-of-sentence (EOS) embedding vector is learned to replace $k_{n+1}$ when $n$ is the index of the last phone. Then, $p^{c\to n}_t$ and $p^{c\to h}_t$ can be derived from $p^{c\to c}_t$ and $p^{c\to n}_t ./ (p^{c\to n}_t + p^{c\to h}_t)$. In our preliminary experiments, we found that this strategy obtained higher naturalness of synthetic speech than using Gumbel-Softmax [17]. To calculate the staying probability on hidden states, $p^{h\to h}_t$, we simply use a DNN. This DNN accepts two inputs, i.e., the query vector $q_t$ and the hidden state embedding $k_l$. It first predicts $e^{l}_t$, and then the noise-adding trick and a sigmoid function are applied again to calculate $p^{h\to h}_t$. Finally, the attention weights $a_t$ and $l_t$ are calculated recursively and the context vector $c_t$ is updated accordingly.
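As a worked illustration of the two-level strategy, the sketch below derives the three transition probabilities of one phone from two sigmoid outputs. The two energies are taken as given (in the paper they come from the DNN), the Gaussian noise is omitted, and the function and argument names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def phone_transition_probs(stay_energy, jump_energy):
    """Return (p_stay, p_next, p_hidden) for a single phone.

    stay_energy: energy behind the staying probability p^{c->c}.
    jump_energy: energy behind the proportion p^{c->n} / (p^{c->n} + p^{c->h}).
    """
    # Level 1: probability of staying at the current phone.
    p_stay = sigmoid(stay_energy)
    # Level 2: share of the leaving mass that goes directly to the next phone.
    next_share = sigmoid(jump_energy)
    p_next = (1.0 - p_stay) * next_share
    p_hidden = (1.0 - p_stay) * (1.0 - next_share)
    return p_stay, p_next, p_hidden


p_stay, p_next, p_hidden = phone_transition_probs(1.2, -0.4)
```

Since the second level only splits the mass left over by the first, the three probabilities sum to one by construction, without needing a three-way softmax.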

Algorithm 2: Semi-Stepwise Monotonic Attention.
Input: query vector $q_t$, key vectors $K = \{k_1, k_2, \ldots, k_N\}$, hidden state embedding vector $k_l$, previous attention weights $a_{t-1}$ and $l_{t-1}$, mode $\in \{\text{hard}, \text{soft}\}$
Output: attention weights $a_t = \{a_{t,1}, a_{t,2}, \ldots, a_{t,N}\}$ and $l_t = \{l_{t,1}, l_{t,2}, \ldots, l_{t,N-1}\}$
…
        $e^{s}_{t,n} \leftarrow v^{\top} \tanh(q_{t,n} + V k_n + b)$;
        $e^{j}_{t,n} \leftarrow v^{\top} \tanh(q_{t,n} + V k_{n+1} + b)$;
        if mode = soft then
            $p^{c\to c}_{t,n} \leftarrow \sigma(e^{s}_{t,n} + \gamma \mathcal{N}(0, 1))$;
            $p^{c\to n}_{t,n} \leftarrow (1 - p^{c\to c}_{t,n})\, \sigma(e^{j}_{t,n} + \gamma \mathcal{N}(0, 1))$;
            $p^{c\to h}_{t,n} \leftarrow (1 - p^{c\to c}_{t,n})\, (1 - \sigma(e^{j}_{t,n} + \gamma \mathcal{N}(0, 1)))$;
            $p^{h\to h}_t \leftarrow \sigma(e^{l}_t + \gamma \mathcal{N}(0, 1))$;
        else if mode = hard then
            $p^{c\to c}_{t,n} \leftarrow \sigma(e^{s}_{t,n})$;
            $p^{c\to n}_{t,n} \leftarrow (1 - p^{c\to c}_{t,n})\, \sigma(e^{j}_{t,n})$;
            $p^{c\to h}_{t,n} \leftarrow (1 - p^{c\to c}_{t,n})\, (1 - \sigma(e^{j}_{t,n}))$;
            $\{p^{c\to c}_{t,n}, p^{c\to n}_{t,n}, p^{c\to h}_{t,n}\} \leftarrow \mathrm{binary}(p^{c\to c}_{t,n}, p^{c\to n}_{t,n}, p^{c\to h}_{t,n})$;
            $p^{h\to h}_t \leftarrow \theta(e^{l}_t)$;
        end
    end
    $a_t \leftarrow a_{t-1} \cdot p^{c\to c}_t + [0;\ a_{t-1,1:N-1} \cdot p^{c\to n}_{t,1:N-1}] + [0;\ (1 - p^{h\to h}_t)\, l_{t-1}]$;
    $l_t \leftarrow p^{h\to h}_t\, l_{t-1} + a_{t-1,1:N-1} \cdot p^{c\to h}_{t,1:N-1}$;  // $\{a_t; l_t\}$ are normalized.
end
$c_t \leftarrow \sum_{n=1}^{N} a_{t,n} k_n + k_l \sum_{n=1}^{N-1} l_{t,n}$;
return $a_t$, $l_t$, $c_t$;

At the training stage, $p^{c\to c}_{t,n}$, $p^{c\to n}_{t,n}$, $p^{c\to h}_{t,n}$ and $p^{h\to h}_t$ are computed in the probability-based soft mode. At the inference stage, the hard mode is adopted. In the hard mode, the maximum value among $p^{c\to c}_{t,n}$, $p^{c\to n}_{t,n}$ and $p^{c\to h}_{t,n}$ is set to 1 and the other two values are set to 0, as indicated by the binary function in Algorithm 2. $p^{h\to h}_t$ also becomes 0 or 1.
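The recursive update and the context vector of Algorithm 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the transition probabilities are passed in already computed, vectors are plain Python lists, $p^{h\to h}_t$ is treated as a single scalar shared by all hidden states, and the names `binary` and `ssma_step` are chosen here for readability.

```python
def binary(p_cc, p_cn, p_ch):
    """Hard mode: set the largest of the three probabilities to 1, the rest to 0."""
    probs = (p_cc, p_cn, p_ch)
    best = probs.index(max(probs))
    return tuple(1.0 if i == best else 0.0 for i in range(3))


def ssma_step(prev_a, prev_l, p_cc, p_cn, p_ch, p_hh, keys, hidden_key):
    """One SSMA update (sketch).

    prev_a: previous weights of the N phones.
    prev_l: previous weights of the N-1 hidden states between phones.
    p_cc, p_cn, p_ch: per-phone stay / next-phone / hidden-state probabilities.
    p_hh: probability of staying in a hidden state (scalar, by assumption).
    keys: N key vectors; hidden_key: the hidden state embedding k_l.
    """
    n_phones = len(prev_a)
    # Weight that stays on each phone.
    a = [prev_a[n] * p_cc[n] for n in range(n_phones)]
    l = [0.0] * (n_phones - 1)
    for n in range(n_phones - 1):
        # Phone n+1 receives weight from phone n and from the hidden state after it.
        a[n + 1] += prev_a[n] * p_cn[n] + (1.0 - p_hh) * prev_l[n]
        # The hidden state keeps part of its weight and receives weight from phone n.
        l[n] = p_hh * prev_l[n] + prev_a[n] * p_ch[n]

    # Joint normalization of {a_t; l_t}, as noted in Algorithm 2.
    total = sum(a) + sum(l)
    if total > 0.0:
        a = [x / total for x in a]
        l = [x / total for x in l]

    # Context vector: weighted keys plus the hidden embedding scaled by the
    # total hidden-state weight.
    hidden_mass = sum(l)
    context = [
        sum(a[n] * keys[n][d] for n in range(n_phones)) + hidden_key[d] * hidden_mass
        for d in range(len(hidden_key))
    ]
    return a, l, context


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a_t, l_t, c_t = ssma_step(
    [1.0, 0.0, 0.0], [0.0, 0.0],
    [0.5] * 3, [0.25] * 3, [0.25] * 3, 0.5,
    keys, [2.0, 2.0],
)
```

In this toy call the weight starts on the first phone; half of it stays, a quarter moves to the second phone and a quarter enters the hidden state between them, so the context vector mixes the first two keys with the hidden state embedding.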

Experimental Setup
A Chinese corpus recorded by a female speaker was used in our experiments. The scripts were selected from newspapers, and the recordings were sampled at 16 kHz with 16-bit resolution. The 12,319 utterances in total (≈17.51 h) were split into a training set of 11,608 utterances, a validation set of 611 utterances and a test set of 100 utterances. The training set was used to train acoustic models and the validation set was used to tune hyperparameters.
A publicly available implementation of Tacotron2 (https://github.com/NVIDIA/tacotron2, accessed on 12 June 2020) was used as the basis of our implementation. When training the model, 80-band mel-spectrograms were used as the acoustic features. The frame length was 64 ms and the frame shift was 15 ms. Phone sequences were adopted as model input, and the initials and finals of Mandarin Chinese were treated as phones for simplicity. A phone embedding vector, a tone embedding vector and a prosodic position embedding vector were concatenated to represent each phone. The Adam optimizer [18] was used, the number of training epochs was 200 and the training batch size was 80. The initial learning rate was 1 × 10^−3, and the learning rate was then decayed exponentially by a factor of 0.9 every 10 epochs. A WaveNet vocoder was built to reconstruct waveforms in our experiments.
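The learning-rate schedule described above can be written as a one-line function. The sketch assumes a staircase decay (the rate changes only at multiples of 10 epochs) and zero-based epoch indices; the text does not say whether the decay was applied stepwise or continuously, and the function name is illustrative.

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.9, step_epochs=10):
    """Learning rate at a zero-based epoch index under staircase decay (assumed)."""
    return initial_lr * decay ** (epoch // step_epochs)


# Rates at the start of training, after the first decay, and in the last epoch.
schedule = [learning_rate(e) for e in (0, 10, 199)]
```

Under these assumptions the rate is multiplied by 0.9 nineteen times over the 200 epochs, ending at roughly 13.5% of its initial value.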

SMA-PB
The attention mechanism was stepwise monotonic attention (SMA) [9]. In both training and synthesis stages, the phrase boundaries of texts were not available. The initial bias $b$ was 3.5 and the noise scale $\gamma$ was 2.0. The soft mode was used at the training stage, while the hard mode was used at the synthesis stage.

SSMA-PB
This model adopted SSMA instead of SMA; the rest of the model structure and the hyperparameters were the same as in SMA-PB. In both training and synthesis stages, this model used the same data as SMA-PB. The initial biases $b$ and $b_l$ were 3.5, and the noise scale $\gamma$ was 2.0. The soft mode was used at the training stage, while the hard mode was used at the synthesis stage.

SMA+PB
This model had the same structure and hyperparameters as SMA-PB. In both training and synthesis stages, this model used the texts with manually labeled phrase boundaries.

Objective Evaluation
The objective performance of SMA-PB, SSMA-PB, and SMA+PB in predicting the acoustic features of the test sentences was evaluated. The metrics included mel-cepstral distortion (MCD), F0 root mean square error (RMSE), F0 correlation (CORR), and unvoiced/voiced (UV) error percentage. The frame-level MCD was calculated from $c_r$ and $c_s$, the mel-cepstral coefficients (MCCs) of natural and synthetic speech, respectively, where $M$ is their order. The frame-level F0 RMSE was calculated from $F_r$ and $F_s$, the F0 values extracted from natural and synthetic speech, respectively. The F0 CORR was defined as the correlation coefficient between the F0 values of synthetic and natural speech in the voiced segments. The UV error percentage was the ratio of the number of frames with mismatched U/V decisions between natural and synthetic speech to the total number of frames. For calculating the four metrics, twelve-dimensional MCCs and F0 values were extracted from synthetic speech at a 5 ms frame shift by STRAIGHT [19] analysis. The FastDTW algorithm [20] based on MCCs was adopted to align the predicted acoustic features with the reference ones.

The results are shown in Table 1. From this table, we can see that SMA+PB achieved the best accuracy of acoustic feature prediction. This is reasonable since it utilized manually labeled phrase boundaries in both training and synthesis stages. Comparing SMA-PB with SMA+PB, it can be found that the objective performance of SMA degraded significantly when phrase boundaries were not available. Compared with SMA-PB, SSMA-PB achieved a smaller MCD and comparable F0 distortion, possibly because the MCD is more closely related to the pause segments in utterances.
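The four metrics can be sketched as below. The sketch uses the commonly cited definitions, which may differ in detail from the formulas used in the paper: MCD as $\frac{10}{\ln 10}\sqrt{2\sum_{m=1}^{M}(c_r^m - c_s^m)^2}$, RMSE and Pearson correlation over frames voiced in both signals, and the UV error as the fraction of frames with mismatched voicing. It also assumes that the two sequences are already aligned frame by frame and that an F0 value of 0 marks an unvoiced frame; all function names are hypothetical.

```python
import math


def mcd_frame(c_ref, c_syn):
    """Mel-cepstral distortion (dB) between two MCC vectors of the same order."""
    squared = sum((r - s) ** 2 for r, s in zip(c_ref, c_syn))
    return 10.0 / math.log(10.0) * math.sqrt(2.0 * squared)


def f0_metrics(f0_ref, f0_syn):
    """Return (RMSE, CORR, UV error ratio) for two aligned F0 tracks.

    Assumption: 0 marks an unvoiced frame in both tracks.
    """
    pairs = list(zip(f0_ref, f0_syn))
    # UV error: frames where exactly one of the two tracks is voiced.
    uv_error = sum((r > 0) != (s > 0) for r, s in pairs) / len(pairs)

    # RMSE and correlation use only frames voiced in both tracks.
    voiced = [(r, s) for r, s in pairs if r > 0 and s > 0]
    rmse = math.sqrt(sum((r - s) ** 2 for r, s in voiced) / len(voiced))

    mean_r = sum(r for r, _ in voiced) / len(voiced)
    mean_s = sum(s for _, s in voiced) / len(voiced)
    cov = sum((r - mean_r) * (s - mean_s) for r, s in voiced)
    var_r = sum((r - mean_r) ** 2 for r, _ in voiced)
    var_s = sum((s - mean_s) ** 2 for _, s in voiced)
    corr = cov / math.sqrt(var_r * var_s)
    return rmse, corr, uv_error


rmse, corr, uv_error = f0_metrics([100.0, 200.0, 0.0, 300.0], [110.0, 190.0, 50.0, 0.0])
```

In the toy call, two of the four frames disagree on voicing, and the two frames voiced in both tracks each differ by 10 Hz.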

Subjective Evaluation
Twenty sentences with at least one phrase boundary were randomly selected from the test set. These phrase boundaries were not used when synthesizing the sentences with SMA-PB and SSMA-PB. The utterances synthesized using the SMA-PB, SSMA-PB, and SMA+PB systems were compared on naturalness in two groups of AB preference tests. In each test, the synthetic utterances of two systems were evaluated in random order by 11 native listeners. The listeners were asked to judge which utterance in each pair sounded more natural, or to indicate no preference. The average preference scores are shown in Table 2. Each row of Table 2 compares two systems and indicates whether the difference between them is significant. The percentage of the better system is shown in bold. From this table, we can see that the SSMA-PB system outperformed the SMA-PB system significantly, with p < 0.001. This result indicates that using SSMA helped Tacotron2 to synthesize speech with better naturalness when phrase boundaries were not given in either the training or the synthesis stage. On the other hand, the subjective performance of SSMA-PB was still not as good as that of SMA+PB, which utilized manually labeled phrase boundaries (p < 0.001).

Discussions
To explore the interpretability of our proposed SSMA method, two experiments were conducted to evaluate the consistency between the pause positions derived from SSMA-based alignment and manually labeled phrase boundaries.

Predicting Phrase Boundaries from Texts
This experiment evaluated the accuracy of predicting phrase boundaries from texts using the pause positions determined by SSMA at the inference stage. The sentences in the test set were synthesized by SSMA-PB in hard mode. If the hidden state between two adjacent phones was assigned more than one frame during decoding, a phrase boundary was predicted between these two phones. The hidden states adjacent to silence phones were not considered. The evaluation metrics were the precision, recall, and F1 score of the predicted phrase boundaries. The results are shown in the first row of Table 3. We can see that SSMA-PB achieved a recall of 73.61%, which means that most true phrase boundaries were found by SSMA-based decoding. However, its precision was much lower. One possible reason is that some short pauses determined by SSMA do not correspond to true phrase boundaries. It should be noted that SSMA-PB predicted phrase boundaries in an unsupervised way, i.e., no phrase boundary annotations were utilized at the training stage.
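The boundary rule and the evaluation metrics described above can be sketched as follows. The representation of a hard alignment path as a list of `(kind, index)` pairs, one per decoding frame, is an assumption made for this illustration, as are the function names and the `excluded` argument used to skip hidden states next to silence phones.

```python
from collections import Counter


def predict_boundaries(path, min_frames=2, excluded=frozenset()):
    """Positions whose hidden state received more than one frame.

    path: one (kind, index) pair per decoding frame, where kind is "phone" or
          "hidden" and a hidden index n means the state after the n-th phone.
    excluded: hidden-state indices to ignore, e.g. those adjacent to silence.
    """
    frames = Counter(index for kind, index in path if kind == "hidden")
    return {
        index for index, count in frames.items()
        if count >= min_frames and index not in excluded
    }


def precision_recall_f1(predicted, reference):
    """Precision, recall and F1 of predicted boundary positions."""
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


path = (
    [("phone", 0)] * 2 + [("hidden", 0)] * 2 + [("phone", 1)] + [("hidden", 1)]
    + [("phone", 2)] * 2 + [("hidden", 2)] * 3 + [("phone", 3)]
)
predicted = predict_boundaries(path)
scores = precision_recall_f1(predicted, {2, 5})
```

In this toy path the hidden state after phone 1 is visited for a single frame only, so no boundary is predicted there, which mirrors the "more than one frame" condition in the text.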

Annotating Phrase Boundaries by Forced Alignment
This experiment evaluated the accuracy of annotating phrase boundaries by SSMA-based forced alignment when both texts and recordings were given. In this case, the decoder was run with teacher forcing, which means that the true history of the mel-spectrogram was used as input at each decoding step. After the alignment paths were calculated, the phrase boundaries were annotated following the conditions used in the previous experiment. The results of phrase boundary annotation are shown in the second row of Table 3, where the same metrics as in the previous experiment were employed. We can see that SSMA-based phrase boundary annotation achieved higher precision, recall, and F1 score than SSMA-based phrase boundary prediction. This is reasonable because the former utilized both textual and acoustic information. The F1 score of SSMA-PB on annotating phrase boundaries was 56.22%, which shows that, without supervision, the learned hidden state positions in SSMA did correlate strongly with the manually labeled phrase boundaries.

Conclusions
This paper has proposed a semi-stepwise monotonic attention (SSMA) method to improve the performance of sequence-to-sequence (Seq2Seq) speech synthesis when phrase boundaries are not available in either the training or the synthesis stage. In this method, hidden states are added between adjacent phones to model the possible pauses between them. Thus, the attention to a phone at each decoding step has three options for the next decoding step: moving forward to the next phone, staying at the same phone, or jumping to a hidden state. An algorithm was then designed to calculate the attention weights of SSMA in a recursive way. Experimental results show that SSMA achieved better subjective performance than SMA when phrase boundaries were not available, which makes it well suited to low-resource languages that lack phrase boundary annotations. Our future work will aim to improve the accuracy of SSMA-based unsupervised phrase boundary annotation and to evaluate the proposed method on datasets of more languages.