Article

Semi-Supervised Learning for Robust Emotional Speech Synthesis with Limited Data

Xinjiang Multilingual Information Technology Laboratory, Xinjiang Multilingual Information Technology Research Center, College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(9), 5724; https://doi.org/10.3390/app13095724
Submission received: 6 April 2023 / Revised: 29 April 2023 / Accepted: 4 May 2023 / Published: 6 May 2023

Abstract

Emotional speech synthesis is an important branch of human–computer interaction technology that aims to generate emotionally expressive and intelligible speech from input text. With the rapid development of deep-learning-based speech synthesis, emotional speech synthesis has gradually attracted research attention. However, because high-quality emotional speech corpora are scarce, emotional speech synthesis under low-resource conditions is prone to overfitting, exposure bias, catastrophic forgetting, and other problems that degrade the generated speech. In this paper, we propose an emotional speech synthesis method that integrates transfer learning, semi-supervised training, and a robust attention mechanism to better adapt to the emotional style of the speech data during fine-tuning. By adopting an appropriate fine-tuning strategy, a suitable trade-off parameter configuration, and pseudo-labels in the form of loss functions, we efficiently guide regularized learning of emotional speech synthesis. The proposed SMAL-ET2 method outperforms the baseline methods in both subjective and objective evaluations. The results demonstrate that our training strategy, with stepwise monotonic attention and a semi-supervised loss, can alleviate overfitting and improve the generalization ability of the text-to-speech model. Our method also enables the model to synthesize different categories of emotional speech with better naturalness and emotion similarity.

1. Introduction

Speech is a fundamental form of human communication. Text-to-speech (TTS), a core technology in human–computer interaction, aims to convert text into intelligible speech by machine processing. The landmark WaveNet model [1] brought TTS into the deep learning era, and current neural TTS systems can produce fairly natural speech when trained on large amounts of paired text and speech data; classical examples include Tacotron [2] and Tacotron2 [3] from Google's DeepMind team, DeepVoice 1, 2, and 3 [4,5,6] from Baidu, and the parallel models FastSpeech [7] and FastSpeech 2/2s [8] from Microsoft Research Asia. However, these models have focused on synthesizing smooth, neutral speech rather than speech with clear emotion and rich prosodic expression. In practice, collecting annotated emotional datasets is difficult and expensive, yet mature deep learning models cannot be trained without large amounts of data. Effective emotional speech synthesis can improve the performance of many related applications, such as voice navigation and voiceover tools in apps like TikTok. More expressive emotional speech that is closer to a real human voice will greatly increase users' willingness to interact.
Emotional speech data from real application scenarios are insufficient to train a neural end-to-end speech synthesis model from scratch, and many studies have shown that transfer learning can reduce the amount of data required for the target task. This idea was also validated in the emotional speech synthesis study by Noé Tits et al. [9], where the synthesized speech was objectively assessed with an emotional speech recognition model. However, because only a sparse amount of emotional speech data is available for fine-tuning, the model suffers from overfitting and unstable sound quality after its parameters are updated. The model is affected by the non-linguistic prosodic features of emotional speech, and synthesis of test texts outside the training set (out-of-domain) is unsatisfactory because the attention module does not adapt well to the new speaking style of emotional speech. The mean opinion score (MOS) of the synthesized emotional speech is below 3, which is clearly not good enough for listening, so the sound quality still needs to be improved. Zhou K et al. [10] also conducted research under limited emotional speech data conditions: they used the GST [11] model as a style encoder, first pre-trained it to maturity on a multi-speaker dataset, then fine-tuned it as an emotion encoder with a small amount of emotional speech, and achieved emotional voice conversion under low-resource conditions through these two training stages. Their method allows the TTS model to learn actual emotional information.
To address the above issues, we fine-tune a mature, pre-trained neutral Tacotron2 model with an emotional speech dataset that does not require manual screening. To enhance the alignment learning of the sequence-to-sequence (Seq2Seq) model, the original location-sensitive attention (LSA) mechanism is replaced by stepwise monotonic attention (SMA) during training, and the pre-trained backbone network is used to provide pseudo-labels. A pre-trained HiFi-GAN [12] vocoder is finally used to synthesize the target emotional speech.
The main contributions of this paper include three aspects.
(1)
An emotional speech synthesis method based on transfer learning and a pseudo-label reference loss is proposed, in which the model adaptively learns to generate emotionally perceptible speech from limited emotional speech data: it is first pre-trained on neutral speech data and then fine-tuned on emotional speech data that contains no non-verbal expressions.
(2)
Based on the temporal characteristics of Seq2Seq alignment learning, the attention mechanism of the initial emotional speech synthesis framework is modified with a monotonic constraint that strengthens the alignment process during training, allowing more stable prosody and more emotional expressiveness in the synthesized speech.
(3)
To alleviate overfitting under low-resource conditions, we analyze the parameter updates of the different modules, adjust the training strategy used when fine-tuning the model, and add a pseudo-label loss to achieve a regularization effect, finally synthesizing emotional speech that surpasses the baseline methods in naturalness, robustness, and emotional expressiveness.

2. Related Work

Transfer learning [13] for low-resource problems usually uses knowledge learned on a source task with abundant training data to improve learning on the target task, as shown schematically in Figure 1. This idea has also been successfully applied to the problem of limited target-task data in the speech domain. In research that uses a neural automatic speech recognition (ASR) system as a feature extractor for emotion recognition, the article [14] shows that the speech-to-text mapping learned by the ASR system and transferred through knowledge transfer methods contains information useful for emotion recognition.
The study [15] successfully transferred knowledge from a model trained for speaker classification to a multi-speaker TTS model, and a later study transferred knowledge from speaker verification to multi-speaker speech synthesis with a feedback constraint [16]. For speech synthesis in low-resource scenarios, many recent approaches have achieved good results in speaker adaptation [17,18,19,20] and in cross-speaker style transfer for multi-language TTS [21]. Transfer learning can also help synthesize speech for which training data are limited, such as Lombard speech [22], Sanskrit [23], Russian [24], and even emotional Mongolian [25].
The knowledge distillation (KD) approach, originally proposed by Hinton et al. [26], transfers knowledge from a large teacher network to a smaller student network. It trains the student to predict the target labels and to mimic the classification probabilities of the teacher, since these probabilities carry additional information about how the teacher generalizes, allowing the student to obtain useful and relevant information from the teacher model. In a traditional teacher-forced training model, the input to the decoder at each time step $t$ is the ground-truth output of the previous time step, $y_{t-1}$, i.e., the previous spectrogram frame. Feeding this ideal input to the decoder during training causes an exposure bias problem, because during testing the noisy output $\hat{y}_{t-1}$ generated at the previous time step is used instead; this mismatch between training and inference leads to unpredictable errors at inference time that accumulate frame by frame along the time axis. Inspired by [27], Lee et al. [28] added noise during training by feeding the average of the ground truth $y_{t-1}$ and the prediction $\hat{y}_{t-1}$, a method called semi-teacher-forced training (STFT). Exploiting the privileged information about the similarity of class distributions between the teacher and student models, Liu Rui et al. [29], researchers from Inner Mongolia University, successfully applied this idea to the autoregressive model by introducing a distillation loss function in addition to the feature loss function.
Pseudo-labeling is a semi-supervised paradigm for learning from both unlabeled and labeled data, usually by taking the class with the highest predicted probability as the pseudo-label, which may not be the true target class [30]. Pseudo-labeling alleviates the need for hand-crafted labels. In the training phase, pseudo-labels and true labels are used together to train the new model in supervised mode. For unlabeled data, the pseudo-labels are recalculated after each weight update and used to supervise the model with the same loss function. Since the amounts of the different kinds of data can vary greatly, balancing them is important for the performance of the final model. Higuchi et al. [31] experimented with pseudo-labels for ASR and showed that recognition improves when untranscribed audio is exploited.
Semi-supervised training methods can genuinely improve data efficiency in end-to-end speech synthesis [32]. Although TTS is generally a supervised task, a study by the Ping An Technology team [33] showed that a semi-supervised learning method based on generated pseudo-labels can exploit far more data and greatly reduce training costs. This semi-supervised scheme, which uses pseudo-labels to further guide acoustic model learning, significantly improves the speech quality on test data and achieves natural and robust speech synthesis under limited emotional speech data conditions. The approach was inspired by research in image processing, where Xie Q. et al. [34] pre-trained an EfficientNet model on labeled ImageNet images, used it as a teacher to generate pseudo-labels on 300 million unlabeled images, and then combined the labeled and pseudo-labeled images to self-train the student model.
Seq2Seq models with attention mechanisms are effective at modeling long-term dependencies at different temporal levels, such as words, phrases, and discourse [10]. By learning attention alignments, a sequence-to-sequence model can capture prosodic dynamics in an utterance and predict speech duration, a key prosodic factor affecting perceived emotion: duration is longer in sad, relaxed, and tired states and shorter and faster in surprised and joyful emotions [35]. The attention mechanism is formulated in Equations (1)–(3):
$e_{i,j} = \mathrm{Attention}\left(h_{i-1}, x_j\right)$, (1)
$\alpha_{i,j} = \dfrac{\exp\left(e_{i,j}\right)}{\sum_{k=1}^{n}\exp\left(e_{i,k}\right)} = \mathrm{softmax}\left(e_{i,:}\right)_j$, (2)
$c_i = \sum_{j=1}^{n}\alpha_{i,j}\, x_j$. (3)
Here the encoder outputs $x = [x_1, x_2, \dots, x_n]$ serve as the memory, and at each time step $i$ the trainable attention mechanism evaluates an energy value $e_{i,j}$ for each $x_j$ from the previous decoder hidden state $h_{i-1}$. The energy values are then normalized to obtain the alignment vector $\alpha_i$, which is used to generate the context vector $c_i$.
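As a minimal illustration of Equations (1)–(3), the PyTorch sketch below computes additive attention energies from the previous decoder state and the encoder memory, normalizes them with a softmax, and forms the context vector. The additive energy function and the layer sizes are illustrative assumptions and do not reproduce the exact Tacotron2 attention implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Toy additive attention following Eqs. (1)-(3): energy -> softmax -> context."""
    def __init__(self, enc_dim=512, dec_dim=1024, attn_dim=128):
        super().__init__()
        self.query_layer = nn.Linear(dec_dim, attn_dim, bias=False)   # projects h_{i-1}
        self.memory_layer = nn.Linear(enc_dim, attn_dim, bias=False)  # projects x_j
        self.v = nn.Linear(attn_dim, 1, bias=False)                   # scalar energy e_{i,j}

    def forward(self, h_prev, memory):
        # h_prev: (B, dec_dim), memory: (B, n, enc_dim)
        energies = self.v(torch.tanh(
            self.query_layer(h_prev).unsqueeze(1) + self.memory_layer(memory)
        )).squeeze(-1)                        # (B, n)  -> e_{i,j}
        alpha = F.softmax(energies, dim=-1)   # (B, n)  -> alignment alpha_{i,j}
        context = torch.bmm(alpha.unsqueeze(1), memory).squeeze(1)  # (B, enc_dim) -> c_i
        return context, alpha

# usage: one decoder step over a batch of 2 utterances with 13 encoder frames
attn = AdditiveAttention()
context, alpha = attn(torch.randn(2, 1024), torch.randn(2, 13, 512))
print(context.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 13])
```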
In order to obtain better performance, many variants of the attention mechanism have been applied to speech synthesis. The guided attention (GA) mechanism proposed with the DCTTS [36] model accelerates convergence by penalizing non-diagonal alignments, while Tacotron2 uses location-sensitive attention (LSA). LSA feeds the alignment information of previous steps into the current alignment inference to reduce repetitions and omissions, implicitly encouraging monotonicity and completeness. Because phoneme sequences and acoustic feature sequences are monotonically aligned, the forward attention mechanism [37] (FA), the monotonic attention mechanism [6] (MA), and the dynamic convolutional attention mechanism [38] (DCA) have successively been applied to Seq2Seq TTS models with good results. To further ensure completeness, the stepwise monotonic attention mechanism [39] (SMA) adds a restriction to MA: the hard-aligned position can move at most one step at each decoding step, and the alignment vector is recomputed by Equations (4)–(6):
$p_{i,j} = \mathrm{sigmoid}\left(e_{i,j}\right)$, (4)
$z_{i,j} \sim \mathrm{Bernoulli}\left(p_{i,j}\right)$, (5)
$\alpha_{i,j} = \alpha_{i-1,j-1}\left(1 - p_{i,j-1}\right) + \alpha_{i-1,j}\, p_{i,j}$. (6)
Here $p_{i,j}$ is the "selection probability": according to Equation (6), it is the probability that the alignment stays at position $j$, while $1 - p_{i,j}$ is the probability of moving one step forward, and $z_{i,j}$ is the corresponding Bernoulli sample. The experimental results of that study show that the quality of speech synthesized by a TTS model with SMA is better than that of baselines using LSA, MA, FA, GA, etc.
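The sketch below illustrates the expected (soft) form of the SMA recurrence in Equation (6): it only propagates the alignment distribution given the selection probabilities, so it is a simplified view rather than the full training-time implementation used in the SMA paper.

```python
import torch
import torch.nn.functional as F

def stepwise_monotonic_alignment(p, alpha_prev):
    """Soft SMA update of Eq. (6):
    alpha_{i,j} = alpha_{i-1,j-1} (1 - p_{i,j-1}) + alpha_{i-1,j} p_{i,j}.

    p:          (B, n) selection probabilities p_{i,j} = sigmoid(e_{i,j})
    alpha_prev: (B, n) previous alignment alpha_{i-1,:}
    """
    move = F.pad(alpha_prev[:, :-1] * (1.0 - p[:, :-1]), (1, 0))  # mass moving from j-1 to j
    stay = alpha_prev * p                                         # mass staying at position j
    return move + stay

# usage: the alignment starts with all probability mass on the first encoder position
alpha = torch.zeros(2, 13)
alpha[:, 0] = 1.0
alpha = stepwise_monotonic_alignment(torch.sigmoid(torch.randn(2, 13)), alpha)
print(alpha.sum(dim=-1))  # sums to 1 here; mass can only leak past the last position
```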

3. Proposed Methods

In this section, we focus on the structure of the underlying model used for emotional speech synthesis under low-resource conditions. The training process is described in Algorithm 1.
Algorithm 1 Semi-supervised Training Algorithm of SMAL-ET2
Input: Datasets: LJ-Speech, ESD, BC2013
{SMAL-ET2} ← initialization with random weights
Output: The trained Proposed Emotional Speech Synthesis Model
Pre-training:
    1: Pre-train initial SMAL-ET2 model with LJ-Speech as backbone network
    2: Pre-train backbone network with BC2013 as reference model to provide pseudo labels
    3: Pre-train the HiFi-GAN model with LJ-Speech as vocoder
Fine-tuning:
    1: Fine-tune the backbone network with an ESD subset, while fixing the parameters of the decoder's post-net and the HiFi-GAN vocoder
    2: Update the weights and recalculate the final loss with the pseudo-labels
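To make the flow of Algorithm 1 concrete, the sketch below outlines one possible shape of the fine-tuning stage in PyTorch. The model classes and the data loader are placeholders, not the released implementation, and the mean squared errors here average over all spectrogram elements, a common simplification of Equations (7)–(9) in Section 3.2.

```python
import torch

def fine_tune(backbone, reference, loader, optimizer, theta=0.1, epochs=1000):
    """Sketch of the fine-tuning stage of Algorithm 1: the frozen reference model,
    pre-trained on BC2013, supplies pseudo-label Mel-spectrograms."""
    reference.eval()                                   # pseudo-label provider; weights stay fixed
    for _ in range(epochs):
        for text, mel_gt in loader:                    # one emotional subset of ESD
            mel_fine = backbone(text)                  # prediction of the model being fine-tuned
            with torch.no_grad():
                mel_ref = reference(text)              # pseudo-label Mel-spectrogram
            loss_main = torch.mean((mel_fine - mel_gt) ** 2)   # Eq. (7): match the emotional target
            loss_add = torch.mean((mel_fine - mel_ref) ** 2)   # Eq. (8): stay close to the reference
            loss = loss_main + theta * loss_add                # Eq. (9): trade-off parameter theta
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return backbone
```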

3.1. Emotional Speech Synthesis Method

The TTS model used in the proposed emotional speech synthesis method is SMAL-ET2, a modified Tacotron2. Like Tacotron2, it consists of two main parts: a Seq2Seq feature prediction network and a vocoder that converts the Mel-spectrogram into a time-domain waveform. When the LSA module of the original model is used in emotional speech experiments, the attention mechanism suffers from various collapses, skips, repetitions, and other alignment errors for unseen input contexts during initial training, which limits the wider application of the model. Therefore, in order to adapt the model to the new emotional style during fine-tuning, SMA replaces the original LSA module so that the attention module is better suited to the new emotional speaking style. Moreover, the HiFi-GAN vocoder is more efficient and has higher fidelity than WaveNet, with a better MOS score. The overall network structure of SMAL-ET2, the basic TTS model for emotional speech synthesis under low-resource conditions, is shown in Figure 2.

3.2. Training Process

The model was pre-trained on the LJ-Speech dataset (https://keithito.com/LJ-Speech-Dataset/, accessed on 7 January 2022), which contains nearly 24 h of speech from a single female speaker and the corresponding text. Another expressive dataset without emotion annotations, the audio dataset of the Blizzard Challenge 2013 (https://www.synsig.org/index.php/Blizzard_Challenge_2013, accessed on 15 February 2022), named BC2013, has a total speech duration of 16.8 h and contains 8248 utterances; the reference model is pre-trained on it so that it adapts to expressive speech and can provide the pseudo-label loss during fine-tuning. Before being fine-tuned with the emotion subsets of the ESD (Emotional Speech Dataset) (https://github.com/HLTSingapore/Emotional-Speech-Dataset/, accessed on 16 February 2022), the model is first fine-tuned with the neutral data of the target speaker to adapt it to the new speaker's style, and then fine-tuned with the happy, sad, surprise, and angry subsets, respectively.
When fine-tuning the model with a single emotional speech dataset, the limited target data leads to overfitting: after only three thousand iterations the model fails to produce intelligible speech and the audio is completely noisy, while even after nearly 40,000 iterations it still fails to pronounce utterances outside the training set correctly. To alleviate overfitting, the approach in this paper uses the pre-trained model as the initial reference model and further guides the training of the fine-tuned model with the pseudo-labels it generates. The main purpose is to introduce a reference loss function that recalculates the mean squared error (MSE) of the model's predicted output, achieving an MSE regularization effect. The main loss is given by Equation (7), the additional loss by Equation (8), and the total training loss by Equation (9):
$L_{main} = \dfrac{1}{N}\sum_{k=1}^{K}\left\| M_f^k - M_g^k \right\|^2$, (7)
$L_{add} = \dfrac{1}{N}\sum_{k=1}^{K}\left\| M_f^k - M_r^k \right\|^2$, (8)
$L_{final} = L_{main} + \theta L_{add}$. (9)
The main loss $L_{main}$ is the MSE between the ground-truth (GT) Mel-spectrogram $M_g^k$ and the Mel-spectrogram $M_f^k$ predicted by the fine-tuned model, where $N$ denotes the number of samples. The additional loss $L_{add}$ is the MSE between the Mel-spectrogram $M_r^k$ predicted by the initial reference model and $M_f^k$ predicted by the fine-tuned model. The final total loss combines the two with the trade-off parameter $\theta$. If $\theta$ is too large, the fine-tuned model focuses too much on the output of the initial reference model; if it is too small, the ability learned from the source data is lost. The purpose of the additional reference loss is to preserve a certain degree of generalization in the fine-tuned model, while the main loss maintains the emotional style of the target speech. Experiments comparing different values of the trade-off parameter determined that a final value of 0.1 achieved the best synthesis results; a comparison of experiments with different values is given in Section 4.
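The combined loss of Equations (7)–(9) can be packaged as a small PyTorch function, as sketched below. It assumes the three Mel-spectrogram tensors have the same shape and are already aligned in time, and it uses the framework's mean squared error, which averages over all elements rather than only over samples.

```python
import torch.nn.functional as F

def smal_et2_loss(mel_fine, mel_gt, mel_ref, theta=0.1):
    """Combined fine-tuning loss of Eqs. (7)-(9).

    mel_fine: Mel-spectrogram predicted by the fine-tuned model (M_f)
    mel_gt:   ground-truth Mel-spectrogram                      (M_g)
    mel_ref:  pseudo-label predicted by the reference model     (M_r)
    """
    loss_main = F.mse_loss(mel_fine, mel_gt)   # Eq. (7): stays close to the emotional target
    loss_add = F.mse_loss(mel_fine, mel_ref)   # Eq. (8): regularizes toward the reference model
    return loss_main + theta * loss_add        # Eq. (9): theta = 0.1 worked best in the experiments
```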
In addition, to address the near "catastrophic forgetting" that makes out-of-domain text difficult to pronounce during testing, this paper selectively slows down the updating of weights that are important for feature-alignment learning, so that knowledge learned from the old task is remembered. Analysis of the fine-tuning process shows that the post-processing network only refines the predicted spectrogram frames, so this layer and the vocoder are kept partially frozen with fixed weights, while the text embedding and the optimizer weights are ignored, since the textual information is independent of the speaker's emotional style; only the speech-related attributes of the pre-trained model are eventually fine-tuned. The model is first fine-tuned on the neutral dataset to adapt it to the tone of the new emotional dataset's speaker, and then on each emotional data subset to learn to synthesize the corresponding emotional speech.
The overall training process of the low-resource emotional speech synthesis method based on the SMAL-ET2 model is shown in Figure 3.
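The partial-freezing strategy described above can be realized in PyTorch roughly as follows. The attribute name model.postnet follows the layout of the NVIDIA Tacotron2 implementation and is an assumption, and the exact set of frozen and ignored modules is a simplification of the strategy in the text.

```python
import torch

def freeze_for_fine_tuning(model, vocoder):
    """Partial-freezing sketch: fix the post-net and the HiFi-GAN vocoder,
    and hand only the remaining parameters to the optimizer."""
    for p in model.postnet.parameters():      # post-net only refines predicted spectrogram frames
        p.requires_grad = False
    for p in vocoder.parameters():            # pre-trained vocoder stays untouched
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

# usage: build the fine-tuning optimizer over the unfrozen parameters only
# trainable = freeze_for_fine_tuning(backbone, hifigan)
# optimizer = torch.optim.Adam(trainable, lr=2e-5, weight_decay=1e-7)
```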

4. Experiment and Results

In this paper, the effectiveness of the synthesized emotional speech is evaluated and analyzed in two dimensions: subjective evaluation and objective evaluation. The baseline methods compared are fine-tuning the original Tacotron2 model and the GST [11] model conditioned on emotional reference speech, both using the ESD dataset to synthesize emotional speech. These two baselines are hereafter referred to as ETaco2 and EGST, respectively.

4.1. Dataset and Preprocessing

The type of emotion, vocabulary, number of speakers, and recording quality of an emotion corpus are all significant for emotional speech synthesis research. The emotion corpus used in this work is ESD [40], the first parallel emotional dataset, open-sourced in 2021 by the HLT laboratory (http://ece.nus.edu.sg/hlt/, accessed on 16 February 2022) at the National University of Singapore. The dataset consists of 350 parallel utterances with an average duration of 2.9 s, recorded by 10 native English speakers and 10 Mandarin speakers (5 male and 5 female per language) at a sampling frequency of 16 kHz, and covers five emotions: happy, sad, neutral, angry, and surprise. The linguistic team of Anton et al. [41] verified the English subset of ESD and concluded that its quality is fairly high, with almost no mismatches between text and audio pairs, so it has no significant negative effect on the quality of the trained TTS system. Appropriate preprocessing before training speeds up convergence and leads to better synthesis results. To remain consistent with the pre-training speech data, the audio used in the experiments is resampled to 22,050 Hz and quantized at 16 bits. To ensure a strict linear correspondence in time between the character sequence and the audio file, silence at the beginning and end of each utterance, as well as pauses longer than half a second within a sentence, is checked and trimmed as necessary; the top_db parameter is changed from the default 60 to 20, and an end-of-sentence tag (EOS) is added to speed up model convergence.
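The preprocessing described above might look roughly like the librosa-based sketch below; it covers resampling, endpoint trimming with top_db = 20, and appending an assumed EOS symbol, but not the removal of long pauses inside a sentence. The file name and EOS symbol are hypothetical.

```python
import librosa

EOS = "~"  # assumed end-of-sentence symbol appended to every transcript

def preprocess(wav_path, text, target_sr=22050, top_db=20):
    """Resample, trim leading/trailing silence with top_db = 20 (librosa default is 60),
    and append the EOS tag to the transcript."""
    wav, _ = librosa.load(wav_path, sr=target_sr)      # resample to 22,050 Hz
    wav, _ = librosa.effects.trim(wav, top_db=top_db)  # trim silence at both ends
    return wav, text.strip() + EOS

# usage (hypothetical ESD-style file name):
# wav, text = preprocess("0019_000351.wav", "The ship sailed at dawn.")
```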
All experiments in this paper were conducted on a high-performance GPU server with two 12 GB RTX 2080 Ti cards and one RTX 3090 Ti card; all implementation code is written in Python 3.7 on Ubuntu 18.04, and the deep learning framework is PyTorch 1.1.0 (https://pytorch.org, accessed on 9 March 2021). The experiments use the English LJ-Speech dataset and the English subset of ESD. Because LJ-Speech contains a female speaker, the experiments choose the emotion data of the female ESD speaker '0019', which has the best relative emotional expressiveness and contains the five common emotion categories. Therefore, five types of emotional data, namely neutral, happy, angry, surprised, and sad, were chosen for the experiments. The division of the dataset is shown in Table 1.

4.2. Experimental Setup

In this paper, we use the acoustic model of Tacotron2 (https://github.com/NVIDIA/tacotron2, accessed on 24 February 2020) and the HiFi-GAN vocoder (https://github.com/jik876/hifi-gan, accessed on 3 January 2023) as the basis for the experiments. The encoder is a 3-layer one-dimensional convolutional network followed by a single-layer bidirectional long short-term memory network (Bi-LSTM); each convolutional layer contains 512 kernels of size 5 × 1, and the character embedding layer has 512 dimensions. In the decoder, the attention module dimension is set to 128, and the pre-net contains two fully connected layers, each of size 256 × 256. The post-net refines the Mel-spectrogram with 5 convolutional layers whose kernels are of size 31 × 1. The frame length is 50 ms, the frame shift is 12.5 ms, and a Hamming window is used as the window function.
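For illustration, the encoder configuration described above can be sketched as follows; dropout, input masking, and other Tacotron2 details are omitted, so this is a simplified reading of the setup rather than the exact experimental code.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Encoder described above: 3 x (Conv1d, 512 kernels of width 5) + a single Bi-LSTM."""
    def __init__(self, emb_dim=512, channels=512, kernel_size=5):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim if i == 0 else channels, channels,
                          kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(channels),   # batch normalization, as used during training
                nn.ReLU(),
            )
            for i in range(3)
        ])
        # bidirectional LSTM with 256 units per direction -> 512-dim encoder outputs
        self.lstm = nn.LSTM(channels, channels // 2, batch_first=True, bidirectional=True)

    def forward(self, char_emb):             # char_emb: (B, T, 512) character embeddings
        x = char_emb.transpose(1, 2)         # Conv1d expects (B, C, T)
        for conv in self.convs:
            x = conv(x)
        outputs, _ = self.lstm(x.transpose(1, 2))
        return outputs                       # (B, T, 512) encoder memory
```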
Moreover, batch normalization is applied during training to avoid possible vanishing gradients. The vocoder is trained on the same large English dataset as Tacotron2, which reduces the subsequent computational burden. The outputs predicted by the Tacotron2-based back-end acoustic models are fine-tuned, and the Mel-spectrogram predicted by each back-end model is refined separately.
In the test phase, because of the decoder's stop token, the decoder stops generating new frames once all the values in a generated frame fall below a certain threshold, which can produce blank audio in the initial phase of the model. This stopping condition is therefore removed so that the decoder can synthesize the target emotional speech within a minimal number of iterations.
In the Tacotron2 model, the encoder and decoder map the input text to the Mel-spectrogram; the larger the learning rate, the faster the weights are updated, and the weight decay is set to 10^{-7}. To introduce the monotonic behavior of SMA during training, the stepmono_attention parameter is set to True, and the dropout of the attention mechanism is set to 0.4. The important parameters of each training stage are listed in Table 2; the dimensions of each network layer are the same as in the Tacotron2 model.
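The settings in Table 2 and the text can be collected into a configuration sketch such as the one below; the dictionary keys and the helper function are illustrative names, not the actual configuration interface of the code base.

```python
import torch

# Hyperparameter sketch gathered from Table 2 and the text.
hparams = dict(
    stepmono_attention=True,   # enable stepwise monotonic attention
    attention_dropout=0.4,
    decoder_dropout=0.5,
    zoneout=0.1,
    batch_size=32,
    sample_rate=22050,
    epochs=1000,
)

def build_optimizer(params, fine_tune=False):
    """Adam optimizer with the learning rates and weight decay reported in Table 2."""
    lr = 2e-5 if fine_tune else 2e-3          # fine-tuning vs. initial learning rate
    return torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999), weight_decay=1e-7)
```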

4.3. Subjective Evaluation

To measure the naturalness and emotional expressiveness of the synthesized speech, 15 listeners were given 20 speech items of each emotion in random order and scored them using the mean opinion score (MOS) and the emotion mean opinion score (EMOS). Scores range from 1 to 5 in steps of 0.5, from poor to good quality of the synthesized speech. The evaluation criteria are detailed in Table 3.
An XAB preference test was also conducted: listeners judged which of the baseline method and the proposed method synthesized speech whose emotion was closer to that of the target audio. In all tests, 15 local listeners gave their personal preferences, based on subjective impression, for 20 synthesized emotional utterances of each emotion randomly selected from the test set.
For the trade-off parameter introduced with the pseudo-label loss, values were compared under data sizes of 30, 100, and 300 utterances, respectively, in which the performance on the majority of utterances did not exceed the baseline. Therefore, by introducing the reference loss and setting a reasonable trade-off parameter, the generalization ability of the fine-tuned model can be preserved to some extent. When the trade-off parameter is too large, the proposed fine-tuning method is forced to output a Mel-spectrogram close to that of the initial reference model, and the excessive penalty causes underfitting. We compared θ = 0.1, θ = 0.3, and θ = 0, the last being the baseline in which the pseudo-label loss does not contribute at all. The MOS values evaluating the naturalness of speech and the word error rate (WER) reflecting robustness are shown in Table 4. The final MOS values for each synthesized emotion are shown in Table 5, the EMOS scores in Figure 4, and the results of the XAB preference test in Figure 5. Analysis of the subjective measures shows that the MOS and EMOS scores of SMAL-ET2 are higher than those of the baseline methods, so the method yields better emotional expressiveness.

4.4. Objective Evaluation

For objective evaluation, this paper adopts the Mel-cepstral distortion (MCD) measure, calculated by Equation (10). The result reflects the distance between the spectral features of the synthesized speech and the real target speech; MCD can be regarded as an extension of a simple Euclidean distance. The smaller the MCD value, the higher the spectral similarity between the two and the better the synthesized speech.
$\mathrm{MCD}_{\mathrm{dB}} = \dfrac{10\sqrt{2}}{\ln 10}\sqrt{\dfrac{1}{M}\sum_{m=1}^{M}\left(y_m - \hat{y}_m\right)^2}$. (10)
where $y_m$ and $\hat{y}_m$ are the Mel-cepstral coefficients (MCEPs) of the real and synthesized speech, respectively, and $M$ indicates the dimension of the MCEPs.
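A small NumPy sketch of Equation (10) is given below. It assumes the two MCEP sequences are already time-aligned (e.g., by dynamic time warping) and excludes the energy coefficient; note that it follows the per-coefficient averaging written in Equation (10), whereas some MCD definitions sum over the coefficients instead.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """MCD in dB following Eq. (10) for two (frames, M) MCEP arrays."""
    diff = np.asarray(mcep_ref) - np.asarray(mcep_syn)
    per_frame = (10.0 * np.sqrt(2.0) / np.log(10.0)) * np.sqrt(np.mean(diff ** 2, axis=1))
    return float(np.mean(per_frame))   # average MCD over all frames

# usage with random stand-in data: 120 aligned frames, M = 24 coefficients
rng = np.random.default_rng(0)
print(mel_cepstral_distortion(rng.normal(size=(120, 24)), rng.normal(size=(120, 24))))
```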
From Figure 6, we can see that the MCD values show a decreasing trend for the proposed method, which indicates that the emotional speech it synthesizes is closer to the real target (GT).

4.5. Analyses

The semi-supervised learning approach based on generated pseudo-labels can exploit far more data and greatly reduce the training cost. This semi-supervised scheme, which uses the pseudo-labels provided by the reference model to further guide acoustic model learning, significantly improves the speech quality on the test data. We choose ESD, a recently open-sourced, high-quality emotional speech dataset, fully fine-tune the model for each type of emotion so that it adaptively learns emotional features, and attach the same pre-trained, efficient, high-quality vocoder, achieving more natural synthesis and emotional expression closer to the ground truth with the limited emotional training data.
After evaluating and analyzing the experimental results, the subjective MOS and EMOS scores are improved at the optimal number of iterations, the very low word error rate reflects good robustness, and the synthesized emotional speech is essentially free of pronunciation errors. When SMAL-ET2 was tested, texts not used during training could also be pronounced correctly, showing improved generalization. Overall, the performance of SMAL-ET2 is significantly better than that of the baseline methods.
We use the robust SMA attention mechanism, develop fine-tuning strategies based on the functions of the different modules in the model, and additionally introduce a pseudo-label loss function to guide the learning of emotional speech; this achieves a regularization effect, alleviates the exposure bias that arises when fine-tuning the autoregressive model, and thus improves the generalization ability of the model. The experimental results validate that the proposed method, which combines transfer learning and semi-supervised training, mitigates the low-resource problem by exploiting information from both the source and target domain data, yielding better results on emotional speech synthesis tasks.

5. Conclusions

In this paper, we propose a low-resource emotional speech synthesis method based on knowledge transfer and a pseudo-label loss. The semi-supervised training algorithm transfers acoustic knowledge from pre-trained models to the low-resource speech synthesis task while providing a reference loss that effectively alleviates the overfitting caused by the limited labeled dataset, and the enhanced robust attention mechanism adapts better to unseen emotional speech during fine-tuning. Experiments show that the naturalness, emotional similarity, and robustness of the final synthesized speech are improved compared with the baseline methods.
We also follow emerging research that combines lifelong learning, meta-learning, reinforcement learning, prompt learning, and diffusion models with speech synthesis, which will be the focus of subsequent work. Building on these directions, we plan to extend our method to multi-speaker and multi-lingual settings, especially minority languages and Chinese regional dialects, and eventually to multi-modal audio-visual speech synthesis with fine-grained, controllable emotions. Better data augmentation methods, better emotion representation learning, and real-time emotional speech synthesis with low computing resources are also under consideration.

Author Contributions

Conceptualization, J.Z., M.W. and H.W.; Methodology, J.Z.; Software, J.Z.; Validation, J.Z.; Formal analysis, J.Z. and H.W.; Investigation, J.Z.; Resources, J.Z.; Data curation, J.Z.; Writing—original draft, J.Z.; Writing—review & editing, M.W., G.T. and H.W.; Supervision, M.W. and H.W.; Project administration, G.T.; Funding acquisition, M.W. and G.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Natural Science Foundation of Autonomous Region under Grant 202104120016, in part by the National Natural Science Foundation of China under Grant 2020680012, in part by the Natural Science Foundation of Autonomous Region under Grant 2021D01C118, and in part by the Autonomous Region High-Level Innovative Talent Project under Grant 042419006.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
  2. Wang, Y.; Skerry-Ryan, R.J.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Yang, Z.; Xiao, Y.; Chen, Z.; Bengio, S.; et al. Tacotron: Towards End-to-End Speech Synthesis. arXiv 2017, arXiv:1703.10135. [Google Scholar]
  3. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.J.; et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
  4. Arık, S.Ö.; Chrzanowski, M.; Coates, A.; Diamos, G.; Gibiansky, A.; Kang, Y.; Li, X.; Miller, J.; Ng, A.; Raiman, J.; et al. Deep Voice: Real-Time Neural Text-to-Speech. In Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 195–204. [Google Scholar]
  5. Arik, S.; Diamos, G.; Gibiansky, A.; Miller, J.; Peng, K.; Ping, W.; Raiman, J.; Zhou, Y. Deep Voice 2: Multi-Speaker Neural Text-to-Speech. arXiv 2017, arXiv:1705.08947. [Google Scholar]
  6. Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Raiman, J.; Miller, J. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. arXiv 2017, arXiv:1710.07654. [Google Scholar]
  7. Ren, Y.; Ruan, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. FastSpeech: Fast, Robust and Controllable Text to Speech. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  8. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv 2022, arXiv:2006.04558. [Google Scholar]
  9. Tits, N.; Haddad, K.E.; Dutoit, T. Exploring Transfer Learning for Low Resource Emotional TTS. In Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  10. Zhou, K.; Sisman, B.; Li, H. Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training. arXiv 2021, arXiv:2103.16809. [Google Scholar]
  11. Wang, Y.; Stanton, D.; Zhang, Y.; Skerry-Ryan, R.J.; Battenberg, E.; Shor, J.; Xiao, Y.; Ren, F.; Jia, Y.; Saurous, R.A. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. arXiv 2018, arXiv:1803.09017. [Google Scholar]
  12. Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 17022–17033. [Google Scholar]
  13. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  14. Tits, N.; Haddad, K.E.; Dutoit, T. ASR-Based Features for Emotion Recognition: A Transfer Learning Approach. arXiv 2018, arXiv:1805.09197. [Google Scholar]
  15. Jia, Y.; Zhang, Y.; Weiss, R.; Wang, Q.; Shen, J.; Ren, F.; Chen, Z.; Nguyen, P.; Pang, R.; Lopez Moreno, I.; et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
  16. Cai, Z.; Zhang, C.; Li, M. From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint. Proc. Interspeech 2020, 2020, 3974–3978. [Google Scholar]
  17. Chen, Y.; Assael, Y.; Shillingford, B.; Budden, D.; Reed, S.; Zen, H.; Wang, Q.; Cobo, L.C.; Trask, A.; Laurie, B.; et al. Sample Efficient Adaptive Text-to-Speech. arXiv 2019, arXiv:1809.10460. [Google Scholar]
  18. Zhang, Z.; Tian, Q.; Lu, H.; Chen, L.-H.; Liu, S. AdaDurIAN: Few-Shot Adaptation for Neural Text-to-Speech with DurIAN. arXiv 2020, arXiv:2005.05642. [Google Scholar]
  19. Sharma, M.; Kenter, T.; Clark, R. StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes. In Proceedings of the Interspeech 2020, ISCA, Shanghai, China, 25–29 October 2020; pp. 3550–3554. [Google Scholar]
  20. Moss, H.B.; Aggarwal, V.; Prateek, N.; González, J.; Barra-Chicote, R. BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
  21. Shang, Z.; Huang, Z.; Zhang, H.; Zhang, P.; Yan, Y. Incorporating Cross-Speaker Style Transfer for Multi-Language Text-to-Speech. Proc. Interspeech 2021, 2021, 1619–1623. [Google Scholar]
  22. Bollepalli, B.; Juvela, L.; Alku, P. Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System. In Proceedings of the Interspeech 2019, ISCA, Graz, Austria, 15–19 September 2019; pp. 2833–2837. [Google Scholar]
  23. Debnath, A.; Patil, S.S.; Nadiger, G.; Ganesan, R.A. Low-Resource End-to-End Sanskrit TTS Using Tacotron2, WaveGlow and Transfer Learning. In Proceedings of the 2020 IEEE 17th India Council International Conference (INDICON), New Delhi, India, 10–13 December 2020. [Google Scholar]
  24. Kuzmin, A.D.; Ivanov, S.A. Transfer Learning for the Russian Language Speech Synthesis. In Proceedings of the 2021 International Conference on Quality Management, Transport and Information Security, Information Technologies (IT&QM&IS), Yaroslavl, Russia, 6–10 September 2021; pp. 507–510. [Google Scholar]
  25. Huang, A.; Bao, F.; Gao, G.; Shan, Y.; Liu, R. Mongolian Emotional Speech Synthesis Based on Transfer Learning and Emotional Embedding. In Proceedings of the International Conference on Asian Language Processing (IALP), Yantai, China, 23–25 December 2021; pp. 78–83. [Google Scholar]
  26. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  27. Taigman, Y.; Wolf, L.; Polyak, A.; Nachmani, E. VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. arXiv 2018, arXiv:1707.06588. [Google Scholar]
  28. Lee, Y.; Rabiee, A.; Lee, S.-Y. Emotional End-to-End Neural Speech Synthesizer. arXiv 2017, arXiv:1711.05447. [Google Scholar]
  29. Liu, R.; Sisman, B.; Li, J.; Bao, F.; Gao, G.; Li, H. Teacher-Student Training for Robust Tacotron-Based TTS. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
  30. Lee, D.-H. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. In Proceedings of the ICML 2013 Workshop on Challenges in Representation Learning, Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
  31. Higuchi, Y.; Moritz, N.; Roux, J.L.; Hori, T. Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022. [Google Scholar]
  32. Chung, Y.-A.; Wang, Y.; Hsu, W.-N.; Zhang, Y.; Skerry-Ryan, R.J. Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar]
  33. Zhang, X.; Wang, J.; Cheng, N.; Xiao, J. Semi-Supervised Learning Based on Reference Model for Low-Resource TTS. In Proceedings of the 2022 18th International Conference on Mobility, Sensing and Networking (MSN), Guangzhou, China, 14–16 December 2022. [Google Scholar]
  34. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar]
  35. Schuller, D.M.; Schuller, B.W. A Review on Five Recent and Near-Future Developments in Computational Processing of Emotion in the Human Voice. Emot. Rev. 2021, 13, 44–50. [Google Scholar] [CrossRef]
  36. Tachibana, H.; Uenoyama, K.; Aihara, S. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4784–4788. [Google Scholar]
  37. Zhang, J.-X.; Ling, Z.-H.; Dai, L.-R. Forward Attention in Sequence-to-Sequence Acoustic Modelling for Speech Synthesis. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4789–4793. [Google Scholar]
  38. Battenberg, E.; Skerry-Ryan, R.J.; Mariooryad, S.; Stanton, D.; Kao, D.; Shannon, M.; Bagby, T. Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
  39. He, M.; Deng, Y.; He, L. Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS. In Proceedings of the Interspeech 2019, ISCA, Graz, Austria, 15–19 September 2019; pp. 1293–1297. [Google Scholar]
  40. Zhou, K.; Sisman, B.; Liu, R.; Li, H. Emotional Voice Conversion: Theory, Databases and ESD. Speech Commun. 2022, 137, 1–18. [Google Scholar] [CrossRef]
  41. Low-Resource Emotional Speech Synthesis: Transfer Learning and Data Requirements. SpringerLink. Available online: https://linkspringer.53yu.com/chapter/10.1007/978-3-031-20980-2_43 (accessed on 6 April 2023).
Figure 1. Illustration of low-resource transfer learning.
Figure 2. Network structure of SMAL-ET2 model.
Figure 3. Flow chart of the proposed low-resource emotional TTS method.
Figure 4. EMOS of EGST, ETaco2, and SMAL-ET2.
Figure 5. XAB preference test of ETaco2 and SMAL-ET2.
Figure 6. Averaged MCD value of EGST, ETaco2, and SMAL-ET2.
Table 1. Partition of ESD dataset.

ESD (English Subset) | Number of Utterances
Evaluation set | 20
Test set | 30
Training set | 300
Table 2. Settings of the hyperparameters.

Hyperparameters | Settings
Adam optimizer β1 | 0.9
Adam optimizer β2 | 0.999
Batch size | 32
Initial learning rate | 0.002
Fine-tune learning rate | 0.00002
Decay learning rate | True
Sample rate | 22,050
Hidden vectors of SMA | 128
Attention dropout | 0.4
Decoder dropout | 0.5
Zoneout | 0.1
Epochs | 1000
Table 3. Evaluation score standards of MOS and EMOS.

Score | MOS Standards | EMOS Standards
0–1.0 | Very poor sound quality, difficult to understand; large latency, poor communication | Emotional similarity unknown
1.0–2.0 | Mediocre sound quality, not very clear to hear; large delay, not clear or smooth; communication needs to be repeated several times | Blurred emotional similarity
2.0–3.0 | Sound quality is not bad, mostly audible; some delay and noise, but acceptable | Emotional similarity is acceptable
3.0–4.0 | Sound quality is very good and clearly audible; only a little delay; willingly accepted | Emotional similarity willingly accepted
4.0–5.0 | Sound quality is particularly good, very clear and natural; almost no delay, smooth communication | Ideal emotional similarity
Table 4. Numerical comparison of models under different dataset sizes.

Data Size | θ | MOS (Neutral) | WER (%)
30 | 0 | 2.65 ± 0.20 | 2.5
100 | 0 | 3.23 ± 0.18 | 2.0
300 | 0 | 3.85 ± 0.07 | 1.5
30 | 0.1 (BL) | 3.20 ± 0.28 | 1.5
100 | 0.1 (BL) | 3.49 ± 0.22 | 0.5
300 | 0.1 (BL) | 4.14 ± 0.10 | 0.2
30 | 0.3 | 3.14 ± 0.16 | 2.0
100 | 0.3 | 3.93 ± 0.35 | 1.0
300 | 0.3 | 3.96 ± 0.09 | 0.3
Note: ± is the standard deviation symbol.
Table 5. MOS results with 95% confidence intervals.

Emotion | EGST | ETaco2 | SMAL-ET2 | Ground Truth
Neutral | 3.15 ± 0.31 | 3.52 ± 0.23 | 3.97 ± 0.28 | 4.81 ± 0.16
Sad | 3.32 ± 0.29 | 3.55 ± 0.21 | 3.77 ± 0.15 | 4.83 ± 0.19
Happy | 3.47 ± 0.16 | 3.61 ± 0.09 | 4.15 ± 0.19 | 4.94 ± 0.09
Surprise | 3.19 ± 0.27 | 3.43 ± 0.30 | 3.96 ± 0.24 | 4.86 ± 0.07
Angry | 2.92 ± 0.20 | 3.34 ± 0.17 | 3.88 ± 0.25 | 4.81 ± 0.13
Note: ± is the standard deviation symbol.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
