A Review of Deep Learning Based Speech Synthesis
Abstract
1. Introduction
2. An Overview of Speech Synthesis
2.1. Basic Concept of Speech Synthesis
2.2. The History of Speech Synthesis
2.3. Traditional Speech Synthesis Technology
2.3.1. Concatenative Speech Synthesis
2.3.2. Parametric Speech Synthesis
3. Statistical Parametric Speech Synthesis
3.1. Text Analysis
3.2. Parameter Prediction
3.2.1. HMM-Based Parameter Prediction
3.2.2. DNN-Based Parameter Prediction
3.3. Vocoder-Based Speech Synthesis
4. Deep Learning Based Speech Synthesis
4.1. Restricted Boltzmann Machines for Speech Synthesis
4.2. Multi-Distribution Deep Belief Networks for Speech Synthesis
4.3. Speech Synthesis Using Deep Mixture Density Networks
4.3.1. Mixture Density Networks
4.3.2. Deep MDN-Based Speech Synthesis
4.4. Deep Bidirectional LSTM-Based Speech Synthesis
4.4.1. BLSTM
4.4.2. Deep BLSTM-Based Speech Synthesis
4.5. Sequence-to-Sequence Speech Synthesis
4.6. End-to-End Speech Synthesis
4.6.1. Speech Synthesis Based on WaveNet
4.6.2. Speech Synthesis Based on Tacotron
4.6.3. Speech Synthesis Based on Convolutional Neural Networks (CNNs)
5. Discussion
- Investigating context features hidden in end-to-end speech synthesis. End-to-end TTS systems, which mainly address the back-end, have achieved state-of-the-art performance since they were proposed. However, there has been little progress in front-end text analysis, which extracts the context (linguistic) features that help bridge the gap between text and speech [78]. Demonstrating what types of context information an end-to-end speech synthesis system actually uses is therefore a promising direction for future work.
- Semi-supervised or unsupervised training in end-to-end speech synthesis. Although end-to-end TTS models have shown excellent results, they typically require large amounts of high-quality <text, speech> pairs for training, which are expensive and time-consuming to collect. Improving the data efficiency of end-to-end TTS training by leveraging large-scale, publicly available unpaired text and speech recordings is therefore of great significance [79].
- Application to other speech-related scenarios. Beyond the text-to-speech application covered in this paper, applying these methods to other scenarios such as voice conversion, audio-visual speech synthesis, speech translation and cross-lingual speech synthesis is also a promising direction.
- The combination of software and hardware. At present, most deep neural networks require a large amount of computation, so parallelization is indispensable for improving network efficiency. In general, there are two ways to implement parallelization: distributing the computation across multiple machines, or using GPU parallelization. However, writing GPU code is still time-consuming and laborious for most researchers, so progress depends on hardware and software vendors cooperating to provide the industry with more intelligent programming tools.
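The data-parallel pattern mentioned in the last item can be illustrated with a minimal sketch. It is not taken from any of the reviewed systems: the function names (`dense_forward`, `split`, `data_parallel_forward`) are hypothetical, and worker threads merely stand in for separate machines or GPU devices. Because of Python's global interpreter lock, this version is not expected to run faster than the serial one; it only shows the structure, in which a batch is split into shards, each shard is processed independently with the same weights, and the outputs are concatenated in the original order.

```python
from concurrent.futures import ThreadPoolExecutor


def dense_forward(batch, weights, bias):
    """Apply y = W x + b to every input vector in `batch` (pure Python)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in batch
    ]


def split(batch, n_shards):
    """Split `batch` into at most `n_shards` contiguous shards of near-equal size."""
    size = max(1, -(-len(batch) // n_shards))  # ceiling division, at least 1
    return [batch[i:i + size] for i in range(0, len(batch), size)]


def data_parallel_forward(batch, weights, bias, n_workers=2):
    """Process each shard in its own worker and concatenate the results in order."""
    shards = split(batch, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map returns results in the same order as the input shards.
        outputs = list(pool.map(lambda s: dense_forward(s, weights, bias), shards))
    return [row for shard_out in outputs for row in shard_out]


# Toy layer with 2 inputs and 2 outputs (illustrative values only).
weights = [[1.0, 0.0], [0.5, 0.5]]
bias = [0.0, 1.0]
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

serial = dense_forward(batch, weights, bias)
parallel = data_parallel_forward(batch, weights, bias)
```

In this sketch the sharded result equals the serial one because every input vector is processed independently; in a real framework the same split would typically be handled by the library rather than written by hand.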
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
TTS | Text-to-Speech |
SPSS | Statistical Parametric Speech Synthesis |
HMM | Hidden Markov Model |
DL | Deep Learning |
DEC | Digital Equipment Corporation |
POS | Part-of-Speech |
DNN | Deep Neural Network |
LPC | Linear Prediction Coefficient |
PSOLA | Pitch Synchronous OverLap Add |
CD-HMM | Continuous Density Hidden Markov Model |
GMM | Gaussian Mixture Model |
RBM | Restricted Boltzmann Machines |
DBN | Deep Belief Networks |
MLE | Maximum Likelihood Estimation |
MDN | Mixture Density Network |
RNN | Recurrent Neural Network |
LSTM | Long Short-Term Memory |
BLSTM | Bidirectional Long Short-Term Memory |
CNN | Convolutional Neural Network |
DAE | Deep Auto-Encoder |
VAE | Variational Auto-Encoder |
MOS | Mean Opinion Score |
References
- Klatt, D.H. Review of text-to-speech conversion for English. J. Acoust. Soc. Am. 1987, 82, 737–793.
- Allen, J.; Hunnicutt, M.S.; Klatt, D.H.; Armstrong, R.C.; Pisoni, D.B. From Text to Speech: The MITalk System; Cambridge University Press: New York, NY, USA, 1987.
- Murray, I.R.; Arnott, J.L.; Rohwer, E.A. Emotional stress in synthetic speech: Progress and future directions. Speech Commun. 1996, 20, 85–91.
- Festival. Available online: http://www.cstr.ed.ac.uk/projects/festival/ (accessed on 3 July 2019).
- Chu, M.; Peng, H.; Zhao, Y. Microsoft Mulan. A bilingual TTS system. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), Hong Kong, China, 6–10 April 2003; pp. 264–267.
- Tokuda, K.; Nankaku, Y.; Toda, T. Speech synthesis based on hidden Markov models. Proc. IEEE 2013, 101, 1234–1252.
- Murray, I.R. Simulating Emotion in Synthetic Speech; University of Dundee: Dundee, UK, 1989.
- Tokuda, K.; Yoshimura, T.; Masuko, T.; Kobayashi, T.; Kitamura, T. Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 5–9 June 2000; Volume 3, pp. 1315–1318.
- Ratnaparkhi, A. A Simple Introduction to Maximum Entropy Models for Natural Language Processing; University of Pennsylvania: Philadelphia, PA, USA, 1997.
- Yang, J.A.; Wang, Y.; Liu, H.; Li, J.H.; Lu, J. Deep learning theory and its application in speech recognition. Commun. Countermeas. 2014, 33, 1–5.
- Graves, A. Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin, Germany, 2012.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
- Graves, A.; Mohamed, A.R.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
- Zen, H.; Tokuda, K.; Black, A.W. Statistical parametric speech synthesis. Speech Commun. 2009, 51, 1039–1064.
- Xu, S.H. Study on HMM-Based Chinese Speech Synthesis; Beijing University of Posts and Telecommunications: Beijing, China, 2007.
- Sotelo, J.; Mehri, S.; Kumar, K.; Santos, J.F.; Kastner, K.; Courville, A.; Bengio, Y. Char2wav: End-to-end Speech Synthesis. In Proceedings of the International Conference on Learning Representations Workshop, Toulon, France, 24–26 April 2017.
- Klatt, D.H. Software for a cascade/parallel formant synthesizer. J. Acoust. Soc. Am. 1980, 67, 971–995.
- Moulines, E.; Charpentier, F. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphone. Speech Commun. 1990, 9, 453–456.
- Yoshimura, T.; Tokuda, K.; Masuko, T.; Kobayashi, T.; Kitamura, T. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of the Sixth European Conference on Speech Communication and Technology (EUROSPEECH’99), Budapest, Hungary, 5–9 September 1999; pp. 2347–2350.
- Zen, H.; Senior, A.; Schuster, M. Statistical parametric speech synthesis using deep neural networks. In Proceedings of the 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7962–7966.
- Atal, B.S.; Hanauer, S.L. Speech analysis and synthesis by linear prediction of the speech wave. J. Acoust. Soc. Am. 1971, 50, 637–655.
- Wu, Y.J. Study on the HMM-Based Speech Synthesis Techniques; University of Science and Technology of China: Hefei, China, 2006.
- Cataldo, E.; Leta, F.R.; Lucero, J.; Nicolato, L. Synthesis of voiced sounds using low-dimensional models of the vocal cords and time-varying subglottal pressure. Mech. Res. Commun. 2006, 33, 250–260.
- Schröder, M. Emotional speech synthesis: A review. In Proceedings of the Seventh European Conference on Speech Communication and Technology, Aalborg, Denmark, 2–7 September 2001.
- Zen, H.; Nose, T.; Yamagishi, J.; Sako, S.; Masuko, T.; Black, A.W.; Tokuda, K. The HMM-based speech synthesis system (HTS) version 2.0. In Proceedings of the ISCA Workshop on Speech Synthesis, Bonn, Germany, 22–24 August 2007; pp. 294–299.
- Meng, F.B. Analysis and Generation of Focus in Continuous Speech; Tsinghua University: Beijing, China, 2013.
- Zhuang, X.; Huang, J.; Potamianos, G.; Hasegawa-Johnson, M. Acoustic fall detection using Gaussian mixture models and GMM supervectors. In Proceedings of the 34th IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 69–72.
- Lin, Z.H. Research on Speech Synthesis Technology Based on Statistical Acoustic Modeling; University of Science and Technology of China: Hefei, China, 2008.
- Fan, Y.; Qian, Y.; Xie, F.L.; Soong, F.K. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014.
- Zen, H. Acoustic modeling in statistical parametric speech synthesis-from HMM to LSTM-RNN. In Proceedings of the First International Workshop on Machine Learning in Spoken Language Processing (MLSLP2015), Aizu, Japan, 19–20 September 2015.
- Dudley, H. Remaking speech. J. Acoust. Soc. Am. 1939, 11, 169–177.
- Kawahara, H.; Masuda-Katsuse, I.; Cheveigne, A. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction. Speech Commun. 1999, 27, 187–207.
- HMM/DNN-Based Speech Synthesis System (HTS). Available online: http://hts.sp.nitech.ac.jp/ (accessed on 15 March 2015).
- Banno, H.; Hata, H.; Morise, M.; Takahashi, T.; Irino, T.; Kawahara, H. Implementation of realtime STRAIGHT speech manipulation system: Report on its first implementation. Acoust. Sci. Technol. 2007, 28, 140–146.
- STRAIGHT. Available online: https://github.com/shuaijiang/STRAIGHT (accessed on 25 July 2018).
- Flanagan, J.L.; Golden, R.M. Phase vocoder. Bell Syst. Tech. J. 1966, 45, 1493–1509.
- McAulay, R.; Quatieri, T.F. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust. Speech Signal Process. 1986, 34, 744–754.
- Morise, M.; Yokomori, F.; Ozawa, K. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 2016, 99, 1877–1884.
- World. Available online: https://github.com/mmorise/World (accessed on 18 May 2019).
- Ling, Z.H.; Deng, L.; Yu, D. Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis. In Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7825–7829.
- Deng, L.; Seltzer, M.L.; Yu, D.; Acero, A.; Mohamed, A.R.; Hinton, G. Binary coding of speech spectrograms using a deep auto-encoder. In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association, Makuhari, Japan, 26–30 September 2010; pp. 1692–1695.
- Gehring, J.; Miao, Y.; Metze, F.; Waibel, A. Extracting deep bottleneck features using stacked auto-encoders. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 3377–3381.
- Kang, S.Y.; Qian, X.J.; Meng, H. Multi-distribution deep belief network for speech synthesis. In Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8012–8016.
- Zen, H.; Senior, A. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 4–9 May 2014; pp. 3844–3848.
- Bishop, C. Mixture Density Networks; Tech. Rep. NCRG/94/004; Neural Computing Research Group, Aston University: Birmingham, UK, 1994.
- Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM networks. In Proceedings of the IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; pp. 2047–2052.
- Graves, A.; Fernandez, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labeling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
- Li, R.N.; Wu, Z.Y.; Liu, X.Y.; Meng, H.; Cai, L.H. Multi-task learning of structured output layer bidirectional LSTMs for speech synthesis. In Proceedings of the 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, 5–9 March 2017; pp. 5510–5514.
- Yu, Q.J.; Liu, P.; Wu, Z.Y.; Kang, S.Y.; Meng, H.; Cai, L.H. Learning cross-lingual information with multilingual BLSTM for speech synthesis of low-resource languages. In Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 20–25 March 2016; pp. 5545–5549.
- Caruana, R. Multitask Learning; Springer: Berlin, Germany, 1998.
- Seltzer, M.L.; Droppo, J. Multi-task learning in deep neural networks for improved phoneme recognition. In Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6965–6969.
- Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025.
- Jaitly, N.; Le, Q.V.; Vinyals, O.; Sutskever, I.; Sussillo, D.; Bengio, S. An online sequence-to-sequence model using partial conditioning. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 5067–5075.
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
- Wang, W.; Xu, S.; Xu, B. First step towards end-to-end parametric TTS synthesis: Generating spectral parameters with neural attention. In Proceedings of the Seventeenth Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 8–12 September 2016; pp. 2243–2247.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
- Zhang, J.X.; Ling, Z.H.; Dai, L.R. Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In Proceedings of the 43rd IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 4789–4793.
- Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
- Oord, A.V.D.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A. Conditional image generation with PixelCNN decoders. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4790–4798.
- Oord, A.V.D.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. arXiv 2016, arXiv:1601.06759.
- Oord, A.V.D.; Li, Y.; Babuschkin, I.; Simonyan, K.; Vinyals, O.; Kavukcuoglu, K.; Casagrande, N. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv 2017, arXiv:1711.10433.
- Arik, S.Ö.; Chrzanowski, M.; Coates, A.; Diamos, G.; Gibiansky, A.; Kang, Y.; Sengupta, S. Deep voice: Real-time neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 195–204.
- Wang, Y.; Skerry-Ryan, R.J.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Le, Q. Tacotron: Towards end-to-end speech synthesis. arXiv 2017, arXiv:1703.10135.
- Wang, Y.; Skerry-Ryan, R.J.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Le, Q.V. Tacotron: A fully end-to-end text-to-speech synthesis model. arXiv 2017, arXiv:1703.10135.
- Griffin, D.; Lim, J. Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 236–243.
- Barron, A. Implementation of Google’s Tacotron in TensorFlow. Available online: https://github.com/Kyubyong/tacotron (accessed on 20 October 2018).
- Ito, K. Tacotron Speech Synthesis Implemented in TensorFlow, with Samples and a Pre-Trained Model. Available online: https://github.com/keithito/tacotron (accessed on 20 October 2018).
- Yamamoto, R. PyTorch Implementation of Tacotron Speech Synthesis Model. Available online: https://github.com/r9y9/tacotron_pytorch (accessed on 20 October 2018).
- Tachibana, H.; Uenoyama, K.; Aihara, S. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In Proceedings of the 43rd IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 4784–4788.
- Zhang, Y.J.; Pan, S.; He, L.; Ling, Z.H. Learning latent representations for style control and transfer in end-to-end speech synthesis. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 6945–6949.
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2014, arXiv:1312.6114.
- Gibiansky, A.; Arik, S.; Diamos, G.; Miller, J.; Peng, K.; Ping, W.; Zhou, Y. Deep voice 2: Multi-speaker neural text-to-speech. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2962–2970.
- Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Saurous, R.A. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the 43rd IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783.
- Yasuda, Y.; Wang, X.; Takaki, S.; Yamagishi, J. Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 6905–6909.
- Nakamura, K.; Hashimoto, K.; Oura, K.; Nankaku, Y.; Tokuda, K. Singing voice synthesis based on convolutional neural networks. arXiv 2019, arXiv:1904.06868.
- Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. In Proceedings of the Sixth International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–16.
- Chen, H.; Liu, X.; Yin, D.; Tang, J. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explor. Newsl. 2017, 19, 25–35.
- Mametani, K.; Kato, T.; Yamamoto, S. Investigating context features hidden in End-to-End TTS. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 6920–6924.
- Chung, Y.A.; Wang, Y.; Hsu, W.N.; Zhang, Y.; Skerry-Ryan, R.J. Semi-supervised training for improving data efficiency in end-to-end speech synthesis. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 6940–6944.
Methods | Advantages | Disadvantages |
---|---|---|
HMM | Flexible in changing voice characteristics, and the system is robust | The acoustic features are over-smoothed, making the generated speech sound muffled |
RBM | Can better describe the distribution of high-dimensional spectral envelopes, alleviating the over-smoothing problem | Suffers from the fragmentation problem of training data |
DBN | Does not suffer from the training data fragmentation problem and reduces the over-smoothing problem | The quality of the generated speech will be degraded |
DMDN | Can solve the single modality problem | Can only leverage limited contexts, and each frame is mapped independently |
DBLSTM | Can fully leverage contextual information | Still needs a vocoder to synthesize the waveform |
WaveNet | Can produce high-quality speech waveforms | Too slow, and errors from the front-end affect the synthesis quality |
Tacotron | Fully end-to-end speech synthesis model that can produce high-quality speech waveforms | Quite costly to train |
CNN | Fast to train | The speech quality might be degraded |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ning, Y.; He, S.; Wu, Z.; Xing, C.; Zhang, L.-J. A Review of Deep Learning Based Speech Synthesis. Appl. Sci. 2019, 9, 4050. https://doi.org/10.3390/app9194050