Open Access Article

Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem

School of Computer Science and Engineering, Central South University, Changsha 410083, China
*
Author to whom correspondence should be addressed.
Information 2019, 10(4), 131; https://doi.org/10.3390/info10040131
Received: 27 January 2019 / Revised: 6 April 2019 / Accepted: 8 April 2019 / Published: 9 April 2019
(This article belongs to the Section Artificial Intelligence)
Text-to-speech synthesis is a technique for producing synthetic, human-like speech by computer. In recent years, speech synthesis techniques have advanced and have been employed in many applications, such as automatic translation and car navigation systems. End-to-end text-to-speech synthesis has gained considerable research interest because, compared to traditional models, an end-to-end model is easier to design and more robust. Tacotron 2 is an integrated, state-of-the-art end-to-end speech synthesis system that can directly predict close-to-natural human speech from raw text. However, a gap remains between synthesized speech and natural speech. Suffering from an over-smoothness problem, Tacotron 2 produces 'averaged' speech, which makes the synthesized speech sound unnatural and inflexible. In this work, we first propose an estimated network (Es-Network), which captures general features from a raw mel spectrogram in an unsupervised manner. Then, we design Es-Tacotron2 by employing the Es-Network to calculate the estimated mel spectrogram residual and setting it as an additional prediction task of Tacotron 2, so that the model focuses more on predicting the individual features of the mel spectrogram. Experiments show that, compared to the original Tacotron 2 model, Es-Tacotron2 produces more variable decoder output and synthesizes more natural and expressive speech.
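The residual-as-auxiliary-task idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss formulation, so the squared-error terms, the `weight` parameter, and all function names here are assumptions. In particular, `estimate_general_features` is a hypothetical stand-in (a moving average over time) for the pre-trained Es-Network, used only so the example runs.

```python
import numpy as np

def estimate_general_features(mel, width=5):
    """Hypothetical stand-in for the pre-trained Es-Network.

    Smooths each mel channel along time with a moving average; the real
    Es-Network is a learned model, which this sketch does not reproduce.
    """
    kernel = np.ones(width) / width
    channels = [np.convolve(mel[:, c], kernel, mode="same")
                for c in range(mel.shape[1])]
    return np.stack(channels, axis=1)

def multitask_loss(pred_mel, pred_residual, target_mel, weight=1.0):
    """Assumed multi-task objective: mel loss plus weighted residual loss."""
    # Residual = raw mel spectrogram minus its estimated "general" part.
    target_residual = target_mel - estimate_general_features(target_mel)
    mel_loss = np.mean((pred_mel - target_mel) ** 2)
    residual_loss = np.mean((pred_residual - target_residual) ** 2)
    return mel_loss + weight * residual_loss, target_residual

# Toy data: 40 frames of an 80-channel mel spectrogram (random, for shape only).
rng = np.random.default_rng(0)
target = rng.normal(size=(40, 80))

# A prediction that matches both targets exactly incurs no loss.
exact_residual = target - estimate_general_features(target)
loss_exact, residual = multitask_loss(target, exact_residual, target)

# An over-smoothed prediction (general features only, zero residual) is penalized.
smooth = estimate_general_features(target)
loss_smooth, _ = multitask_loss(smooth, np.zeros_like(target), target)
```

In this toy setup, a prediction that reproduces only the smoothed features and predicts a zero residual is penalized on both terms, which is the intuition behind adding the residual as a second prediction target.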
Keywords: speech synthesis; over-smoothness problem; estimated network; multi-task learning; end-to-end
MDPI and ACS Style

Liu, Y.; Zheng, J. Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem. Information 2019, 10, 131.
