Search Results (1)

Search Parameters:
Keywords = expressive voice cloning

16 pages, 2854 KiB  
Article
ZSE-VITS: A Zero-Shot Expressive Voice Cloning Method Based on VITS
by Jiaxin Li and Lianhai Zhang
Electronics 2023, 12(4), 820; https://doi.org/10.3390/electronics12040820 - 6 Feb 2023
Cited by 6 | Viewed by 10685
Abstract
Voice cloning aims to synthesize speech in a new speaker’s timbre from a small amount of that speaker’s speech. Current voice cloning methods, which focus on modeling speaker timbre, can synthesize speech with a similar timbre. However, the prosody they produce is flat: it lacks expressiveness, and the expressiveness of the cloned speech cannot be controlled. To solve this problem, we propose a novel method, ZSE-VITS (zero-shot expressive VITS), based on the end-to-end speech synthesis model VITS. Specifically, we use VITS as the backbone network and add the speaker recognition model TitaNet as the speaker encoder to realize zero-shot voice cloning. We use explicit prosody information to avoid interference from the speaker information, and we adjust speech prosody directly through prosody information prediction and prosody fusion. We widen the pitch distribution of the training datasets using pitch augmentation to improve the generalization ability of the prosody model, and we fine-tune the prosody predictor alone on the emotion corpus so that it learns prosody prediction for various styles. Objective and subjective evaluations on open datasets show that our method can generate more expressive speech and allows prosody information to be adjusted manually without affecting the similarity of speaker timbre.
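The abstract does not say how the pitch augmentation is implemented. As a rough illustration of the general idea only, the sketch below shifts frame-level F0 contours by random semitone offsets so that the pooled log-F0 values cover a wider range. The function names (`shift_f0`, `augment_contours`, `log_f0_std`), the ±4 semitone range, and the convention that `0.0` marks an unvoiced frame are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def shift_f0(f0_hz, semitones):
    """Shift an F0 contour (Hz per frame) by a number of semitones.

    Frames equal to 0.0 are treated as unvoiced and left unchanged
    (an assumed convention, not one stated in the abstract).
    """
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0.0 else 0.0 for f in f0_hz]


def augment_contours(contours, max_semitones=4.0, copies=2, seed=0):
    """Return the original contours plus `copies` randomly shifted versions of each.

    The shift range and number of copies are hypothetical defaults.
    """
    rng = random.Random(seed)
    augmented = [list(c) for c in contours]
    for contour in contours:
        for _ in range(copies):
            offset = rng.uniform(-max_semitones, max_semitones)
            augmented.append(shift_f0(contour, offset))
    return augmented


def log_f0_std(contours):
    """Standard deviation of log-F0 over all voiced frames in all contours."""
    values = [math.log(f) for c in contours for f in c if f > 0.0]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


if __name__ == "__main__":
    original = [[200.0, 210.0, 0.0, 190.0]]
    augmented = augment_contours(original)
    print("log-F0 std before:", log_f0_std(original))
    print("log-F0 std after: ", log_f0_std(augmented))
```

In a real pipeline the waveform itself would also need to be pitch-shifted so that the audio and the extracted F0 stay consistent; this sketch only manipulates the contour values to show how the distribution spreads.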
(This article belongs to the Section Artificial Intelligence)