Article

High-Quality Text-to-Speech Implementation via Active Shallow Diffusion Mechanism

Junlin Deng, Ruihan Hou, Yan Deng, Yongqiu Long and Ning Wu

1 Key Laboratory of Beibu Gulf Offshore Engineering Equipment and Technology, Beibu Gulf University, Qinzhou 535011, China
2 School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(3), 833; https://doi.org/10.3390/s25030833
Submission received: 3 December 2024 / Revised: 27 January 2025 / Accepted: 28 January 2025 / Published: 30 January 2025
(This article belongs to the Special Issue Sensors and Machine-Learning Based Signal Processing)

Abstract

Denoising diffusion probabilistic models (DDPMs) have proven useful in text-to-speech (TTS) tasks; however, traditional diffusion models struggle with real-time processing because they require hundreds of sampling steps during iterative inference. In this work, a fast-inference, efficient, two-stage diffusion-based acoustic model for TTS, the Cascaded MixGAN-TTS (CMG-TTS), is proposed to address this problem. An active shallow diffusion mechanism is adopted to divide the CMG-TTS training process into two stages. Specifically, a basic acoustic model is trained in the first stage to provide valuable a priori knowledge for the second stage; for this basic acoustic modeling, a linguistic encoder based on a mixture alignment mechanism is introduced to work with pitch and energy predictors. A post-net is then used to optimize the mel-spectrogram reconstruction performance. The CMG-TTS is evaluated on the AISHELL3 and LJSpeech datasets, and the experiments show that it achieves satisfactory results in both subjective and objective evaluation metrics with only one denoising step. Compared with other diffusion-based TTS models, the CMG-TTS obtains a leading score in the real time factor (RTF), and ablation studies confirm that both stages of the CMG-TTS are effective.

1. Introduction

Deep learning-based speech synthesis techniques have played an increasingly important role in generating near-human voices, giving rise to many excellent models. Text-to-speech (TTS) synthesizes high-quality audio signals from input text sequences and typically comprises three components: a text analysis frontend, an acoustic model, and a neural vocoder [1,2,3,4]. Before model training begins, the text sequences are passed through the text frontend, where they are normalized and converted into phoneme sequences. The acoustic model converts the phoneme sequences into acoustic features in the form of a time-frequency spectrogram (e.g., a mel-spectrogram). The neural vocoder then converts the acoustic features into waveforms. A common training approach is to train the acoustic model and the neural vocoder separately and to synthesize the speech signal by passing the mel-spectrogram, as an intermediate representation, into the vocoder [5,6,7].
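The pipeline just described can be summarized in a few lines of code. The sketch below is purely illustrative: all three stages are dummy stand-ins (the function names and shapes are assumptions, not part of CMG-TTS), and it only shows how text flows through the frontend, acoustic model, and vocoder.

```python
import numpy as np

# Illustrative sketch of the standard TTS pipeline (text frontend -> acoustic
# model -> vocoder). All three stages are placeholders, not real components.

def text_frontend(text: str) -> list[str]:
    # Placeholder normalization + grapheme-to-phoneme step.
    return list(text.lower())

def acoustic_model(phonemes: list[str], n_mels: int = 80) -> np.ndarray:
    # Placeholder: map each phoneme to one random mel-spectrogram frame.
    return np.random.randn(len(phonemes), n_mels)

def vocoder(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    # Placeholder: expand each mel frame to hop_length waveform samples.
    return np.random.randn(mel.shape[0] * hop_length)

waveform = vocoder(acoustic_model(text_frontend("hello world")))
print(waveform.shape)
```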
Neural network-based autoregressive models have demonstrated strong performance in TTS tasks; they predict frame by frame and aggregate the predictions of all frames to obtain the audio samples for the input text [8,9,10,11,12]. However, autoregressive models often suffer from problems such as word skipping, repetition, and slow inference, and non-autoregressive models have become a focus of research to address these challenges [13,14]. Non-autoregressive models usually use a feed-forward block structure to process the input phoneme sequences in parallel, which speeds up audio synthesis. However, this parallel processing requires strict alignment between the text sequences and the mel-spectrogram, which can be achieved with an external alignment tool such as the Montreal Forced Aligner (MFA) or with a knowledge distillation approach [13,14]. Normalizing flow and dynamic programming approaches can also be applied to match monotonic alignment information between text sequences and the mel-spectrogram [15,16,17].
Denoising diffusion probabilistic models (DDPMs), a powerful class of generative models, have recently demonstrated strong modeling performance in image generation and speech synthesis [18,19,20,21]. DDPMs consist of two processes: diffusion and denoising. In the diffusion process, small amounts of random noise are combined with the data through a T-step Markov chain, while in the denoising process the added noise is gradually removed through a parameterized T-step Markov chain. DiffSinger [22] was the first acoustic model to apply the diffusion model to music generation, converting a noise signal into a mel-spectrogram conditioned on the music score. To further improve speech quality and speed up inference, DiffSinger uses a shallow diffusion mechanism to make full use of the prior knowledge extracted from a basic model, and a boundary prediction method is trained to determine the denoising step T; in this way, the number of denoising steps can be reduced to 70. In a subsequent effort, DiffGAN-TTS [23] adopted a generative adversarial network (GAN) to model the denoising distribution, and an active shallow diffusion structure similar to that of DiffSinger was applied to further reduce the number of denoising steps to 1. Diffusion-based speech synthesis models can thus achieve stable and efficient training by optimizing the evidence lower bound (ELBO).
In this research, we concentrate on the acoustic modeling of speech, employing a two-stage architecture named Cascaded MixGAN-TTS (CMG-TTS) to optimize both the quality of the synthesized audio and the efficiency of model inference. The CMG-TTS incorporates a linguistic encoder featuring a mixture alignment structure, into which pitch and energy predictors are integrated; these significantly enhance the prosody of the output audio. To improve the reconstruction of the mel-spectrogram, a post-net based on a deep convolutional network is implemented. For the denoising process, a GAN structure is adopted, which circumvents the numerous denoising steps that diffusion models typically require when Gaussian functions are used. Moreover, an active shallow diffusion mechanism is applied to further reduce the required number of denoising steps. The CMG-TTS was evaluated on the AISHELL3 [24] and LJSpeech datasets. The experimental results demonstrate that, even with a single denoising step, satisfactory outcomes are achieved in terms of the predicted mel-spectrogram, attention alignment, and audio quality. These findings validate the effectiveness of the proposed CMG-TTS architecture and provide useful insights for future research in speech synthesis.
The rest of the paper is organized as follows: Section 2 reviews the general structure of diffusion models, and Section 3 presents the CMG-TTS model. Section 4 describes the training and testing processes of the CMG-TTS, and Section 5 discusses the results. Finally, conclusions are given in Section 6.

2. Diffusion Model

The diffusion model consists of a diffusion process followed by a denoising process. The diffusion process is a forward process in which the data are perturbed by small amounts of random noise until they are completely corrupted after T steps. The denoising process is the inverse process, in which the corrupted data are recovered by learning a denoising function that gradually removes the added noise. Diffusion models usually require hundreds or even thousands of denoising steps to achieve the expected results; the detailed process of the diffusion model is shown in Figure 1.
Diffusion process: In the diffusion step, a small amount of random Gaussian noise is added to the initial data x_0 step by step until the fully corrupted data x_T are obtained. As shown in Equations (1) and (2), the diffusion process is based on a predefined variance schedule β_t, with independently set variances β_{1:T} at each step.
q(x_{1:T} \mid x_0) = \prod_{t \ge 1} q(x_t \mid x_{t-1})      (1)

q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)      (2)
where q(x_{1:T} | x_0) denotes the diffusion process from x_0 to x_T, and N represents the normal distribution.
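As a concrete illustration of Equations (1) and (2), the closed-form property of the Gaussian chain allows x_t to be sampled directly from x_0. The following sketch uses a linear variance schedule and chain length that are illustrative assumptions, not the settings used in the paper.

```python
import torch

# Sketch of the forward (diffusion) process in Equations (1)-(2), using the
# standard closed-form sampling
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
# The linear beta schedule and T = 1000 are illustrative choices.

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # predefined variance schedule beta_1..beta_T
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) for a single diffusion step index t (0-based)."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps

x0 = torch.randn(80, 200)   # e.g., an 80-bin mel-spectrogram with 200 frames
x_t = q_sample(x0, t=500)   # heavily noised sample halfway through the chain
```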
Denoising process: In the denoising process, x_{T−1:0} is obtained by iteratively denoising from x_T until the recovered data x_0 are reached; the process is modeled with parameters θ. Equations (3) and (4) express the denoising process as
p_\theta(x_{0:T}) = p(x_T) \prod_{t \ge 1} p_\theta(x_{t-1} \mid x_t)      (3)

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)      (4)
where p_θ(x_{0:T}) denotes the stepwise elimination of Gaussian noise from the diffusion sample x_T to obtain x_0, and μ_θ(x_t, t) and σ_t^2 denote the mean and variance of each denoising step, respectively.
In general, a Gaussian function is used to model the denoising distribution p_θ(x_{t−1} | x_t). The diffusion model can be optimized by maximizing the likelihood p_θ(x_0) through the evidence lower bound (ELBO ≤ log p_θ(x_0)). Driven by the ELBO, the diffusion model compels the parameterized denoising distribution p_θ(x_{t−1} | x_t) to converge towards the actual denoising distribution q(x_{t−1} | x_t). Mathematically, the ELBO is expressed as follows:
\mathrm{ELBO} = -\sum_{t \ge 1} \mathbb{E}_{q(x_t)}\left[ D_{\mathrm{KL}}\left( q(x_{t-1} \mid x_t) \,\|\, p_\theta(x_{t-1} \mid x_t) \right) \right] + C      (5)
where D_KL denotes the Kullback–Leibler (KL) divergence, and C is a constant term that does not depend on θ.
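For reference, one reverse step of Equation (4) amounts to sampling from a Gaussian whose mean is predicted by a network. The sketch below uses a dummy mean predictor and an arbitrary σ_t; in the CMG-TTS this Gaussian step is replaced by a GAN-based generator, so the snippet only illustrates the classical DDPM formulation.

```python
import torch

# Sketch of one reverse (denoising) step from Equation (4):
# x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I). mu_theta is a dummy stand-in.

def p_sample(mu_theta, x_t: torch.Tensor, t: int, sigma_t: float) -> torch.Tensor:
    mean = mu_theta(x_t, t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + sigma_t * noise

mu_theta = lambda x, t: 0.99 * x      # placeholder mean predictor, not a trained model
x_t = torch.randn(80, 200)
x_prev = p_sample(mu_theta, x_t, t=10, sigma_t=0.1)
```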

3. The CMG-TTS Model

Building on a GAN-based diffusion model, the number of denoising steps can be further reduced by using an active shallow diffusion mechanism. This section describes the design motivation and the components of the CMG-TTS model, as well as the training loss.

3.1. Motivation

Diffusion models show powerful modeling capabilities for signal processing; however, the iterative application of Gaussian denoising steps makes real-time processing difficult [20]. A number of studies have focused on the synthesis speed of diffusion models, and one effective approach is to use a GAN instead of a Gaussian function to model the denoising distribution [21,23]. DiffGAN-TTS is such a model, achieving fast and high-quality audio synthesis. Its active shallow diffusion mechanism can be taken further by introducing a two-stage cascaded training structure, in which the basic acoustic model in the first stage provides strong a priori knowledge for the diffusion model in the second stage. However, DiffGAN-TTS uses the classic FastSpeech2 as its basic acoustic model, which suffers from hard alignment between phonemes and the mel-spectrogram.
In this work, we build on the DiffGAN-TTS with an active shallow diffusion structure to optimize the quality of the synthesized audio; the training process consists of two stages. The basic acoustic model in the first stage provides strong a priori knowledge for the diffusion model in the second stage. For the basic acoustic model, we adopt a linguistic encoder with pitch and energy predictors based on a mixture alignment mechanism, employing soft alignment at the phoneme level and hard alignment at the word level. Meanwhile, the mel-spectrogram reconstruction capability is improved by introducing a post-net convolutional network. Figure 2 shows the structure of the CMG-TTS scheme, which contains a linguistic encoder, a transformer decoder, a post-net, a discriminator, and a diffusion decoder. In the first stage, the basic acoustic model of the CMG-TTS is trained to generate a coarse mel-spectrogram, as shown in the dashed box in Figure 2a; the coarse mel-spectrogram is then fed to the diffusion decoder to provide strong prior knowledge for training the denoising model in the second stage, so that the number of denoising steps T can be further reduced.

3.2. The Basic Model Architecture

In this work, we introduce an active shallow diffusion mechanism that divides the model training into two stages. Specifically, we first train a basic acoustic model to provide strong a priori knowledge for training the denoising model in the second stage, which further reduces the number of denoising steps, T. Figure 3 shows the structure of the basic acoustic model of the CMG-TTS, which includes a linguistic encoder, a transformer decoder, and a post-net. The linguistic encoder processes the phoneme sequences into phoneme hidden sequences, and then the decoder converts the phoneme hidden sequences into a mel-spectrogram. The post-net further optimizes the mel-spectrogram and enhances the reconstruction capability.
Linguistic Encoder and Transformer Decoder: Figure 3a shows the structure of the linguistic encoder, in which “LR” denotes the length regulator, “WP” denotes word-level pooling, and the sinusoidal symbol denotes relative position encoding [25]. The linguistic encoder contains a phoneme encoder, a pitch predictor, an energy predictor, a word encoder, a duration predictor, and a word-phoneme attention model. The phoneme encoder, word encoder, and transformer decoder share a similar model structure based on the feed-forward transformer (FFT). Figure 3b shows the structure of the FFT block, which contains multi-head self-attention, dropout, and layer normalization (LN). A similar structure is applied to the pitch, energy, and duration predictors: a 2-layer 1D convolutional network with ReLU activation, layer normalization, and dropout, whose hidden states are projected into the output sequence by an extra linear layer (see the sketch below).
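A minimal sketch of this shared predictor structure is given below. The channel width, kernel size, and dropout rate are illustrative assumptions; only the overall layout (two 1D convolutions with ReLU, layer normalization, and dropout, plus a linear projection) follows the description above.

```python
import torch
import torch.nn as nn

# Sketch of the shared pitch/energy/duration predictor structure. Hidden size,
# kernel size, and dropout rate are assumed values for illustration.

class VariancePredictor(nn.Module):
    def __init__(self, hidden: int = 256, kernel_size: int = 3, dropout: float = 0.5):
        super().__init__()
        padding = (kernel_size - 1) // 2
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size, padding=padding)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size, padding=padding)
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden) phoneme hidden sequence
        h = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        h = self.dropout(self.ln1(h))
        h = torch.relu(self.conv2(h.transpose(1, 2))).transpose(1, 2)
        h = self.dropout(self.ln2(h))
        return self.proj(h).squeeze(-1)   # (batch, time) predicted pitch/energy/duration

x = torch.randn(2, 100, 256)
print(VariancePredictor()(x).shape)       # torch.Size([2, 100])
```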
Post-net: Figure 3c depicts a schematic diagram of the post-net architecture, which consists of a 5-layer 1D convolutional network. Each layer contains 512 convolutional kernels of size 5 × 1. After the convolution in each layer, batch normalization (BN) is applied to standardize the features across the mini-batch, accelerating training and reducing the risk of overfitting. This is followed by a dropout operation, which randomly drops a fraction of the units and further improves the model’s generalization ability, and a Tanh activation function, which introduces non-linearity so that the network can learn complex patterns in the data. Finally, the mel-spectrogram output is obtained from the last layer through a linear mapping that projects the learned high-dimensional features onto the mel-spectrogram space.
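The following sketch mirrors this description (five 1D convolutional layers with 512 channels and kernel size 5, each followed by BN, Tanh, and dropout, plus a final linear projection). The dropout rate and the residual connection around the post-net are assumptions added for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the post-net: 5 x (Conv1d(512, k=5) -> BatchNorm -> Tanh -> Dropout)
# followed by a linear projection back to 80 mel bins. The residual add and the
# dropout rate are assumptions, not stated in the paper.

class PostNet(nn.Module):
    def __init__(self, n_mels: int = 80, channels: int = 512, dropout: float = 0.5):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(5):
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                nn.BatchNorm1d(channels),
                nn.Tanh(),
                nn.Dropout(dropout),
            ]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.proj = nn.Linear(channels, n_mels)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) coarse mel-spectrogram from the decoder
        h = self.convs(mel.transpose(1, 2)).transpose(1, 2)
        return mel + self.proj(h)         # assumed residual refinement of the mel

print(PostNet()(torch.randn(2, 200, 80)).shape)   # torch.Size([2, 200, 80])
```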

3.3. Diffusion Decoder and Discriminator

In the CMG-TTS, a diffusion decoder models the denoising distribution using a conditional GAN. Model synthesis is accelerated by increasing the denoising step size and thereby reducing the number of denoising steps. Figure 4 shows the structure of the diffusion decoder. The underlying structure contains 20 non-causal residual blocks with a hidden size of 256. The diffusion decoder receives the output of the basic acoustic model, the diffusion step encoding t, the noisy mel-spectrogram x_t, and the speaker embedding s. The final output of the diffusion decoder is computed by passing the input sequentially through the residual blocks and aggregating their skip connections, alternating between convolution layers and ReLU activations.
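A single non-causal residual block of this kind can be sketched as follows. The kernel size, the dilation, and the way the step/speaker conditioning is injected are illustrative assumptions; only the convolution-plus-ReLU layout with residual and skip outputs follows the description above.

```python
import torch
import torch.nn as nn

# Sketch of one non-causal residual block with a skip connection, conditioned
# on a summed step/speaker embedding. Sizes and conditioning are assumptions.

class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 256, dilation: int = 1):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.cond = nn.Conv1d(channels, channels, kernel_size=1)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        # x: noisy mel features (B, C, T); cond: step-t/speaker-s conditioning (B, C, T)
        h = torch.relu(self.conv(x) + self.cond(cond))
        return x + self.res(h), self.skip(h)   # residual output and skip output

block = ResidualBlock()
y, skip = block(torch.randn(2, 256, 200), torch.randn(2, 256, 200))
```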
In each denoising step, a discriminator is applied to measure the divergence between the actual denoising distribution q(x_{t−1} | x_t) and the predicted denoising distribution p_θ(x_{t−1} | x_t); it is trained with the least-squares GAN (LS-GAN) loss [26]. With the joint conditional and unconditional loss (JCU loss) [27], the accuracy of the mel-spectrogram and waveform mapping can be further improved. The discriminator is a CNN whose Conv1D block contains a 3-layer 1D convolutional network and a LeakyReLU activation function, as shown in Figure 2b. The conditional and unconditional blocks share the same structure and consist of a 2-layer 1D convolutional network. The discriminator is modeled as D_φ(x_{t−1}, x_t, t, s), taking as input the noisy data x_t, the predicted mel-spectrogram x_{t−1}, the diffusion step t, and the speaker s.
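The JCU discriminator layout described above can be sketched as below: a shared Conv1D stack with LeakyReLU followed by unconditional and conditional branches. Channel widths, kernel sizes, and the way t and s are injected are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of a JCU-style discriminator: shared 3-layer Conv1D block with
# LeakyReLU, then 2-layer unconditional and conditional output branches.
# All sizes and the conditioning mechanism are illustrative assumptions.

class JCUDiscriminator(nn.Module):
    def __init__(self, n_mels: int = 80, channels: int = 128, cond_dim: int = 256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv1d(2 * n_mels, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.uncond = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, 3, padding=1),
        )
        self.cond = nn.Sequential(
            nn.Conv1d(channels + cond_dim, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, 3, padding=1),
        )

    def forward(self, x_prev, x_t, ts_embed):
        # x_prev, x_t: (B, n_mels, T); ts_embed: (B, cond_dim) step + speaker embedding
        h = self.shared(torch.cat([x_prev, x_t], dim=1))
        c = ts_embed.unsqueeze(-1).expand(-1, -1, h.size(-1))
        return self.uncond(h), self.cond(torch.cat([h, c], dim=1))

d = JCUDiscriminator()
u_out, c_out = d(torch.randn(2, 80, 200), torch.randn(2, 80, 200), torch.randn(2, 256))
```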

3.4. Active Shallow Diffusion Mechanism

The loss function of many speech synthesis models is based on a single mean squared error (MSE) or mean absolute error (MAE) term; however, this leads to over-smoothing problems due to an incorrect uni-modal assumption about the data distribution [23]. To improve synthesis performance, an active shallow diffusion mechanism is introduced in the CMG-TTS, in which the basic acoustic model with parameters φ is defined as G_φ^{basic}(y, s) and trained with the following minimization scheme:
\min_{\varphi} \sum_{t \ge 0} \mathbb{E}_{q(x_t)}\left[ \mathrm{Div}\left( q_{\mathrm{diff}}^{t}\left(G_{\varphi}^{\mathrm{basic}}(y, s)\right),\ q_{\mathrm{diff}}^{t}(x_0) \right) \right]      (6)
where Div(·) measures the divergence between the predicted and actual values, q_diff^t(·) denotes the diffusion sampling function at step t, E_{q(x_t)} denotes the expectation with respect to the diffused samples, and x_t = q_diff^t(x_0) means that the diffusion sampling function at step t is applied to x_0 to obtain the diffusion sample x_t.
Figure 5 shows the two-stage cascaded training scheme. In this structure, the noisy samples generated by the basic acoustic model are trained to be as close as possible to the diffused samples of the real data by continuously reducing the divergence between them. In the second stage, the pre-trained weights of the basic acoustic model are first copied to initialize the corresponding weights, which are then frozen. The diffusion decoder then receives the coarse mel-spectrogram x_0^* and continues the training of the diffusion sampling and denoising processes. At inference time, the basic acoustic model produces the coarse mel-spectrogram x_0^*, and the diffusion sample x_1^* is obtained through one diffusion step. The diffusion decoder takes the diffusion sample x_1^* as a priori knowledge and obtains the final output x_0 through one denoising step.
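At inference time, this amounts to one forward diffusion step applied to the coarse prediction, followed by one denoising step. The sketch below uses dummy stand-ins for the basic acoustic model and the diffusion decoder and an illustrative β_1; it only shows the control flow of the single-step path.

```python
import torch

# Sketch of single-step inference under the active shallow diffusion mechanism:
# coarse mel x0_star -> one diffusion step -> x1_star -> one denoising step -> x0.
# basic_model, diffusion_decoder, and beta_1 are placeholders for illustration.

beta_1 = 0.05
basic_model = lambda phonemes, spk: torch.randn(1, 80, 200)   # coarse mel x0*
diffusion_decoder = lambda x1, t, spk: x1                     # one-step denoiser stand-in

def infer(phonemes, spk):
    x0_star = basic_model(phonemes, spk)
    eps = torch.randn_like(x0_star)
    x1_star = (1 - beta_1) ** 0.5 * x0_star + beta_1 ** 0.5 * eps   # one diffusion step, Eq. (2)
    return diffusion_decoder(x1_star, torch.tensor([1]), spk)       # one denoising step -> x0

mel = infer(phonemes=None, spk=None)
```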

3.5. Training Loss

The CMG-TTS is trained with a generator loss and a discriminator loss. The generator loss consists of the feature matching loss L_fm [28], the acoustic reconstruction loss L_recon, and the denoising convergence loss L_adv, defined in Equations (7)–(9), respectively.
\mathcal{L}_{fm} = \mathbb{E}_{q(x_t)}\left[ \sum_{i=1}^{N} \left\| D_{\varphi}^{i}(x_{t-1}, x_t, t, s) - D_{\varphi}^{i}(x'_{t-1}, x_t, t, s) \right\|_1 \right]      (7)

\mathcal{L}_{recon} = \mathcal{L}_{mel} + \mathcal{L}_{postnet} + \lambda_d \mathcal{L}_{duration} + \lambda_p \mathcal{L}_{pitch} + \lambda_e \mathcal{L}_{energy} + \mathcal{L}_{helper}      (8)

\mathcal{L}_{adv} = \sum_{t \ge 1} \mathbb{E}_{q(x_t)} \mathbb{E}_{p_\theta(x_{t-1} \mid x_t)}\left[ \left( D_{\varphi}(x_{t-1}, x_t, t, s) - 1 \right)^2 \right]      (9)
where N denotes the number of hidden layers in the discriminator; λ_d, λ_p, and λ_e denote the loss weights, all set to 0.1. L_mel and L_postnet are based on the MAE loss; L_duration, L_pitch, and L_energy are based on the MSE loss; and L_helper is based on the guided attention loss [29]. L_fm is a similarity metric computed in the discriminator, which distinguishes real from generated data, and is obtained by accumulating the l_1 distances between their hidden features. L_recon is the basic reconstruction loss, and L_adv measures the convergence between the actual denoising distribution q(x_{t−1} | x_t) and the denoising model distribution p_θ(x_{t−1} | x_t). The generator is trained by minimizing L_G, such that
\mathcal{L}_{G} = \mathcal{L}_{fm} + \mathcal{L}_{recon} + \mathcal{L}_{adv}      (10)
In the second stage of training, all terms of the basic reconstruction loss L_recon except L_mel are set to 0 in order to highlight the contribution of the diffusion model. The discriminator is optimized by minimizing L_D, such that
\mathcal{L}_{D} = \sum_{t \ge 1} \mathbb{E}_{q(x_t)}\left[ \mathbb{E}_{q(x_{t-1} \mid x_t)}\left[ \left( D_{\varphi}(x_{t-1}, x_t, t, s) - 1 \right)^2 \right] + \mathbb{E}_{p_\theta(x_{t-1} \mid x_t)}\left[ D_{\varphi}(x_{t-1}, x_t, t, s)^2 \right] \right]      (11)
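The adversarial parts of Equations (9) and (11) follow the standard LS-GAN objective, which can be sketched as follows. Here d_real and d_fake stand for the discriminator outputs on real and generated x_{t−1}, respectively; the tensors are random placeholders.

```python
import torch

# LS-GAN style terms corresponding to Equations (9) and (11): the discriminator
# pushes real pairs toward 1 and generated pairs toward 0, while the generator
# pushes its outputs toward 1. Inputs are placeholder discriminator outputs.

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    return ((d_fake - 1.0) ** 2).mean()

d_real, d_fake = torch.rand(4, 1, 200), torch.rand(4, 1, 200)
L_D = discriminator_loss(d_real, d_fake)
L_adv = generator_adv_loss(d_fake)
```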

4. Experiments

In this section, the model configuration and the datasets for testing are introduced to verify the validity of the proposed model.

4.1. Experimental Setup

Datasets: The performance of the CMG-TTS model was evaluated on two benchmark datasets, AISHELL3 and LJSpeech. The AISHELL3 dataset contains 88,035 audio segments from 218 native Mandarin speakers, with a total duration of 85 h; it covers a diverse range of Mandarin speech characteristics, making it suitable for evaluating the model on a tonal language. The LJSpeech dataset contains 13,100 English audio clips from a single speaker, with a total duration of approximately 24 h; it is widely used for evaluating speech models on a non-tonal language, especially for English tasks. For model training and evaluation, a stratified random sampling approach was employed: 87,011 samples were selected from AISHELL3 and 12,076 samples from LJSpeech to form the training set, ensuring that it was representative of the overall data distribution of each dataset. For validation and testing, 512 samples were randomly selected from each dataset for the validation set and another 512 for the test set, enabling a fair assessment of the model's generalization ability. To prepare the data for training, the text sequences were first converted into phoneme sequences using pypinyin for Mandarin Chinese, which converts Chinese characters into their pinyin phonetic notation, and g2p_en for English, which maps English graphemes to phonemes. Mel-spectrograms were then generated from the original audio waveforms. The sampling rate was set to 22,050 Hz, a rate commonly used in speech processing that provides sufficient frequency resolution, and the mel-spectrograms were computed with a frame length of 1024 samples, a hop length of 256 samples, and 80 mel-frequency bins. This configuration captures the short-term spectral characteristics of the audio signals and provides a suitable input representation for the CMG-TTS model.
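The preprocessing described above can be sketched as follows, using the same sampling rate, frame length, hop length, and mel-bin count. The specific pypinyin/g2p_en calls and the log-compression step are assumptions; the paper only states that these libraries were used.

```python
import librosa
import numpy as np
from pypinyin import lazy_pinyin, Style
from g2p_en import G2p

# Sketch of the data preparation: grapheme-to-phoneme conversion and
# mel-spectrogram extraction with sr=22050, frame=1024, hop=256, 80 mel bins.

def chinese_to_phonemes(text: str) -> list[str]:
    return lazy_pinyin(text, style=Style.TONE3)     # pinyin with tone numbers

def english_to_phonemes(text: str) -> list[str]:
    return G2p()(text)                              # ARPAbet-style phoneme sequence

def wav_to_mel(path: str) -> np.ndarray:
    wav, sr = librosa.load(path, sr=22050)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                         win_length=1024, hop_length=256, n_mels=80)
    return np.log(np.clip(mel, 1e-5, None))         # (80, frames) log-mel, assumed compression
```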
The training setup: The CMG-TTS model was trained on a single NVIDIA 3060 GPU with a batch size of 16 on both the AISHELL3 and LJSpeech datasets. A gradually decaying learning rate was used, with the initial learning rates set to 10^−3 and 2 × 10^−3 for the generator and discriminator, respectively. The Adam optimizer was used to train the two-stage cascaded scheme, with β1 = 0.9, β2 = 0.98, and ε = 10^−9 in the first training stage and β1 = 0.5, β2 = 0.9, and ε = 10^−9 in the second stage. For the AISHELL3 dataset, the basic acoustic model in the first stage converged after 360 k steps, and the diffusion model in the second stage converged after 800 k steps. For the LJSpeech dataset, the first and second stages were trained for 300 k and 700 k steps, respectively, before convergence was reached. In all experiments, the final audio samples were obtained with the HiFi-GAN vocoder. The experiments were carried out with CUDA 11.6, Python 3.8, and PyTorch 1.8.0+cu111.
The evaluation method: Model performance was measured using the mean opinion score (MOS) [30], with scores ranging from 1 to 5 and reported with a 95% confidence interval. The quality of the synthesized audio was further measured with objective evaluation metrics such as mel-cepstral distortion (MCD) [31], F0 root mean squared error (F0 RMSE), perceptual evaluation of speech quality (PESQ) [32], short-time objective intelligibility (STOI) [33], segmental signal-to-noise ratio (SegSNR), and the real time factor (RTF). The RTF reflects the inference speed on a single GPU, giving the time needed to synthesize one second of audio. For the MOS, MCD, and F0 RMSE evaluations, the audio sample rate was set to 22,050 Hz; for PESQ, STOI, and SegSNR, the sampling rate was set to 16,000 Hz to satisfy the computation requirements.
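The RTF reported below can be measured as the wall-clock synthesis time divided by the duration of the generated audio. The sketch below uses a placeholder synthesizer; only the formula reflects the metric described above.

```python
import time

# Sketch of RTF measurement: synthesis time / duration of generated audio.
# synthesize() is a placeholder for a full TTS pipeline, not the CMG-TTS API.

def real_time_factor(synthesize, text: str, sample_rate: int = 22050) -> float:
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)

# Example with a dummy synthesizer that outputs one second of silence.
rtf = real_time_factor(lambda text: [0.0] * 22050, "hello")
print(rtf)
```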

4.2. Experimental Results

The FastSpeech2, PortaSpeech, DiffSpeech (T = 64), DiffGAN-TTS (T = 4), and DiffGAN-TTS (two-stage) models were tested for comparison with the proposed scheme; all of these models were trained and evaluated using publicly available code, with the hyperparameters tuned by their original authors. DiffSpeech denotes the DiffSinger applied to the TTS domain using the shallow diffusion mechanism; in the experiments, its number of denoising steps was set to 64. DiffGAN-TTS (T = 4) indicates training with four denoising steps, and DiffGAN-TTS (two-stage) denotes the two-stage training scheme. The results on the AISHELL3 and LJSpeech datasets are shown in Table 1 and Table 2.
Audio Quality: The experimental results on both datasets show that the CMG-TTS achieved the highest audio quality with only one denoising step. The CMG-TTS obtained MOS scores of 4.03 and 4.08 on the AISHELL3 and LJSpeech datasets, respectively, outperforming the other TTS models in the tests. Meanwhile, the CMG-TTS achieved satisfactory performance in the objective metrics of MCD, F0 RMSE, PESQ, STOI, and SegSNR.
RTF: The CMG-TTS showed efficient sampling, synthesizing a high-fidelity mel-spectrogram in only one denoising step. The RTF values of the different TTS models were evaluated on the two datasets, and the CMG-TTS effectively reduced the inference time compared with the other diffusion-based TTS models.
Visualizations: The mel-spectrograms of the CMG-TTS and other TTS models were compared on the AISHELL3 dataset under the same text sequence conditions as shown in Figure 6. The input phoneme sequence was “w uo3 m en5 d uei4 y i2 zh e4 zh ong3 sh iii4 q ing5 b u2 h uei4 t ai4 z ai4 y i4”. The CMG-TTS exhibited competitive performance in the low- and medium-frequency regions, while maintaining high quality in the high-frequency regions. The attention convergence graph during the training is presented in Figure 6h, and it can be seen that the attention alignment was clear and smooth, which indicates that the CMG-TTS achieved satisfactory performance in aligning the text sequences with the mel-spectrogram.

4.3. Ablation Tests

Ablation tests were performed on the CMG-TTS to validate the effectiveness of its individual structures, including the post-net and the active shallow diffusion mechanism. Training the CMG-TTS with only the active shallow diffusion mechanism removed resulted in unmanageable errors, so this configuration was not evaluated on its own. Table 3 shows the test results on the AISHELL3 dataset and demonstrates that removing the post-net from the CMG-TTS caused a significant degradation of the audio quality. Meanwhile, removing both the post-net and the active shallow diffusion structure led to a small reduction in audio quality. These ablation studies validate the CMG-TTS structure.

5. Discussion

In previous studies, end-to-end speech synthesis has been an important task in the TTS field, and fully end-to-end models have received much attention. For example, FastSpeech2s [14] adopts an auxiliary mel-spectrogram decoder and adversarial training to learn text representations, and a special spectral loss is used to mitigate the length mismatch between the target and generated speech. In a different effort, EATS [34] uses adversarial training and differentiable alignment, employing a soft dynamic time warping loss computed by dynamic programming to mitigate the length mismatch between the generated and target speech. Another model, EFTS-Wav [35], also uses an auxiliary mel-spectrogram decoder for alignment learning; on this basis, it introduces an index mapping vector in the monotonic alignment model, leading to a novel monotonic alignment mechanism. In a parallel effort, Wave-Tacotron [36] combines Tacotron2 with normalizing flows to simplify maximizing the likelihood of the training data, and VITS [37] learns text alignment during training, maximizing the data likelihood while improving expressiveness by leveraging variational inference and normalizing flows in an adversarial structure. JETS [38] unites FastSpeech2 and the HiFi-GAN vocoder and designs a novel token duration predictor that does not require an additional external speech-to-text alignment model. The proposed CMG-TTS achieved better results in the evaluations; however, its structure shows that it cannot be considered an end-to-end model, since a vocoder is still required to synthesize the final audio sample. Therefore, the CMG-TTS must be trained separately from the vocoder, which introduces additional training costs and error accumulation. In contrast, end-to-end models can generate audio in parallel in a short period of time and can reduce error accumulation without extra overhead. It therefore remains worthwhile to study a fully end-to-end training scheme that does not compromise the performance of the CMG-TTS.

6. Conclusions

In this work, a two-stage processing scheme, the CMG-TTS, was introduced as an efficient speech synthesis model with an active shallow diffusion mechanism. In the first stage, the CMG-TTS was trained to learn diffusion samples from real audio samples by training the basic acoustic model, in which a linguistic encoder with a mixture alignment structure was used together with pitch and energy predictors. A post-net with a 5-layer 1D convolutional network was then trained to optimize the reconstruction performance of the mel-spectrogram. In the second stage, a diffusion decoder was adopted to remove the noise from the coarse samples and obtain a high-fidelity mel-spectrogram. Finally, a HiFi-GAN vocoder was applied to convert the mel-spectrogram into the final audio output. The performance of the proposed CMG-TTS was evaluated on both the AISHELL3 and LJSpeech datasets. Specifically, the subjective MOS metric was used for the performance evaluation, in addition to objective metrics such as MCD, F0 RMSE, PESQ, STOI, and SegSNR. The experimental results show that the CMG-TTS is able to synthesize high-fidelity audio samples with only one denoising step. Ablation tests were also carried out, and each of the structures in the CMG-TTS was shown to be effective.

Author Contributions

Software, Y.D.; Validation, R.H.; Writing—original draft, Y.D. and N.W.; Formal analysis, Y.L.; Conceptualization, J.D.; Investigation, N.W.; Resources, J.D.; Writing—review and editing, R.H.; Supervision, N.W.; Methodology, J.D.; Project administration, J.D.; Funding acquisition, J.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (52161042), the Guangxi Science and Technology Major Program (2024AA29055), and the 100 Scholar Plan of the Guangxi Zhuang Autonomous Region of China (Grant No. 2018).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available upon request.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Deng, Y.; Wu, N.; Qiu, C.; Luo, Y.; Chen, Y. MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model. IEEE Access 2023, 11, 57674–57682. [Google Scholar] [CrossRef]
  2. Deng, Y.; Wu, N.; Qiu, C.; Chen, Y.; Gao, X. Research on Speech Synthesis Based on Mixture Alignment Mechanism. Sensors 2023, 23, 7283. [Google Scholar] [CrossRef]
  3. Chen, J.; Song, X.; Peng, Z.; Zhang, B.; Pan, F.; Wu, Z. LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  4. Guo, Y.; Du, C.; Ma, Z.; Chen, X.; Yu, K. VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching. arXiv 2023, arXiv:2309.05027. [Google Scholar]
  5. Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
  6. Kong, J.; Kim, J.; Bae, J. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033. [Google Scholar]
  7. Prenger, R.; Valle, R.; Catanzaro, B. Waveglow: A flow-based generative network for speech synthesis. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3617–3621. [Google Scholar]
  8. Wang, Y.; Skerry-Ryan, R.J.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Yang, Z.; Xiao, Y.; Chen, Z.; Bengio, S.; et al. Tacotron: Towards end-to-end speech synthesis. arXiv 2017, arXiv:1703.10135. [Google Scholar]
  9. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783. [Google Scholar]
  10. Gibiansky, A.; Arik, S.; Diamos, G.; Miller, J.; Peng, K.; Ping, W.; Raiman, J.; Zhou, Y. Deep voice 2: Multi-speaker neural text-to-speech. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/c59b469d724f7919b7d35514184fdc0f-Abstract.html (accessed on 27 January 2025).
  11. Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Raiman, J.; Miller, J. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv 2017, arXiv:1710.07654. [Google Scholar]
  12. Li, N.; Liu, S.; Liu, Y.; Zhao, S.; Liu, M. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6706–6713. [Google Scholar]
  13. Ren, Y.; Ruan, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. FastSpeech: Fast, robust and controllable text to speech. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://arxiv.org/pdf/1905.09263 (accessed on 27 January 2025).
  14. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
  15. Ren, Y.; Liu, J.; Zhao, Z. PortaSpeech: Portable and high-quality generative text-to-speech. Adv. Neural Inf. Process. Syst. 2021, 34, 13963–13974. [Google Scholar]
  16. Kim, J.; Kim, S.; Kong, J.; Yoon, S. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. Adv. Neural Inf. Process. Syst. 2020, 33, 8067–8077. [Google Scholar]
  17. Miao, C.; Liang, S.; Chen, M.; Ma, J.; Wang, S.; Xiao, J. Flow-tts: A non-autoregressive network for text to speech based on flow. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–9 May 2020. [Google Scholar]
  18. Lee, S.G.; Kim, H.; Shin, C.; Tan, X.; Liu, C.; Meng, Q.; Qin, T.; Chen, W.; Yoon, S.; Liu, T.Y. Priorgrad: Improving conditional denoising diffusion models with data-dependent adaptive prior. arXiv 2021, arXiv:2106.06406. [Google Scholar]
  19. Chen, Z.; Wu, Y.; Leng, Y.; Chen, J.; Liu, H.; Tan, X.; Cui, Y.; Wang, K.; He, L. Resgrad: Residual denoising diffusion probabilistic models for text to speech. arXiv 2022, arXiv:2212.14518. [Google Scholar]
  20. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  21. Xiao, Z.; Kreis, K.; Vahdat, A. Tackling the generative learning trilemma with denoising diffusion gans. arXiv 2021, arXiv:2112.07804. [Google Scholar]
  22. Liu, J.; Li, C.; Ren, Y.; Chen, F.; Zhao, Z. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 11020–11028. [Google Scholar]
  23. Liu, S.; Su, D.; Yu, D. DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs. arXiv 2022, arXiv:2201.11972. [Google Scholar]
  24. Shi, Y.; Bu, H.; Xu, X.; Zhang, S.; Li, M. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. arXiv 2020, arXiv:2010.11567. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  26. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Smolley, S.P. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802. [Google Scholar]
  27. Yang, J.; Bae, J.-S.; Bak, T.; Kim, Y.; Cho, H.-Y. Ganspeech: Adversarial training for high-fidelity multi-speaker speech synthesis. arXiv 2021, arXiv:2106.15153. [Google Scholar]
  28. Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 1558–1566. [Google Scholar]
  29. Tachibana, H.; Uenoyama, K.; Aihara, S. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4784–4788. [Google Scholar]
  30. Chu, M.; Peng, H. Objective Measure for Estimating Mean Opinion Score of Synthesized Speech. US Patent 7,024,362, 4 April 2006. [Google Scholar]
  31. Kubichek, R. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of the IEEE Pacific Rim Conference on Communications Computers and Signal Processing, Victoria, BC, Canada, 19–21 May 1993; Volume 1, pp. 125–128. [Google Scholar]
  32. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, 7–11 May 2001, Proceedings (Cat. No. 01CH37221); IEEE: New York, NY, USA, 2001; Volume 2, pp. 749–752. [Google Scholar]
  33. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 4214–4217. [Google Scholar]
  34. Donahue, J.; Dieleman, S.; Bińkowski, M.; Elsen, E.; Simonyan, K. End-to-end adversarial text-to-speech. arXiv 2020, arXiv:2006.03575. [Google Scholar]
  35. Miao, C.; Shuang, L.; Liu, Z.; Minchuan, C.; Ma, J.; Wang, S.; Xiao, J. EfficientTTS: An efficient and high-quality text-to-speech architecture. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 7700–7709. [Google Scholar]
  36. Weiss, R.J.; Skerry-Ryan, R.J.; Battenberg, E.; Mariooryad, S.; Kingma, D.P. Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 5679–5683. [Google Scholar]
  37. Kim, J.; Kong, J.; Son, J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 5530–5540. [Google Scholar]
  38. Lim, D.; Jung, S.; Kim, E. JETS: Jointly training FastSpeech2 and HiFi-GAN for end to end text to speech. arXiv 2022, arXiv:2203.16852. [Google Scholar]
Figure 1. The directed graph for diffusion model.
Figure 2. The overall architecture for the CMG-TTS scheme. The dashed box indicates the training stage of the basic acoustic model.
Figure 3. The basic acoustic structure of the CMG-TTS.
Figure 4. The architecture for diffusion decoder.
Figure 5. The two-stage cascaded training scheme.
Figure 6. Visualization of the mel-spectrogram for different TTS models.
Table 1. Experimental results on the AISHELL3 dataset.
Method | MOS (↑) | MCD (↓) | F0 RMSE (↓) | PESQ (↑) | STOI (↑) | SegSNR (↑) | RTF (↓)
Ground Truth | 4.27 ± 0.06 | – | – | – | – | – | –
FastSpeech2 | 3.83 ± 0.07 | 17.808 | 0.724 | 1.061 | 0.146 | −8.271 | 0.096
PortaSpeech | 4.02 ± 0.07 | 17.665 | 0.719 | 1.069 | 0.158 | −8.121 | 0.115
DiffSpeech (T = 64) | 4.04 ± 0.06 | 17.721 | 0.748 | 1.059 | 0.157 | −8.191 | 0.184
DiffGAN-TTS (T = 4) | 3.93 ± 0.06 | 17.746 | 0.745 | 1.058 | 0.161 | −8.177 | 0.167
DiffGAN-TTS (two-stage) | 3.89 ± 0.07 | 17.704 | 0.783 | 1.054 | 0.155 | −8.124 | 0.144
CMG-TTS | 4.03 ± 0.07 | 17.611 | 0.743 | 1.078 | 0.165 | 7.858 | 0.121
Table 2. Experimental results on the LJSpeech dataset.
Method | MOS (↑) | MCD (↓) | F0 RMSE (↓) | PESQ (↑) | STOI (↑) | SegSNR (↑) | RTF (↓)
Ground Truth | 4.34 ± 0.07 | – | – | – | – | – | –
FastSpeech2 | 3.94 ± 0.06 | 6.973 | 0.306 | 1.062 | 0.251 | −6.049 | 0.044
PortaSpeech | 4.06 ± 0.07 | 6.694 | 0.301 | 1.074 | 0.274 | −5.781 | 0.071
DiffSpeech (T = 64) | 4.09 ± 0.05 | 6.758 | 0.303 | 1.071 | 0.271 | −5.793 | 0.126
DiffGAN-TTS (T = 4) | 4.02 ± 0.06 | 6.801 | 0.309 | 1.599 | 0.268 | −5.867 | 0.108
DiffGAN-TTS (two-stage) | 3.99 ± 0.07 | 6.737 | 0.311 | 1.601 | 0.261 | −5.691 | 0.096
CMG-TTS | 4.08 ± 0.06 | 6.671 | 0.298 | 1.667 | 0.276 | −5.661 | 0.087
Table 3. Ablation study results.
Method | MOS (↑) | MCD (↓) | F0 RMSE (↓) | PESQ (↑) | STOI (↑) | SegSNR (↑)
CMG-TTS | 4.03 ± 0.07 | 17.611 | 0.743 | 1.078 | 0.165 | 7.858
Remove post-net | 3.95 ± 0.07 | 17.707 | 0.776 | 1.063 | 0.146 | −8.136
Remove post-net and two-stage | 3.99 ± 0.06 | 17.645 | 0.756 | 1.075 | 0.168 | −8.091