Article

Using Denoising Diffusion Model for Predicting Global Style Tokens in an Expressive Text-to-Speech System

Faculty of Electrical Engineering, Automatics, Computer Science and Biomedical Engineering, AGH University of Krakow, Al. Mickiewicza 30, 30-059 Krakow, Poland
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4759; https://doi.org/10.3390/electronics14234759
Submission received: 31 October 2025 / Revised: 29 November 2025 / Accepted: 1 December 2025 / Published: 3 December 2025
(This article belongs to the Special Issue Advances in Algorithm Optimization and Computational Intelligence)

Abstract

Text-to-speech (TTS) systems based on neural networks have undergone a significant evolution, taking a step towards human-like quality and expressiveness, which is crucial for applications such as social media content creation and voice interfaces for visually impaired individuals. An entire branch of research, known as Expressive Text-to-Speech (ETTS), has emerged to address the so-called one-to-many mapping problem, which limits the naturalness of the generated output. However, most ETTS systems that apply explicit style modeling treat the prediction of prosodic features as a regressive, rather than generative, process and, consequently, do not capture prosodic diversity. We address this problem by proposing a novel technique for inference-time prediction of speaking-style features, which leverages a diffusion framework to sample from a learned space of Global Style Tokens-based embeddings that are then used to condition a neural TTS model. By incorporating the diffusion model, we can leverage its powerful modeling capabilities to learn the distribution of possible stylistic features and, during inference, sample them non-deterministically, which makes the generated speech more human-like by alleviating prosodic monotony across multiple sentences. Our system blends a regressive predictor with a diffusion-based generator to enable smooth control over the diversity of the generated speech. Through quantitative and qualitative (human-centered) experiments, we demonstrate that our system generates expressive human speech with non-deterministic high-level prosodic features.

1. Introduction

Synthesizing human speech for manually chosen textual content has a variety of potential applications, ranging from generating social media content, such as advertisements and video materials, to systems involving human–computer interaction (HCI), including personal assistants for visually impaired individuals and automated health monitoring systems. Text-to-speech (TTS) systems aim to generate human-quality speech based on a textual prompt. The process can be formulated as follows:
$$\mathrm{TTS}(l_1, \ldots, l_t) = a_1, \ldots, a_T \tag{1}$$
where $l_1, \ldots, l_t$ are the input linguistic features, e.g., characters, phonemes, or additional symbols, such as sentence breaks and part-of-speech (POS) tags; $a_1, \ldots, a_T$ is a sequence of acoustic features, such as real numbers describing a raw waveform or a two-dimensional mel-spectrogram.
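For intuition only, a minimal, hypothetical sketch of the interface implied by this formulation; the names, the 80-dimensional mel frames, and the frames-per-symbol count are illustrative assumptions, not the system described later:

```python
# Hypothetical illustration of the TTS mapping: t linguistic symbols in,
# T acoustic frames out (here, mel-spectrogram frames), with T != t in general.
from typing import List, Sequence

def tts(linguistic_features: Sequence[str]) -> List[List[float]]:
    """Stub mapping l_1..l_t (e.g., phonemes) to a_1..a_T (e.g., 80-dim mel frames).

    A real system is a trained neural network; this stub only fixes the interface
    implied by the formula. All numeric values are assumptions for illustration.
    """
    n_mels, frames_per_symbol = 80, 5
    return [[0.0] * n_mels for _ in range(len(linguistic_features) * frames_per_symbol)]

print(len(tts(["HH", "AH", "L", "OW"])))  # 20 frames for 4 phonemes
```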

1.1. State of the Art

In recent years, numerous approaches (e.g., [1]) have emerged that successfully generate speech resembling that produced by human articulatory organs. Some of the systems achieved true human quality; i.e., they produced output that is almost indistinguishable from natural speech. The systems above successfully leverage data-driven, Deep Learning-based architectures to find interrelationships between input data samples and generate speech in an end-to-end fashion. However, although neural networks enable the system to produce output without relying on curated, hand-engineered features or complex preprocessing pipelines, classical TTS systems suffer from the so-called one-to-many mapping problem.
The one-to-many mapping problem stems from an inherent property of human speech: a given text may be pronounced in an infinite number of ways, which differ in the speaker’s voice, the speaker-independent acoustic environment, and prosody, which encompasses high-level properties of speech such as rhythm. Neural network-based systems, which are trained to map input text to output acoustic features, lack access to additional parameters that govern the expected speech expression and therefore tend to average the speaking style across the dataset they are trained on. This results in monotonous, fully deterministic speech, which impedes the illusion of interacting with a real human being, an illusion that is, in turn, crucial for the comfortable reception of the generated information. To alleviate the problem, so-called Expressive TTS (ETTS) approaches have emerged, which aim to provide the model with additional style information through either supervised style labels or unsupervised style embeddings [2].
One of the most successful approaches to ETTS leverages the concept of the Global Style Tokens, introduced in [3], which consists of transforming the reference speech into a weighted sum of randomly initialized trainable vectors, which are expected to represent various aspects of the style in an unsupervised manner. The approach was observed to facilitate robust and accurate control of speaking style; however, it made the system dependent on an additional input, such as another speech sample, from which the style was supposed to be transferred. To address this problem, several works (e.g., [4]) attempted to train an additional model to predict the style embedding solely from the input text. This may be seen as a form of teacher forcing, where the model is trained to leverage an external source of information and, during inference, uses part of the predicted output as input.

1.2. Problem Formulation

Although the described approaches for predicting style embeddings directly from textual input were successful, they all suffer from the same problem: treating the prediction of style information as a regressive rather than a generative process. They do not attempt to learn the distribution of possible style features corresponding to a given textual content and instead merely provide the trained model with explicit “helper” style information to accelerate the convergence. An analogous technique was applied in FastSpeech 2 [5], where the acoustic model is provided with ground truth pitch and energy contours, which are subsequently predicted by a specialized part of the model. The approaches mentioned above improve the model’s performance by simplifying the difficult task of mapping high-dimensional tensors. In practice, this does not solve the one-to-many mapping problem but merely tackles its primary impact, which is a monotonous speaking style within each generated sentence. As a consequence, the above systems are likely to generate similar messages or parts thereof, invariably using the same style, making communication with them unnatural. Poor expressiveness in such cases does not reside in a lack of diverse prosody within a sentence, but rather in a monotonous articulation of certain linguistic structures across multiple sentences.
In this paper, we address the problem above and propose a TTS system, which models the speaking style as a GST-based numerical vector and predicts it using a mixture of a transformer-based model and a non-deterministic diffusion framework. The rationale behind the use of diffusion models is that they exhibit a powerful ability to learn complex probability distributions and generate diverse data samples through variational inference [6,7]. The baseline approaches, in turn, use simple regression, which we believe is a suboptimal way to learn a complex distribution of stylistic features.
The backbone of the system, a modified FastSpeech [1], is conditioned on a style embedding that is partially produced through a backward diffusion process, which should be treated as sampling from a modeled distribution rather than as prediction. Through quantitative and qualitative (human-centered) experiments, we demonstrate that the proposed system not only produces speech of comparable quality to that generated by other known TTS systems but also facilitates the generation of speech samples with considerable diversity in high-level prosodic properties, such as pitch and duration. To our knowledge, this is the first time a diffusion framework combined with GST-based unsupervised style modeling has been used to tackle the problem of generating non-deterministic natural human speech.
To ensure the reproducibility of our work, we publish both the source code and data on our GitHub repository at https://github.com/WiktorProsowicz/ddpm-gst-speech-synthesis (accessed on 29 November 2025).

2. Related Work

Systems that process human speech aim either to reduce the distance between humans and computers, offering an acoustic interface for human–computer interaction, or to ensure seamless communication between people, often involving technical solutions such as noise reduction. Among the various applications of speech processing in human-centered computing, we consider the following to be related to our work. Ref. [8] aimed to ensure privacy during communication in shared spaces, such as libraries, through automatic recognition of inaudible speech and source separation, enabling seamless communication in environments that previously hindered it. Ref. [9] helps control a speech synthesis system by proposing a user-friendly interface that allows for high-level control of speaking-style features for social media content creation. Ref. [10] investigated the role of the gender of a voice assistant in the reception of the spoken information. Ref. [11] investigated the potential of extracting a style representation from a speech recording to guide a model generating talking heads with lip movements and facial expressions transferred from a reference video.
The evolution of classical TTS systems was driven mainly by the struggle to achieve output that would be indistinguishable from human speech. It also involved striving to create systems capable of producing output quickly and robustly. A common taxonomy divides the available models into auto-regressive ones, including [12,13,14,15,16], and parallel ones, such as [1,5,17]. The Tacotron 2 model consists of a combination of convolutional and recurrent neural networks and is trained to predict a single spectrogram frame given the previous ones. It uses a location-sensitive attention mechanism to automatically learn the alignment between the encoder and decoder outputs. The FastSpeech and FastSpeech 2 models produce high-quality, robust speech at high inference speed. They leverage a transformer-based architecture and are therefore easily parallelizable. Instead of an attention mechanism, they use explicit duration prediction to fill the dimensionality gap between the text and sound representations.
The Transformer architecture, proposed in [18], has been successfully applied across numerous domains due to its powerful sequence modeling capabilities, which can be leveraged wherever data can be represented as sequences with complex multi-scale relationships. For example, it exhibits good performance in image processing (e.g., [19,20]), where it successfully finds relationships between image patches of varying proximity. Moreover, the architecture performs well in tasks derived from sequence-to-sequence translation; for example, in [21,22], it was used to generate sequences of quantized representations of waveform and style features, respectively. A branch of research is devoted to improving the performance of the classical architecture; for example, refs. [23,24] strive to reduce the quadratic complexity of its attention mechanism by proposing linearized equivalents.
One of the first attempts to model the speaking style as a numerical vector is [25], which introduced the concept of a reference encoder that passes the reference spectrogram through an information bottleneck, enabling unsupervised modeling of the style. This approach was then incorporated, in modified form, into numerous works, such as [26,27], which attempted to compress the reference audio into a latent representation that was subsequently used to condition the generation of the spectrogram. Learning the distribution of stylistic features in a latent space was also leveraged in style-conversion tasks, proposed in works, such as [28], which incorporated speaker-adaptive normalization layers, or [21], which transferred style and voice from a short audio recording using discrete representations from a neural codec model [29].
The method proposed in [3] builds upon the reference encoder and introduces Global Style Tokens, which may be considered an intermediate state between using a clean style embedding and Vector Quantization techniques. This became a standalone area of research in ETTS, and the concept of GST was leveraged in various works, such as [30,31,32], which involve, respectively, the use of semi-supervised emotion labels, graph neural networks-based conversational speech synthesis, and modeling hierarchical prosodic features.
The first successful attempt to predict GST solely from textual input, as mentioned in the literature, is [4], which proposes two methods for reconstructing the style embedding: predicting the vector of weights, which is subsequently multiplied by the pre-trained tokens, or predicting the style embedding vector directly. Ref. [33] trains the backbone model to use discrete word-level style variation (WSV) vectors and predicts them using both enhanced text representations from the model encoder and additional BERT [34] word representations. This work was followed by [35], which incorporates WSVs into a parallel acoustic model.
Modern generative algorithms have been successfully applied in expressive text-to-speech systems. In [36], the nature of the used diffusion algorithm influences the diversity of the low- and high-level properties of speech. The authors mention the possibility of tuning an additional temperature parameter, which modifies the variance in the backward diffusion and therefore increases the variety of pitch tracks in the generated speech. Ref. [37], which leverages a parallel flows-based framework, models the prediction of phoneme durations as a generative process by training an additional flow-based model, which increases the diversity of the produced output. In [38], the non-deterministic generation is ensured by using a Variational Autoencoder for end-to-end synthesis of waveform, conditioned on phoneme representations, upsampled using a stochastic duration predictor. Ref. [27] first encodes the reference mel-spectrogram into a fixed-size style vector and predicts it using a diffusion framework, which is capable of generating diverse style representations. In our work, we combine the powerful modeling capabilities of diffusion models with the boundedness of the style distribution ensured by the semi-clustering algorithm of the GST component, which helps control the amount of style-related information in the predicted style embedding.
It should be noted that Denoising Diffusion Probabilistic Models or their derivatives were successfully applied in numerous areas to model spatio-temporal relationships. To name a few, they were incorporated in image generation [39], image denoising [40], video generation [41], and waveform generation [42]. The powerful learning capabilities of diffusion models make them a universal approach to generative AI and place them between other standard techniques, such as VAEs [7] or GANs [43].

3. Materials and Methods

In this section, we introduce the system’s architecture. A simplified diagram showing the flow of the data is presented in Figure 1. We build upon an existing acoustic model, FastSpeech, which is a parallel feed-forward model. During training, we condition the generation of output spectrograms by adding a style embedding vector derived from the reference spectrogram to the backbone model’s encoder hidden states. During inference, the style embedding is created as a combination of deterministic and non-deterministic embeddings, generated by additional modules trained on the reference data and distilled from the acoustic model. The overall architecture is shown in Figure 2. We describe the system’s modules in detail in the following subsections.

3.1. GST FastSpeech

Both the encoder and decoder parts of the FastSpeech backbone used in our system consist of so-called Feed-Forward Transformer (FFT) blocks, which were inspired by [18] and introduce several small modifications, described in detail in [1]. Each block consists of a Multi-Head Self-Attention layer and a convolutional upscaling layer, each followed by layer normalization [44] and residual connections [45]. We also slightly modify the architecture of the modules by (a) leaving only one intermediate ReLU activation between the 1D convolutional layers and (b) applying dropout to the attention matrix in the Multi-Head Attention layer. Denote the hidden states of the encoder as $H_{\text{PHO}} = [h_1, h_2, \ldots, h_n]$, where $n$ is the length of the input phoneme sequence.
Unlike the original work [3], we do not build our reference encoder upon convolutional layers, but instead use a linear projection pre-net, which extends the dimensionality of the reference spectrogram to the hidden size of the backbone model, followed by a stack of FFT blocks. The processed sequence of hidden states is then passed through a Multi-Head Attention module, where a trainable, randomly initialized vector is chosen as the query, referred to as pooling attention from this point forward.
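As an illustration of the pooling attention described above, the following PyTorch sketch (not the authors’ implementation; the hidden size and head count are assumed) reduces a sequence of hidden states to a single vector using a trainable, randomly initialized query:

```python
# Minimal sketch of "pooling attention": a single trainable query attends over a
# sequence of hidden states and pools it into one vector. Sizes are assumptions.
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    def __init__(self, hidden_size: int = 256, num_heads: int = 4):
        super().__init__()
        # Randomly initialized, trainable query (one "time step").
        self.query = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.mha = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        query = self.query.expand(hidden_states.size(0), -1, -1)
        pooled, _ = self.mha(query, hidden_states, hidden_states)
        return pooled.squeeze(1)  # (batch, hidden_size)

# Example: pooling a 120-frame reference-encoder output into one vector per item.
ref_states = torch.randn(2, 120, 256)
print(PoolingAttention()(ref_states).shape)  # torch.Size([2, 256])
```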
The resulting vector is processed through a Dot-Product Attention module. Denote the reference embedding vector as $v_{RE}$ and the GST matrix as $A_{GST}$. Denote the resulting GST weights as
$$w_{GST} = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) = \mathrm{softmax}\!\left(\frac{(v_{RE} W_Q + b_Q)(A_{GST} W_K + b_K)^{T}}{\sqrt{d_k}}\right) \tag{2}$$
where $W_Q$, $W_K$, $b_Q$, and $b_K$ are the parameters used by the linear projection layers of the Dot-Product Attention module, and $d_k$ is the hidden size of the model. Denote the style embedding vector as
$$v_{SE} = w_{GST} V = w_{GST}(A_{GST} W_V + b_V) \tag{3}$$
where $W_V$ and $b_V$ are the linear projection parameters of the Dot-Product Attention module.
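A minimal PyTorch sketch of Equations (2) and (3), assuming single-head dot-product attention and illustrative dimensions (32 tokens, hidden size 256); it is a transcription of the formulas, not the authors’ code:

```python
# Sketch of the GST attention: a reference embedding queries a bank of trainable
# style tokens; the softmax weights (Eq. 2) combine the projected tokens (Eq. 3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSTAttention(nn.Module):
    def __init__(self, hidden_size: int = 256, num_tokens: int = 32):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, hidden_size))  # A_GST
        self.proj_q = nn.Linear(hidden_size, hidden_size)                 # W_Q, b_Q
        self.proj_k = nn.Linear(hidden_size, hidden_size)                 # W_K, b_K
        self.proj_v = nn.Linear(hidden_size, hidden_size)                 # W_V, b_V

    def forward(self, ref_embedding: torch.Tensor):
        # ref_embedding (v_RE): (batch, hidden_size)
        q = self.proj_q(ref_embedding)            # (batch, d_k)
        k = self.proj_k(self.tokens)              # (num_tokens, d_k)
        v = self.proj_v(self.tokens)              # (num_tokens, d_k)
        scores = q @ k.t() / k.size(-1) ** 0.5    # scaled dot product
        w_gst = F.softmax(scores, dim=-1)         # Equation (2)
        v_se = w_gst @ v                          # Equation (3)
        return v_se, w_gst

style_embedding, gst_weights = GSTAttention()(torch.randn(2, 256))
print(style_embedding.shape, gst_weights.shape)  # (2, 256) (2, 32)
```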

3.2. Denoising Diffusion Model for GST Prediction

Creating the non-deterministic part of the inference-time style embedding is realized through sampling from a learned conditional distribution of possible embeddings:
$$SE_{NDet} \sim p_\theta(x \mid y) \tag{4}$$
where $x$ and $y$ are continuous multi-dimensional random variables, which represent, respectively, the space of all possible style embeddings and the tensors holding the conditioning information, which in this case is a sequence of hidden linguistic representations.
As described in [6], learning the probability distribution with a diffusion framework consists of learning the intermediate conditional distributions between all adjacent steps of a diffusion process. Forward diffusion consists of T transformations applied to the data, formulated as follows:
$$x^{(0)} \sim q(x^{(0)}) \;\rightarrow\; \cdots \;\rightarrow\; x^{(t)} \sim q(x^{(t)} \mid x^{(t-1)}) \;\rightarrow\; \cdots \;\rightarrow\; x^{(T)} \sim q(x^{(T)} \mid x^{(T-1)}) \tag{5}$$
where $q(x^{(t)} \mid x^{(t-1)}) = \mathcal{N}(x^{(t)}; \sqrt{1-\beta_t}\, x^{(t-1)}, \beta_t I)$ is the distribution of the forward noising process with the chosen set of hyperparameters $\beta_1, \ldots, \beta_T$. With such a formulation of the noising process, it can be shown that, for sufficiently high $T$, the distribution of $x^{(T)}$ approaches $\mathcal{N}(x^{(T)}; 0, I)$ [46].
Since both the noising and denoising transformations are first-order Gaussian processes, the joint parametrized distribution of all backward diffusion steps is as follows:
$$p_\theta(x^{(0)}, \ldots, x^{(T)} \mid y) = p_\theta(x^{(T)} \mid y) \prod_{t=1}^{T} p_\theta(x^{(t-1)} \mid x^{(t)}, y; t-1) \tag{6}$$
which allows for breaking down the Evidence Lower Bound (ELBO) of the negative log-likelihood into independent terms, which are maximized by minimizing the KL divergence between the forward and backward conditional distributions at particular steps; this, in turn, further simplifies to minimizing the squared distance between the parametrized $\mu_\theta(x^{(t)}, y, t-1)$ and the mean of the distribution $q(x^{(t-1)} \mid x^{(0)}, x^{(t)})$.
Finally, after deriving $q(x^{(t)} \mid x^{(0)})$, which allows sampling $x^{(t)} = \sqrt{\prod_{s=1}^{t}(1-\beta_s)}\, x^{(0)} + \sqrt{1 - \prod_{s=1}^{t}(1-\beta_s)}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, the ultimate objective is optimized by minimizing the squared distance between the noise added to the “clean” data and the parametrized $\epsilon_\theta(x^{(t)}, y, t-1)$.
In conclusion, the data distribution is learned by training a neural network, which, based on the provided diffusion timestep information $t-1$, the noisy data $x^{(t)}$, and the conditioning information, returns the estimated total noise added to the clean data to create $x^{(t)}$. The predicted noise is then substituted into the formula for the mean of the distribution $q(x^{(t-1)} \mid x^{(0)}, x^{(t)})$, which is subsequently used to sample $x^{(t-1)}$.
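The training objective described above can be sketched as follows; the noise-model call signature `model(x_t, cond, t)` and the linear β schedule are assumptions made for illustration, not the configuration used in this work:

```python
# Sketch of the DDPM training step: sample a timestep, noise the clean style
# embedding with the closed-form q(x^(t) | x^(0)), and regress the added noise.
import torch
import torch.nn.functional as F

T = 400
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # prod_s (1 - beta_s)

def diffusion_loss(model, x0, cond):
    """x0: clean style embeddings (batch, dim); cond: conditioning tensor."""
    t = torch.randint(0, T, (x0.size(0),))                  # 0-based timestep indices
    a_bar = alpha_bar[t].unsqueeze(-1)                      # (batch, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # sample from q(x^(t) | x^(0))
    predicted = model(x_t, cond, t)                         # epsilon_theta(x^(t), y, t)
    return F.mse_loss(predicted, noise)
```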

3.3. Non-Deterministic GST Predictor

$SE_{NDet}$ is obtained as the result of the backward diffusion algorithm presented in Algorithm 1. The prediction of the added noise in the diffusion framework is a regression problem, realized by a neural network, called the Non-deterministic GST Predictor from this point forward. The network architecture is presented in Figure 3a.
Algorithm 1 Backward diffusion sampling in a diffusion framework
Require: Trained noise prediction model $\epsilon_\theta(x^{(t)}, y, t)$, number of steps $T$, noise schedule $\{\beta_t\}_{t=1}^{T}$, conditioning information $y$.
1: Compute $\alpha_t = 1 - \beta_t$, $\;\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$
2: Sample $x^{(T)} \sim \mathcal{N}(0, I)$
3: for $t = T$ down to $1$ do
4:   $\mu_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x^{(t)} - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x^{(t)}, y, t-1)\right)$
5:   if $t > 1$ then
6:     Sample $z \sim \mathcal{N}(0, I)$
7:   else
8:     $z \leftarrow 0$
9:   end if
10:  $x^{(t-1)} = \mu_{t-1} + \sqrt{\beta_t}\, z$
11: end for
12: return $x^{(0)}$
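For reference, a direct transcription of Algorithm 1 into PyTorch-style code; the call signature of the noise prediction model and the shape of the sampled tensor (e.g., batch size × embedding dimension) are assumptions:

```python
# Sketch of the backward diffusion sampling loop from Algorithm 1.
import torch

@torch.no_grad()
def sample_style_embedding(eps_model, cond, betas, shape):
    """Sample SE_NDet by backward diffusion. betas: (T,) noise schedule."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # x^(T) ~ N(0, I)
    for t in range(len(betas), 0, -1):                       # t = T .. 1
        step = torch.full(shape[:1], t - 1, dtype=torch.long)
        eps = eps_model(x, cond, step)                       # epsilon_theta(x^(t), y, t-1)
        mu = (x - betas[t - 1] / (1.0 - alpha_bar[t - 1]).sqrt() * eps) / alphas[t - 1].sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mu + betas[t - 1].sqrt() * z                     # x^(t-1)
    return x                                                 # x^(0), i.e., SE_NDet
```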
Encoder: The encoder processes the linguistic information into a sequence of enriched representations, which is used to condition the generation of the output noise. Since the entire system operates on a phoneme representation of the textual data, it is unable to learn high-level semantic information, which would significantly aid in predicting the expected output. To address this problem, we enrich the phoneme-level input to the model with hidden representations taken from a pre-trained BERT model. This was inspired by [33], which shows that using pre-trained text representations improves the accuracy of predicting style features compared with baseline approaches, such as [4], that incorporate only phoneme-level representations. Denote the enriched input as follows:
$$H_{\text{PHO+BERT}} = \mathrm{cat}(H_{\text{PHO}}; H_{\text{BERT}}) \tag{7}$$
where $\mathrm{cat}(\cdot)$ is a concatenation operation, and $H_{\text{BERT}}$ is a sequence of token-level BERT hidden representations, first averaged and then stretched using token-to-word and word-to-phoneme mappings, respectively. A stack of FFT blocks processes the enriched linguistic information and then passes it through the post-encoder, which consists of a fully connected layer, lowering the dimensionality of the representations to the hidden size of the backbone model, followed by SiLU [47] activation and an additional dropout [48] layer.
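One plausible reading of Equation (7) and the averaging/stretching step, sketched in PyTorch; the mapping format (index lists) and all sizes are assumptions for illustration:

```python
# Sketch: average sub-word BERT vectors into word vectors, stretch them onto the
# phoneme sequence, and concatenate with the phoneme hidden states (Eq. 7).
import torch

def enrich_with_bert(h_pho, h_bert, token_to_word, word_to_phoneme):
    """h_pho: (n_phonemes, d_model); h_bert: (n_tokens, d_bert).
    token_to_word[i] = word index of BERT token i;
    word_to_phoneme[j] = word index of phoneme j."""
    token_to_word = torch.as_tensor(token_to_word)
    n_words = int(token_to_word.max()) + 1
    # Average sub-word token vectors into word-level vectors.
    word_vecs = torch.stack(
        [h_bert[token_to_word == w].mean(dim=0) for w in range(n_words)]
    )
    # Stretch word-level vectors onto the phoneme sequence.
    h_bert_stretched = word_vecs[torch.as_tensor(word_to_phoneme)]  # (n_phonemes, d_bert)
    return torch.cat([h_pho, h_bert_stretched], dim=-1)             # H_PHO+BERT

h = enrich_with_bert(torch.randn(40, 256), torch.randn(12, 768),
                     token_to_word=[0] * 3 + [1] * 4 + [2] * 5,
                     word_to_phoneme=[0] * 15 + [1] * 10 + [2] * 15)
print(h.shape)  # torch.Size([40, 1024])
```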
Decoder: The decoder predicts the Gaussian noise used to create the noised data at time step $t$, using the encoded linguistic information, the time step number $t-1$, and the noised data $x^{(t)}$. The integer $t-1$ is first encoded as a numerical vector using positional encoding, as described in [18], and is then passed through a fully connected layer with SiLU activation. The noisy data $x^{(t)}$ is first stretched by a 1×1 convolutional pre-net to the number of internal channels and subsequently transformed by a stack of residual blocks. Each block contains three 1D convolution layers with SiLU activation, preceded by group normalization [49] and followed by a dropout regularization layer. The block’s input is passed through the first convolution and subsequently used as the query to a Multi-Head Attention module, which selects a combination of transformed elements of the conditioning information sequence for each of its channels. The attention output is added to the query and passed through the second convolution layer. This is followed by adding the encoded timestep, broadcast to match the intermediate output’s dimensionality. The final convolution and a residual connection complete the block’s transformation. The output of the last block is passed through a pooling attention module, which reduces it to a single vector.
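A sketch of a single decoder residual block following the ordering described above; the kernel sizes, dropout rate, group count, and exact placement of normalization are assumptions rather than the authors’ configuration:

```python
# Sketch of one residual block: conv -> cross-attention on the conditioning
# sequence -> conv -> add timestep embedding -> conv -> residual connection.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 256, cond_dim: int = 256, num_heads: int = 4):
        super().__init__()
        def conv():
            return nn.Sequential(
                nn.GroupNorm(8, channels),
                nn.Conv1d(channels, channels, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Dropout(0.1),
            )
        self.conv1, self.conv2, self.conv3 = conv(), conv(), conv()
        self.attn = nn.MultiheadAttention(channels, num_heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, x, cond, t_emb):
        # x: (batch, channels, length); cond: (batch, cond_len, cond_dim);
        # t_emb: (batch, channels) encoded diffusion timestep.
        query = self.conv1(x).transpose(1, 2)               # (batch, length, channels)
        attn_out, _ = self.attn(query, cond, cond)
        h = self.conv2((query + attn_out).transpose(1, 2))  # second convolution
        h = h + t_emb.unsqueeze(-1)                         # broadcast timestep embedding
        return x + self.conv3(h)                            # residual connection

block = ResidualBlock()
out = block(torch.randn(2, 256, 40), torch.randn(2, 40, 256), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 256, 40])
```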

3.4. Deterministic GST Predictor

To enable diversity control of the generated speech, we train an additional module that predicts the style embedding deterministically, referred to as the Deterministic GST Predictor. Its role is to predict the GST weights $w_{GST}$, based solely on the enriched linguistic information $H_{\text{PHO+BERT}}$; the weights are then used to obtain the deterministic style embedding, denoted as $SE_{Det}$. This decision stems from the expectation that predicting the weights is much easier than predicting the full embedding, given the significant difference in their dimensionalities. Conversely, due to the stochastic nature of the Non-deterministic GST Predictor, we expect slight fluctuations in the style-embedding vector to do less harm to the system’s output than they would if the module generated the weights instead. The reason is that, since some of the softmax-based weights are expected to be close to zero, every small absolute deviation would amount to a considerable relative change, which would then have a visible influence on the resulting style embedding.
The neural architecture of the model is presented in Figure 3b. It accepts $H_{\text{PHO+BERT}}$, described in Equation (7), as its input, which is first passed through a fully connected layer, reducing its dimensionality, followed by ReLU activation [50] and dropout. The encoded input is then processed through a stack of FFT blocks, pooling attention, and a post-net, with a similar architecture to the pre-net, which projects it to match the dimensionality of the GST weights vector.

3.5. Inference

During inference, the control of diversity of the produced speech is realized by obtaining a linear combination of the non-deterministic and deterministic style embeddings using a hyperparameter, which is denoted as λ . The deterministic embedding should be considered the baseline case, which is an averaged representation of the possible style characteristics corresponding to the given textual input. Blending it with the non-deterministic embedding with specific proportions allows for controllable exploration of multiple possible style representations.
Assume that we have the following: (a) a pre-trained backbone model with reference embedding and GST vectors; (b) two GST predictors, trained using the $(H_{\text{PHO+BERT}}, v_{SE}, w_{GST})$ tuples with reference data taken from the backbone model. The inference process, given a set of input phonemes and BERT hidden representations, consists of the following steps (see the sketch after this list):
  • Process the input phonemes through the encoder of the backbone model and obtain $H_{\text{PHO}}$.
  • Concatenate the BERT representations with the encoder’s output to obtain $H_{\text{PHO+BERT}}$.
  • Calculate $SE_{NDet}$ using Algorithm 1, where the conditioning information $y$ is $H_{\text{PHO+BERT}}$.
  • Calculate $w_{GST}$ using the Deterministic GST Predictor and substitute it into Equation (3) to obtain $SE_{Det}$.
  • Calculate the final style embedding as $SE = \lambda\, SE_{Det} + (1 - \lambda)\, SE_{NDet}$.
  • Add the final style embedding to $H_{\text{PHO}}$ and pass it through the backbone model’s decoder to obtain the output spectrogram.
  • Convert the output spectrogram to the output waveform using a vocoder.
The resulting waveform is the system’s output and does not require further processing.
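The sketch below strings these steps together; every module name and call signature is a hypothetical placeholder standing in for the components named above (duration regulation inside the backbone is abstracted away):

```python
# End-to-end inference sketch: blend deterministic and non-deterministic style
# embeddings with the diversity hyperparameter lambda, then decode and vocode.
import torch

def synthesize(phonemes, bert_hidden, modules, lam: float = 0.6):
    """lam is the diversity hyperparameter; lam = 1.0 is fully deterministic."""
    h_pho = modules["backbone_encoder"](phonemes)                     # step 1
    h_enriched = torch.cat([h_pho, bert_hidden], dim=-1)              # step 2
    se_ndet = modules["ndet_gst_predictor"].sample(cond=h_enriched)   # step 3 (Algorithm 1)
    w_gst = modules["det_gst_predictor"](h_enriched)                  # step 4
    se_det = modules["gst_attention"].embed_from_weights(w_gst)       # Equation (3)
    se = lam * se_det + (1.0 - lam) * se_ndet                         # step 5
    mel = modules["backbone_decoder"](h_pho + se.unsqueeze(1))        # step 6
    return modules["vocoder"](mel)                                    # step 7
```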

4. Results

The two main questions that were to be answered through experiments are as follows: (1) How natural and expressive does the speech produced by the proposed system sound compared to other well-known TTS systems? (2) Does the λ parameter of the proposed system allow us to manipulate the high-level properties of the produced speech?
To evaluate the system’s performance, we conducted two primary types of experiments: one examining numerical statistics derived from the generated speech samples and the other involving subjective evaluation by human raters. Each type comprises two experiments: the first compares various configurations of the proposed system with baseline models, while the second evaluates the influence of the λ parameter on the variety of features obtained from the generated samples.
The pivotal part of the experiments was the evaluation of the system by human raters, which we consider crucial for the following reason: the addressed problem revolves around human-centered computing and, as such, should be rated within a subjective evaluation framework, which is bound to yield more valuable results than less informative objective metrics. We conducted the subjective evaluation by presenting 15 people with a set of generated samples and summarizing the numerical scores they assigned to them.

4.1. Experimental Setup

The data set used for the experiments is LJSpeech [51], a publicly available data set consisting of 13,100 recordings of text samples from 7 non-fiction books. A native English-speaking female recorded all samples. The textual part of each data sample was normalized and transformed into a sequence of phonemes using the g2pE 2.1.0 tool [52]. The audio files were padded to 10 s and converted to 80-dimensional mel spectrograms. The phoneme durations were obtained using the Montreal Forced Alignment tool [53].
The model parameters are described in Table 1. The hyperparameters of the backbone model (hidden dimension, number of blocks, configuration of attention) were directly adapted from the original FastSpeech paper [1]. The configuration of the GST module (number of tokens) and diffusion process (parameter schedule) were chosen empirically. In particular, selecting the number of tokens as 10, as in the original paper [3], resulted in a worse quality of the generated sound; therefore, we ultimately decided to set the value to 32.
The acoustic model was trained for 95 K steps (around eight hours). During the initial 80 K steps, the GST module was disabled, and the backbone model was conditioned directly on the Reference Encoder’s output. The GST module weights were subsequently trained for the next 10 K steps with the rest of the model frozen. Finally, the entire model was fine-tuned for the last 5000 steps. The base vanilla model was trained for 90 K steps. Training the GST Predictor took 400 K steps (about 18 h), with the deterministic part frozen after 40 K steps to prevent overfitting. The weights of all components of the model were trained using the Adam optimizer [54]. The components using FFT blocks had their learning rates adjusted according to the schedule described in [18], while the non-deterministic GST Predictor used a fixed learning rate of $2 \times 10^{-4}$.
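For reference, the learning-rate schedule of [18] (linear warm-up followed by inverse-square-root decay); the model dimension and warm-up step values shown here are assumptions, not necessarily those used in our training:

```python
# Transformer learning-rate schedule from [18]:
# lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
def transformer_lr(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(1), transformer_lr(4000), transformer_lr(400_000))
```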
For the baseline model, we chose Tacotron2 [13], an autoregressive model with a recurrent neural network architecture. The samples it generated were transformed into waveforms using either the WaveRNN vocoder [55] or the Griffin–Lim algorithm [56]. We used models from the PyTorch 2.5.1 pre-trained model registry.
The experiments were conducted on 100 test samples, randomly selected from the data set and not used during training. Additionally, several samples have been manually selected. For each configuration of the proposed system, we generated 15 outputs corresponding to their textual content, as shown in Table 2. Those were subsequently used to examine the variety of the system’s output. For all configurations of the system, we used a pre-trained HiFi-GAN vocoder [57] to obtain the waveform from the predicted spectrograms. Both training and inference were conducted on a single NVIDIA GeForce RTX 4070 Super GPU with 12 GB of VRAM.
For experiments, we compile several versions of the system, which differ in the choice of the λ parameter. The chosen values are 0.2, 0.4, 0.6, 0.8, and 1.0, with the last corresponding to the fully deterministic system.

4.2. Speech Quality Evaluation with Objective Metrics

To thoroughly evaluate the system’s performance, we obtained objective metrics from the generated samples. First, we extracted the fundamental frequency (F0) contour from each sample. We aligned it with the one from the corresponding ground-truth sample using the Dynamic Time Warping (DTW) technique applied to the spectrograms. The durations were extracted from the MFA alignments. We subsequently calculated the following metrics: (a) Root Mean Square Error between the F0 contours (F0 RMSE). (b) Pearson correlation coefficient between the F0 contours (F0 Pearson). (c) Mel Cepstral Distortion (MCD), which is a measure of distance between high-level features extracted from the spectrogram, representing the properties of speech crucial for the recognition thereof. (d) Mean Relative Error between word durations (Duration MRE), which is the mean relative difference between the word durations of generated and ground truth samples. For each evaluated system, the values of each metric were averaged across all test samples.
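A minimal NumPy sketch of metrics (a), (b), and (d), assuming the F0 contours have already been aligned frame-to-frame (e.g., via DTW on the spectrograms) and that durations come from the MFA alignments; the arrays below are toy data:

```python
# Sketch of F0 RMSE, F0 Pearson correlation, and Duration MRE on aligned inputs.
import numpy as np

def f0_rmse(f0_gen: np.ndarray, f0_ref: np.ndarray) -> float:
    return float(np.sqrt(np.mean((f0_gen - f0_ref) ** 2)))

def f0_pearson(f0_gen: np.ndarray, f0_ref: np.ndarray) -> float:
    return float(np.corrcoef(f0_gen, f0_ref)[0, 1])

def duration_mre(dur_gen: np.ndarray, dur_ref: np.ndarray) -> float:
    return float(np.mean(np.abs(dur_gen - dur_ref) / dur_ref))

rng = np.random.default_rng(0)
f0_a = 120 + 20 * rng.standard_normal(200)   # toy generated F0 contour (Hz)
f0_b = 120 + 20 * rng.standard_normal(200)   # toy ground-truth F0 contour (Hz)
print(f0_rmse(f0_a, f0_b), f0_pearson(f0_a, f0_b),
      duration_mre(np.array([0.31, 0.22, 0.40]), np.array([0.30, 0.25, 0.38])))
```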
The obtained results are presented in Table 3. We also included the metrics calculated for ground-truth spectrograms processed with the HiFi-GAN vocoder, enabling us to evaluate the bias introduced by the metric calculation.

4.3. Diversity of Expression Evaluated with Objective Metrics

The second experiment is designed to give us insight into the diversity of properties of generated speech. To this end, we obtained the same features as described in Section 4.2 from each of the 15 samples generated by each configuration of the system for the chosen textual prompts. We used the following metrics for the evaluation: (a) standard deviation of values belonging to the fundamental frequency contour; (b) standard deviation of the word durations. The metrics were calculated position-wise; that is, each text prompt yielded a sequence of standard deviation values (e.g., computed over all 15 instances of each word), which were subsequently averaged across all textual prompts.
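A small NumPy sketch of the position-wise computation, using toy word durations (three variants of a three-word prompt) instead of the 15 generated samples:

```python
# Position-wise diversity metric: per-word standard deviation over the generated
# variants of one prompt, then averaged over the words.
import numpy as np

def positionwise_std(word_durations: np.ndarray) -> float:
    """word_durations: (n_variants, n_words) durations of the same prompt's words."""
    return float(np.mean(np.std(word_durations, axis=0)))

variants = np.array([[0.30, 0.21, 0.44],
                     [0.35, 0.20, 0.40],
                     [0.28, 0.24, 0.47]])   # toy data: 3 variants x 3 words (seconds)
print(positionwise_std(variants))
```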
The results are shown in Table 4. The standard deviation of word durations, computed separately for each word, is shown in Figure 4.

4.4. Subjective Evaluation of Speech Quality and Diversity

We investigate the following hypotheses:
  • H1: Speech generated by the proposed system is comparable to that produced by the baseline systems in terms of quality (lack of artifacts, quality of sound, lack of spelling errors).
  • H2: Speech generated by the proposed system is comparable to that produced by the baseline systems in terms of expressiveness (natural intonation, consistent rhythm).
  • H3: Manipulating the λ hyperparameter of the system has a visible influence on the diversity of the high-level features of the produced speech (intonation, breaks, duration of words).
To assess the proposed system’s performance, we recruited 15 participants from our organization, adults ranging in age from their twenties to their fifties. During the experiment, the results were collected and stored anonymously; that is, each participant submitted a number of ratings without providing personal data.
Each participant was initially informed of all the definitions of the hypotheses H1–H3 and subsequently presented with 10 pairs of questions containing randomly sampled data. The first question contained a pair of recordings, one produced by the proposed system and the other by one of the baselines (Tacotron2 or Vanilla). The participant was asked to choose which recording sounds better in terms of (1) its quality and (2) its expressiveness. The second question contained another pair of recordings selected from all 15 outputs generated by a chosen system for a given data sample. The systems differed in their choice of the parameter λ , which had available values of 0.2, 0.6, and 1.0. The participant was asked to assign a numerical score of 0 to 2 to the recordings, where 0 indicated no difference in the high-level features, and 2 indicated a significant difference. The recordings for both questions were chosen from the sets described in Section 4.1. The participants were shown the textual input corresponding to the recordings they listened to.
The summarized results are shown in Figure 5a–c.

4.5. Case Study: Diversity of Word Durations

In addition to examining numerical metrics calculated across numerous test samples, we visualized the exact F0 contours obtained from each of the 15 samples generated for various system configurations. The extracted contours, together with the one obtained from the ground truth sample, for systems with the parameter λ equal to 0.2 and 0.8 are plotted in Figure 6a,b, respectively.

5. Discussion

The evaluation with objective metrics shows that the proposed system achieves the best performance across all metrics; however, for each metric, there is also at least one of its configurations that scores worse than at least one baseline model.
The system with λ = 1.0 never achieves the best score, which may indicate that the deterministic part of the GST Predictor is insufficient to model the distribution of style embeddings accurately. On the other hand, the system with λ = 0.2 does not achieve the best score in every case, indicating that eager exploration of the distribution of possible embeddings corresponding to a given textual prompt leads to the generation of samples that are not close to the ground-truth ones. This, in turn, suggests two things: (1) the value of performing subjective tests, which do not necessarily involve direct comparison with the ground truth; (2) the value of using advanced distribution-modeling approaches for predicting the style embedding (we listened to each generated sample and found no significant flaws, which indicates that the style-embedding space contains multiple valid points that allow the generation of good-quality speech).
In the evaluation of speech diversity, the standard deviation values are inversely correlated with the chosen values of the parameter λ . This indicates not only that treating the style-embedding prediction as a generative process has a real influence on the high-level properties of speech, but also that manipulating the system’s input parameters allows for smooth control over the diversity of the produced output. The influence of the control of diversity can be seen within a single sentence as well, as manipulating λ has a clear impact on the diversity of the length of particular words.
The subjective evaluation of the different systems, presented in Figure 5a,b, shows that the proposed system outperforms the baselines in terms of both quality and expressiveness (H1 and H2, respectively). It is worth noting that a noticeable gap exists between the rated performance of the proposed and vanilla systems, about 80:20 and 75:25 for quality and expressiveness, respectively. The comparison with Tacotron2 also shows a slight advantage for the proposed system, with a 57:43 split in its favor for both quality and expressiveness. Figure 5c shows that there is a correlation between the choice of the λ parameter and the assessed difference between the high-level features of the compared generated recordings. Similarly to the results for the ground-truth recordings in Table 3, the system with λ = 1.0 serves as a measure of the experiment’s inaccuracy and should be used as a reference when assessing the remaining results.
It should be noted that incorporating the additional GST Predictor components increases the inference latency. Obtaining a single second of waveform using the proposed method takes, on average, 206 milliseconds, compared to 98 milliseconds in the baseline setup. Although this means doubling the inference time, we do not consider it a serious drawback for the two following reasons: (1) synthesis remains faster than real time, which makes it suitable for applications where the response time of the system is crucial; (2) the components we use keep the system parallel, which means the inference time does not grow rapidly with the length of the input text, as it does in autoregressive systems.
Both subjective and objective results indicate the following:
  • The proposed method of explicit style modeling improves the quality of the generated speech, which proves its value in constructing robust TTS systems.
  • The use of a diffusion framework in predicting the stylistic features has a visible influence on the diversity of high-level prosodic characteristics, such as pitch and word duration.
  • Blending the regressive and diffusion-based predictors allows for smooth control of the produced speech’s prosody. As a result, by manipulating a single hyperparameter, we can decide whether, e.g., the pitch track will stay centered near the average or deviate from it.
We consider the points above to prove that our system may be successfully applied in environments where natural and expressive speech generation is demanded, such as social media content creation or personal voice assistants. Our work reduces the gap between automatically generated and natural speech.

6. Conclusions

In this paper, we present our TTS system, which models speaking style by predicting GST-based style embedding in both deterministic and non-deterministic manners, using a transformer-based model and a diffusion model, respectively. The choice of architecture makes the system independent of any complex additional input; therefore, it lends itself better to real-time human–computer interfaces where the only available input data is of a single modality.
Through experiments involving the calculation of objective numerical metrics, we demonstrated that the system is capable of producing speech of comparable quality to that generated by another neural network-based TTS system. We also demonstrated that the system enables the generation of speech that is diverse with respect to its high-level prosodic features, such as fundamental frequency and word duration. By manipulating a single input parameter of the system, it is possible to adjust the level of diversity in the generated output.
In light of the statements above, we conclude that our proposed system enhances human–computer interaction by making the reception of the produced speech more natural and pleasant.
In this paper, we focus on the contribution of our method to improving the expressiveness of the generated speech, both in terms of sentence-level diversity of prosodic features and non-monotonous prosody across different generated sentences. We focused on a single-speaker, single-sentence setup, which spans only a portion of the entire hierarchy of levels at which expressiveness can be improved; we intend to explore the remaining levels in future work. For example, the proposed method could be applied to a larger multi-speaker dataset with much more diverse prosody and speaker-specific characteristics. This would require a stronger acoustic backbone, suited to large amounts of noisy data, and possibly a more powerful algorithm for learning the probability distribution. Moreover, the demand for synthesizing long-form speech, e.g., audiobook paragraphs, leads us to believe that our approach could be extended to such setups by incorporating a method for modeling inter-sentence style relationships, which would require more complex style representations.

Author Contributions

Conceptualization: W.P.; methodology: W.P.; software: W.P.; validation: W.P.; formal analysis: W.P.; investigation: W.P.; resources: W.P.; data curation: W.P.; writing—original draft preparation: W.P. and T.H.; writing—review and editing: W.P. and T.H.; visualization: W.P.; supervision: T.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TTS    Text-to-Speech
ETTS   Expressive Text-to-Speech
GST    Global Style Tokens
WSV    Word-level Style Variation
BERT   Bidirectional Encoder Representations from Transformers
FFT    Feed-Forward Transformer
DDPM   Denoising Diffusion Probabilistic Models
ELBO   Evidence Lower Bound
DTW    Dynamic Time Warping
RMSE   Root Mean Square Error
MCD    Mel Cepstral Distortion
MRE    Mean Relative Error

References

  1. Ren, Y.; Ruan, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. Fastspeech: Fast, robust and controllable text to speech. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  2. Barakat, H.; Turk, O.; Demiroglu, C. Deep learning-based expressive speech synthesis: A systematic review of approaches, challenges, and resources. Eurasip J. Audio Speech Music. Process. 2024, 2024, 11. [Google Scholar] [CrossRef]
  3. Wang, Y.; Stanton, D.; Zhang, Y.; Ryan, R.S.; Battenberg, E.; Shor, J.; Xiao, Y.; Jia, Y.; Ren, F.; Saurous, R.A. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5180–5189. [Google Scholar]
  4. Stanton, D.; Wang, Y.; Skerry-Ryan, R. Predicting expressive speaking style from text in end-to-end speech synthesis. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; IEEE: New York, NY, USA, 2018; pp. 595–602. [Google Scholar]
  5. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
  6. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  7. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  8. Zhou, J.; Ding, D.; Li, Y.; Lu, Y.; Wang, Y.; Zhang, Y.; Chen, Y.C.; Xue, G. M2SILENT: Enabling Multi-user Silent Speech Interactions via Multi-directional Speakers in Shared Spaces. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25), Yokohama, Japan, 26 April–1 May 2025. [Google Scholar] [CrossRef]
  9. Brade, S.; Anderson, S.; Kumar, R.; Jin, Z.; Truong, A. SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25), Yokohama, Japan, 26 April–1 May 2025. [Google Scholar] [CrossRef]
  10. Danielescu, A.; Horowit-Hendler, S.A.; Pabst, A.; Stewart, K.M.; Gallo, E.M.; Aylett, M.P. Creating Inclusive Voices for the 21st Century: A Non-Binary Text-to-Speech for Conversational Assistants. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), Hamburg, Germany, 23–28 April 2023. [Google Scholar] [CrossRef]
  11. Ma, Y.; Wang, S.; Hu, Z.; Fan, C.; Lv, T.; Ding, Y.; Deng, Z.; Yu, X. Styletalk: One-shot talking head generation with controllable speaking styles. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1896–1904. [Google Scholar] [CrossRef]
  12. Wang, Y.; Skerry-Ryan, R.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Yang, Z.; Xiao, Y.; Chen, Z.; Bengio, S.; et al. Tacotron: Towards End-to-End Speech Synthesis. arXiv 2017, arXiv:1703.10135. [Google Scholar]
  13. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.; et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: New York, NY, USA, 2018; pp. 4779–4783. [Google Scholar]
  14. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
  15. Arık, S.Ö.; Chrzanowski, M.; Coates, A.; Diamos, G.; Gibiansky, A.; Kang, Y.; Li, X.; Miller, J.; Ng, A.; Raiman, J.; et al. Deep voice: Real-time neural text-to-speech. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 195–204. [Google Scholar]
  16. Li, N.; Liu, S.; Liu, Y.; Zhao, S.; Liu, M. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; Volume 33, pp. 6706–6713. [Google Scholar]
  17. Łańcucki, A. Fastpitch: Parallel text-to-speech with pitch prediction. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6588–6592. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  19. Liu, Y.; Yan, Z.; Chen, S.; Ye, T.; Ren, W.; Chen, E. NightHazeFormer: Single Nighttime Haze Removal Using Prior Query Transformer. arXiv 2023, arXiv:2305.09533. [Google Scholar]
  20. Zhang, J.; Zhang, Y.; Gu, J.; Dong, J.; Kong, L.; Yang, X. Xformer: Hybrid X-Shaped Transformer for Image Denoising. arXiv 2023, arXiv:2303.06440. [Google Scholar]
  21. Wang, C.; Chen, S.; Wu, Y.; Zhang, Z.; Zhou, L.; Liu, S.; Chen, Z.; Liu, Y.; Wang, H.; Li, J.; et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv 2023, arXiv:2301.02111. [Google Scholar] [CrossRef]
  22. Ren, Y.; Lei, M.; Huang, Z.; Zhang, S.; Chen, Q.; Yan, Z.; Zhao, Z. ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech. arXiv 2022, arXiv:2202.07816. [Google Scholar]
  23. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
  24. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
  25. Skerry-Ryan, R.; Battenberg, E.; Xiao, Y.; Wang, Y.; Stanton, D.; Shor, J.; Weiss, R.; Clark, R.; Saurous, R.A. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4693–4702. [Google Scholar]
  26. Li, Y.A.; Han, C.; Mesgarani, N. StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis. IEEE J. Sel. Top. Signal Process. 2025, 19, 283–296. [Google Scholar] [CrossRef]
  27. Li, Y.A.; Han, C.; Raghavan, V.; Mischler, G.; Mesgarani, N. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 19594–19621. [Google Scholar]
  28. Min, D.; Lee, D.B.; Yang, E.; Hwang, S.J. Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: New York, NY, USA, 2021; Volume 139, pp. 7748–7759. [Google Scholar]
  29. Défossez, A.; Copet, J.; Synnaeve, G.; Adi, Y. High Fidelity Neural Audio Compression. arXiv 2022, arXiv:2210.13438. [Google Scholar] [CrossRef]
  30. Wu, P.; Ling, Z.; Liu, L.; Jiang, Y.; Wu, H.; Dai, L. End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November 2019; pp. 623–627. [Google Scholar] [CrossRef]
  31. Li, J.; Meng, Y.; Li, C.; Wu, Z.; Meng, H.; Weng, C.; Su, D. Enhancing speaking styles in conversational text-to-speech synthesis with graph-based multi-modal context modeling. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 7917–7921. [Google Scholar]
  32. An, X.; Wang, Y.; Yang, S.; Ma, Z.; Xie, L. Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 184–191. [Google Scholar] [CrossRef]
  33. Zhang, Y.J.; Ling, Z.H. Extracting and Predicting Word-Level Style Variations for Speech Synthesis. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1582–1593. [Google Scholar] [CrossRef]
  34. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  35. Liu, Z.; Wu, N.; Zhang, Y.; Ling, Z. Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 5508–5512. [Google Scholar] [CrossRef]
  36. Jeong, M.; Kim, H.; Cheon, S.J.; Choi, B.J.; Kim, N.S. Diff-TTS: A Denoising Diffusion Model for Text-to-Speech. arXiv 2021, arXiv:2104.01409. [Google Scholar] [CrossRef]
  37. Shih, K.J.; Valle, R.; Badlani, R.; Lancucki, A.; Ping, W.; Catanzaro, B. RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis. In Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, Virtual, 23 July 2021. [Google Scholar]
  38. Kim, J.; Kong, J.; Son, J. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. arXiv 2021, arXiv:2106.06103. [Google Scholar]
  39. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv 2022, arXiv:2205.11487. [Google Scholar]
  40. Liu, Y.; Wang, X.; Hu, E.; Wang, A.; Shiri, B.; Lin, W. VNDHR: Variational Single Nighttime Image Dehazing for Enhancing Visibility in Intelligent Transportation Systems via Hybrid Regularization. IEEE Trans. Intell. Transp. Syst. 2025, 26, 10189–10203. [Google Scholar] [CrossRef]
  41. Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video Diffusion Models. arXiv 2022, arXiv:2204.03458. [Google Scholar]
  42. Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. arXiv 2020, arXiv:2009.09761. [Google Scholar]
  43. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  44. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  46. Turner, R.E.; Diaconu, C.D.; Markou, S.; Shysheya, A.; Foong, A.Y.K.; Mlodozeniec, B. Denoising Diffusion Probabilistic Models in Six Simple Steps. arXiv 2024, arXiv:2402.04384. [Google Scholar] [CrossRef]
  47. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef]
  48. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  49. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  50. Agarap, A.F. Deep Learning using Rectified Linear Units (ReLU). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  51. Ito, K.; Johnson, L. The LJ Speech Dataset. 2017. Available online: https://keithito.com/LJ-Speech-Dataset/ (accessed on 29 November 2025).
  52. Park, K.; Kim, J. g2pE. 2019. Available online: https://github.com/Kyubyong/g2p (accessed on 29 November 2025).
  53. McAuliffe, M.; Socolof, M.; Mihuc, S.; Wagner, M.; Sonderegger, M. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; Volume 2017, pp. 498–502. [Google Scholar]
  54. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  55. Kalchbrenner, N.; Elsen, E.; Simonyan, K.; Noury, S.; Casagrande, N.; Lockhart, E.; Stimberg, F.; Oord, A.; Dieleman, S.; Kavukcuoglu, K. Efficient neural audio synthesis. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 2410–2419. [Google Scholar]
  56. Griffin, D.; Lim, J. Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 236–243. [Google Scholar] [CrossRef]
  57. Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. arXiv 2020, arXiv:2010.05646. [Google Scholar]
Figure 1. Simplified flow of data in the system.
Figure 2. The architecture of the system. The solid lines show the flow of data during inference, while the dotted lines indicate the flow of data in the style conditioning module during training. The abbreviated names DR, N. Det. GST Pred., and Det. GST Pred. stand for Diversity Regulator, Non-Deterministic Global Style Tokens Predictor, and Deterministic Global Style Tokens Predictor, respectively.
Figure 3. The network architecture of the components of the GST Predictor: (a) Non-deterministic GST Predictor. (b) Deterministic GST Predictor.
Figure 4. Comparison of the standard deviation of the duration of individual words in the sentence “They were followed by a crowd of reckless boys, who jeered at and insulted them” for different values of the λ parameter.
Figure 5. Results of the subjective evaluation of the proposed system’s performance. The numerical scores are aggregated across all raters. (a) Results of the AB preference test in which recordings were assessed for quality. (b) Results of the AB preference test in which recordings were assessed for expressiveness. (c) Violin plots showing the distributions and means of ratings in the MOS test on a 0–2 scale.
Figure 6. F0 contours obtained from samples generated for a chosen textual prompt (grey) and from the ground-truth sample taken from the data set (red). The F0 contours are clearly more diverse in the system with λ = 0.2, whereas those generated with λ = 0.8 are centered around the average. (a) Samples generated with the λ parameter set to 0.2. (b) Samples generated with the λ parameter set to 0.8.
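To reproduce contour plots of this kind, an F0 trajectory must first be extracted from each generated waveform. The sketch below is only an illustrative example using librosa’s pYIN pitch tracker; the choice of extractor, the sampling rate, and the pitch range are our assumptions and are not prescribed by the paper.

```python
import numpy as np
import librosa

def extract_f0(wav_path: str, sr: int = 22050) -> np.ndarray:
    """Illustrative F0 extraction for plotting contours as in Figure 6.

    The extractor (pYIN) and the parameter values below are assumptions,
    not necessarily those used in the study.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz lower bound (assumed)
        fmax=librosa.note_to_hz("C6"),  # ~1047 Hz upper bound (assumed)
        sr=sr,
    )
    # Mask unvoiced frames with NaN so they are not drawn in the plot.
    return np.where(voiced_flag, f0, np.nan)
```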
Table 1. System parameters used in the experiments.
| Model’s Name | Component’s Name | Description of Parameter | Value of Parameter |
|---|---|---|---|
| Acoustic | Encoder/Decoder | Number of FFT blocks | 6 |
| | | Number of filters in the FFT’s 1D Convolution | 1536 |
| | Duration predictor | Number of convolutional layers | 4 |
| | GST | Number of tokens | 32 |
| | Reference encoder | Number of blocks | 5 |
| | | Number of filters in the FFT’s 1D Convolution | 384 |
| | Global | Dropout rate | 0.1 |
| | | Hidden dimension | 384 |
| | | Number of heads in FFT’s multi-head attention | 4 |
| Non-det. GST Predictor | Encoder | Number of blocks | 16 |
| | | Number of heads in FFT’s multi-head attention | 4 |
| | | Number of filters in the FFT’s 1D Convolution | 1536 |
| | | Dropout rate | 0.1 |
| | Decoder | Number of blocks | 10 |
| | | Number of 1D convolution channels | 128 |
| | | Timestep embedding size | 128 |
| | Diffusion | Number of steps | 1000 |
| | | Diffusion schedule | Linear [β₁ = 1 × 10⁻⁴, β_T = 2 × 10⁻⁵] |
| Det. GST Predictor | – | Number of blocks | 2 |
| | | Dropout rate | 0.2 |
| | | Hidden dimension | 768 |
| | | Number of heads in multi-head attention | 16 |
| | | Number of filters in the FFT’s 1D Convolution | 1536 |
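For readers who want to mirror the configuration in Table 1 in code, the following Python sketch collects the listed hyperparameters in illustrative dataclasses and builds the linear noise schedule. All class and field names are hypothetical and do not come from the authors’ implementation; the numeric values simply restate Table 1.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical configuration objects mirroring Table 1 (illustrative names only).

@dataclass
class AcousticConfig:
    num_fft_blocks: int = 6             # encoder/decoder FFT blocks
    fft_conv_filters: int = 1536        # filters in the FFT 1D convolution
    duration_predictor_layers: int = 4
    num_gst_tokens: int = 32
    ref_encoder_blocks: int = 5
    ref_encoder_conv_filters: int = 384
    dropout: float = 0.1
    hidden_dim: int = 384
    attention_heads: int = 4

@dataclass
class NonDetGSTPredictorConfig:
    encoder_blocks: int = 16
    encoder_heads: int = 4
    encoder_conv_filters: int = 1536
    dropout: float = 0.1
    decoder_blocks: int = 10
    decoder_conv_channels: int = 128
    timestep_embedding_size: int = 128
    diffusion_steps: int = 1000
    beta_start: float = 1e-4            # beta range as listed in Table 1 (assumed reading)
    beta_end: float = 2e-5

def linear_beta_schedule(cfg: NonDetGSTPredictorConfig) -> np.ndarray:
    """Linear schedule of beta_1..beta_T for the diffusion-based GST predictor."""
    return np.linspace(cfg.beta_start, cfg.beta_end, cfg.diffusion_steps)
```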
Table 2. Textual contents of several samples from the test set chosen for evaluation of speech diversity.
| Sample ID | Textual Content |
|---|---|
| LJ005-0008 | They were followed by a crowd of reckless boys, who jeered at and insulted them. |
| LJ008-0184 | Precautions had been taken by the erection of barriers, and the posting of placards at all the avenues to the Old Bailey, on which was printed. |
| LJ011-0095 | He had prospered in early life, was a slop-seller on a large scale at Bury St. Edmunds, and a sugar-baker in the metropolis. |
| LJ015-0009 | Cole’s difficulties increased more and more; warrant-holders came down upon him demanding to realize their goods. |
| LJ048-0261 | Employees are strictly enjoined to refrain from the use of intoxicating liquor. |
| LJ050-0254 | The Secret Service in the past has sometimes guarded its right to be acknowledged as the sole protector of the Chief Executive. |
Table 3. Mean values of F0 RMSE, F0 Pearson correlation, MCD, and duration MRE, calculated over 100 test samples for ground-truth recordings resynthesized with the HiFi-GAN vocoder, for the baseline systems, and for several configurations of the proposed system. The arrows next to the metric names indicate the preferred direction (↓ means that the lowest value is the best, whereas ↑ prioritizes the highest value). The values marked in bold are the best among the compared setups.
| System | F0 RMSE ↓ | F0 Pearson ↑ | MCD ↓ | Duration MRE ↓ |
|---|---|---|---|---|
| GT Vocoder | 55.81 | 0.85 | 3.02 | 0.0202 |
| Proposed Vanilla | 110.83 | 0.476 | 6.62 | 0.1565 |
| Proposed λ = 0.2 | 108.69 | 0.484 | 6.63 | **0.1517** |
| Proposed λ = 0.4 | **106.10** | **0.505** | 6.60 | 0.1574 |
| Proposed λ = 0.6 | 108.65 | 0.488 | 6.52 | 0.1566 |
| Proposed λ = 0.8 | 107.52 | 0.497 | **6.50** | 0.1609 |
| Proposed λ = 1.0 | 108.54 | 0.490 | 6.54 | 0.1642 |
| Tacotron2 + WaveRNN | 106.49 | 0.496 | 7.22 | 0.1520 |
| Tacotron2 + Griffin-Lim | 108.56 | 0.477 | 27.24 | 0.1585 |
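The objective metrics in Table 3 are standard in TTS evaluation. As a rough, non-authoritative illustration of how they can be computed, assuming time-aligned F0 contours, mel-cepstral sequences, and reference/predicted durations are already available, one possible Python implementation is sketched below; the exact definitions used in the paper (e.g., of duration MRE) may differ.

```python
import numpy as np

def f0_rmse(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    """Root-mean-square error between aligned F0 contours (voiced frames only)."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2)))

def f0_pearson(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    """Pearson correlation between aligned F0 contours (voiced frames only)."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return float(np.corrcoef(f0_ref[voiced], f0_syn[voiced])[0, 1])

def mcd(mcep_ref: np.ndarray, mcep_syn: np.ndarray) -> float:
    """Mel-cepstral distortion in dB between aligned cepstral sequences of
    shape (frames, order); the 0th (energy) coefficient is excluded."""
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return float(const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1))))

def duration_mre(dur_ref: np.ndarray, dur_syn: np.ndarray) -> float:
    """Mean relative error between reference and predicted durations
    (one assumed interpretation of 'duration MRE')."""
    return float(np.mean(np.abs(dur_ref - dur_syn) / dur_ref))
```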
Table 4. Mean standard deviation of word duration and F0 calculated over 15 generated samples for several textual prompts chosen from the test data set.
| System | F0 | Word Duration |
|---|---|---|
| Proposed λ = 0.2 | 58.07 | 0.0145 |
| Proposed λ = 0.4 | 52.82 | 0.0133 |
| Proposed λ = 0.6 | 49.31 | 0.0112 |
| Proposed λ = 0.8 | 42.27 | 0.0094 |
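Table 4 reports variability across repeated generations of the same prompt. A minimal sketch of this kind of aggregation, under the assumption that per-sample statistics (e.g., per-word durations or frame-level F0 values) have already been extracted for each of the 15 generations, could look as follows; the exact aggregation used in the paper may differ.

```python
import numpy as np

def diversity_std(per_sample_values: list[np.ndarray]) -> float:
    """Mean standard deviation across repeated generations of one prompt.

    `per_sample_values` holds one equal-length array per generated sample,
    e.g. the duration of each word in the sentence. The standard deviation
    is taken across samples and then averaged over positions.
    """
    stacked = np.stack(per_sample_values)   # shape: (num_samples, num_positions)
    return float(np.mean(np.std(stacked, axis=0)))

# Hypothetical example: per-word durations (in seconds) of three generations
# of the same sentence; values are illustrative only.
durations = [
    np.array([0.21, 0.35, 0.18, 0.40]),
    np.array([0.25, 0.31, 0.20, 0.44]),
    np.array([0.19, 0.38, 0.17, 0.42]),
]
print(diversity_std(durations))  # mean per-word duration std across generations
```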