Effects of Sinusoidal Model on Non-Parallel Voice Conversion with Adversarial Learning

Voice conversion (VC) transforms the speaking style of a source speaker into the speaking style of a target speaker while keeping the linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences: earlier approaches mainly learn a mapping between a given source–target pair from sets of similar utterances spoken by the two speakers. However, parallel data are expensive and difficult to collect, and non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that allows non-parallel many-to-many voice conversion using a generative adversarial network. To the best of the authors' knowledge, our study is the first to employ a sinusoidal model with continuous parameters to generate converted speech signals. Our method requires only several minutes of training examples, without parallel utterances or time-alignment procedures, and no paired source–target utterances are seen during training. Moreover, an empirical study is carried out on the publicly available CSTR VCTK corpus. Our conclusions indicate that the proposed method reaches state-of-the-art results in speaker similarity to utterances produced by the target speaker, while pointing to structural aspects that merit further analysis by experts.


Introduction
Voice conversion (VC) is a task developed to convert the observed identity of a source speaker so that it sounds like a different target speaker, while leaving the linguistic or phonetic content unchanged. This is achieved by adjusting the spectral and prosodic features of the input speaker [1]. This technology has been applied to many potential tasks, such as speech synthesis [2], speech enhancement [3,4], normalization of impaired speech [5,6], and singing style conversion [7], and can also be used for generating new voices for animated and fictional movies. Towards the practical use of these applications, further refinement of the VC approach is necessary.
A large number of popular approaches construct a conversion function with either a statistical Gaussian mixture model (GMM) [8] or Gaussian process regression (GPR) [9] that modifies acoustic features (such as mel-frequency cepstral coefficients) between source and target speakers. However, the simplicity of these models and the lack

Related Work and Motivation
In recent times, successful efforts have been made to develop non-parallel methods. Among them, a conditional variational autoencoder (CVAE) framework was proposed in [23], consisting of two main networks: an encoder and a decoder. The idea is that the input speech samples are first converted by the encoder into latent vectors that capture the linguistic content of the input. Then, a one-hot vector encoding the target speaker identity is created and fed together with the latent vectors into the decoder to produce the target utterance features. This approach is called a conditional VAE because the decoder is conditioned on the target speaker identity vector. One of the problems encountered here is that the decoder output is over-smoothed. This over-smoothing effect is caused by the Gaussian assumptions on the encoder and decoder distributions. In addition, over-smoothing mostly affects the spectral features, as they are represented in a low-dimensional space. Parametric techniques usually suffer from over-smoothing because they use the minimum mean square error or the maximum likelihood function as the optimization criterion. Consequently, the VAE fails to capture the desired details of temporal and spectral dynamics, which commonly results in poor-quality converted speech.
One powerful technique that can possibly overcome the weakness of CVAEs is the cycle-consistent generative adversarial network (CycleGAN), which was proposed in [24] using generators, discriminators, and an identity-mapping loss. The idea is that the source features are transformed to match the features of a target voice via a GAN model. Then, the result is converted back to match the source characteristics via an additional GAN. A combination of cycle-consistency and adversarial losses is finally used to force the linguistic content to be retained in the converted speech. The adversarial loss only tells us whether the generator follows the distribution of the target data and does not ensure that the contextual information is preserved. The cycle-consistency loss is introduced to encourage the generator to find a mapping that preserves the underlying linguistic content between input and output. For these reasons, CycleGAN represents a successful deep learning implementation for finding an optimal pseudo pair from non-parallel data of paired speakers. It does not require any frame alignment mechanism such as dynamic time warping or attention. While this approach was found to work reasonably well for one-to-one mappings, CycleGAN does not scale to many-to-many VC tasks because it requires several generator-discriminator pairs trained separately. This increases the number of learned parameters, the training time, and the memory requirements prohibitively. For VC applications, this can degrade conversion performance when multiple speakers are trained simultaneously. To resolve this issue, StarGAN-VC [25] was recently introduced as a unified model architecture. It allows concurrent training of various domains, i.e., many-to-many mapping within a single network. It can also be noted that StarGAN gives good quality among the GAN-based VC frameworks. However, it lacks stable training [26], which can be overcome by the proposed sinusoidal model using continuous parameters.
Despite the vast amount of research in the literature on non-parallel voice conversion, the speech synthesized from converted features is still far from achieving the quality of the target speaker. Therefore, generating high-quality converted speech remains very challenging and leaves room for improvement. Motivated by this, we propose a new, less expensive sinusoidal architecture as an independent study, which is the first GAN-based sinusoidal model to perform non-parallel VC with only a few training examples. We observe that the proposed model gives objective results more efficiently than the StarGAN scheme and achieves speech quality close to that of the target speaker.

Proposed Continuous Sinusoidal Model
This section introduces our continuous sinusoidal model, a speech analysis/synthesis framework built on continuous parameters.
Based on a sinusoidal model, an analysis/synthesis system is characterized in terms of the amplitudes, frequencies, and phases of the component sine waves to synthesize high-quality speech. The scope of this scenario is designed to model the voiced speech as a sum of quasi-harmonics with instantaneous phases and fundamental frequency (F0) [27]. Conventionally, the F0 contour is discontinuous at voiced-unvoiced (V-UV) boundaries because F0 is not defined in unvoiced sounds. This can cause issues in statistical modeling, which then involves building separate models for voiced and unvoiced frames of speech. On the other hand, a continuous F0 does not need special modeling around V-UV and UV-V transitions. In this study, the V/UV decision is left up to the maximum voiced frequency feature. Moreover, recent work has demonstrated that a neural vocoder (such as the WaveNet model) yields state-of-the-art performance and gives good-sounding speech. However, it requires a large quantity of data and computation power, and generation must proceed sequentially (one sample at a time), making it difficult to deploy for real-time implementation, especially in embedded environments. Therefore, a continuous sinusoidal model (CSM) is built to tackle the limitations of discontinuity in the speech parameters and the complexity of neural vocoders. Unlike the standard source-filter vocoders, CSM uses harmonic features to simplify and enhance the synthesis phase prior to reconstruction. Figure 1 shows the main components of the developed sinusoidal model.
Figure 1. Overview of the developed system. CSM consists of three analysis algorithms (for determining the contF0, spectral envelope, and MVF) and a synthesis algorithm incorporating these parameters.

Analysis Phase: Feature Extraction
Here, we have designed our continuous sinusoidal model (i.e., one in which all parameters are continuous) using three acoustic parameters (contF0, maximum voiced frequency, and mel-generalized cepstrum). We use a continuous F0 tracker (rather than a discontinuous one such as SWIPE, YIN, or DIO) which produces non-zero pitch values even when voicing is not present and does not apply a strict voiced/unvoiced decision. We choose the continuous modeling proposed in [28] as it can be more effective in achieving natural synthesized speech. Another continuous excitation parameter is the maximum voiced frequency (MVF) [29], which was recently shown to result in a major improvement in the quality of synthesized speech. During the synthesis of various sounds, the MVF parameter can be used as a boundary frequency to separate the voiced and unvoiced components. Therefore, contF0, MVF, and MGC [30] parameter streams are calculated during the analysis phase and serve as the features to be converted. The advantage of this CSM is that it is relatively simple; it has only two one-dimensional parameters for modeling the excitation (contF0 and MVF), and the synthesis part is computationally feasible, so speech generation can be performed in real time.
It should be noted that contF0 is still sensitive to additive noise in speech signals and suffers from short-term errors (when it changes rather quickly over time). Moreover, it can produce tracking errors when the speech signal amplitude is low, the voice is creaky, or there is a low harmonic-to-noise ratio (HNR). To mitigate these issues, an instantaneous frequency-based (i.e., the time derivative of the phase) method is employed. In our implementation, the instantaneous frequency IF(ω, t) is given by Flanagan's equation [31]:

IF(ω, t) = [ a(ω, t) ∂b(ω, t)/∂t − b(ω, t) ∂a(ω, t)/∂t ] / [ a²(ω, t) + b²(ω, t) ],

where a and b are, respectively, the real and imaginary parts of the spectrum S(ω) of the waveform. Consequently, contF0 can be further corrected by averaging the instantaneous frequencies measured at the first K harmonics, normalized by the harmonic index:

contF0(t) = (1/K) Σ_{k=1}^{K} IF(k·ω₀, t) / k,

where K represents the number of harmonics used for refining (we set K = 6) and ω₀ denotes the angular frequency of the contF0 at a temporal position t. The performance of the new, corrected contF0 can be compared with that of a reference pitch contour (f0_egg, estimated from the electroglottograph), as shown in Figure 2. We can see that the refined contF0 performs almost equivalently to f0_egg and much better than the baseline [28]. It can also be observed in the unvoiced region (frames 170 to 202 in Figure 2) that the new contF0 is reduced more significantly than for the baseline.
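To make the refinement step concrete, the following Python sketch (using only numpy) estimates Flanagan's instantaneous frequency from the real and imaginary parts of a short-time spectrum and then averages it over the first K harmonics. The finite-difference derivative, the frame and hop settings, and the helper names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def instantaneous_frequency(x, fs, frame_len=1024, hop=80):
    """Flanagan's estimator: IF = (a*db/dt - b*da/dt) / (a^2 + b^2).

    Returns the instantaneous frequency (Hz) for every bin of every frame.
    The time derivative is approximated with finite differences of
    consecutive STFT frames -- an illustrative simplification."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.stack([np.fft.rfft(win * x[i * hop:i * hop + frame_len])
                     for i in range(n_frames)])
    a, b = spec.real, spec.imag
    dt = hop / fs
    da = np.gradient(a, dt, axis=0)
    db = np.gradient(b, dt, axis=0)
    inst_omega = (a * db - b * da) / (a ** 2 + b ** 2 + 1e-12)   # rad/s
    return inst_omega / (2 * np.pi)                              # Hz

def refine_contf0(contf0, if_hz, fs, frame_len=1024, K=6):
    """Average the instantaneous frequency at the first K harmonics,
    normalized by the harmonic index (our reading of the refinement).
    Assumes contf0 is sampled at the same frame positions as if_hz."""
    bin_hz = fs / frame_len
    refined = np.copy(contf0)
    for i, f0 in enumerate(contf0):
        if f0 <= 0:
            continue
        estimates = []
        for k in range(1, K + 1):
            bin_idx = int(round(k * f0 / bin_hz))
            if bin_idx < if_hz.shape[1]:
                estimates.append(if_hz[i, bin_idx] / k)
        if estimates:
            refined[i] = np.mean(estimates)
    return refined
```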
In general, the spectral envelope must be characterized using a small number of parameters, either for reaching a high compression rate or for statistical modeling. Preferably, a mel-generalized cepstral (MGC) analysis is implemented on the speech signals, as these features have shown their efficiency in accurately and robustly capturing the spectral envelope. In this study, we followed the CheapTrick algorithm proposed by Morise [30] and found that 36 coefficients are enough for our synthesized converted speech. As a result, Figure 3 shows the three streams of acoustic information extracted from a speech frame: F0, MVF, and the spectral envelope. We then have an acoustic vector with 36 MGCs, 1 contF0, and 1 MVF.

Synthesis Phase: Synthesized Speech
The synthesis algorithm applied in CSM decomposes each speech frame into a voiced component s_v(t) and a noise component s_n(t) in accordance with the MVF values. This can be described by

s(t) = s_v(t) + s_n(t).

In voiced frames, the harmonic part can be calculated by the following general formula:

s_v(t) = Σ_{k=1}^{K_i} A_k^i cos( 2π k contF0_i t / F_s + ϕ_k^i ),

where A and ϕ are the harmonic amplitudes and phases at frame i, respectively, F_s = 16 kHz is the sampling frequency, t = 0, 1, ..., L, and L is the frame length. K_i is the number of harmonics, which depends on both contF0 and MVF:

K_i = ⌊ MVF_i / contF0_i ⌋.

The harmonic amplitudes are shaped by H_h, a complementary low-pass filter for the harmonic part, and by C_i, the complex harmonic log-amplitude obtained by resampling the MGC envelope on the warped frequency scale, where α is the all-pass factor, which takes 0.42 for F_s = 16 kHz. The phases are obtained recursively from the minimum-phase response between harmonics in adjacent frames, where kγ_i is a linear-in-frequency term which can be attributed to the underlying excitation, and T is the frame shift measured in samples (typically corresponding to a 5 ms interval). The synthetic noise part n(t), with this work building on our previous study [32], is first filtered by a high-pass filter f_h(t) with a cutoff frequency equal to the local MVF, and then modulated by its Hilbert envelope e(t):

s_n(t) = e(t) [ f_h(t) ∗ n(t) ].

In unvoiced frames, on the other hand, the harmonic part is zero (MVF = 0) and the synthetic frame is formed only by noise. Hence, we can reconstruct the speech signal exactly by summing up the harmonic and noise components.
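As an illustration of the synthesis step, the sketch below (numpy plus scipy) generates one frame as a harmonic sum up to the MVF plus a high-pass filtered, envelope-modulated noise component. The flat default amplitudes and phases and the Butterworth high-pass filter are simplifying assumptions of ours; in the actual model they come from the resampled MGC envelope, the recursive phase model, and the filter described in [32].

```python
import numpy as np
from scipy.signal import butter, sosfilt

def synthesize_frame(contf0, mvf, frame_len, envelope=None, fs=16000,
                     amps=None, phases=None):
    """One CSM frame: harmonic sum below the MVF plus high-passed,
    envelope-modulated noise (illustrative sketch only)."""
    t = np.arange(frame_len)
    s_v = np.zeros(frame_len)

    if contf0 > 0 and mvf > 0:                         # voiced frame
        K = max(int(mvf // contf0), 1)                 # number of harmonics
        amps = np.ones(K) if amps is None else amps
        phases = np.zeros(K) if phases is None else phases
        for k in range(1, K + 1):
            s_v += amps[k - 1] * np.cos(2 * np.pi * k * contf0 * t / fs
                                        + phases[k - 1])

    # noise part: high-pass filtered at the local MVF, then modulated by e(t)
    noise = np.random.randn(frame_len)
    if 0 < mvf < fs / 2:
        sos = butter(4, mvf, btype="highpass", fs=fs, output="sos")
        noise = sosfilt(sos, noise)
    envelope = np.ones(frame_len) if envelope is None else envelope
    s_n = envelope * noise

    return s_v + s_n                                   # s(t) = s_v(t) + s_n(t)
```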

Modified Generative Adversarial Networks
Our proposed algorithm is adopted from the StarGAN approach [33], which was proposed for multi-domain image-to-image translation. It differs from the method in [25] by introducing constructive changes to the training architecture in order to accommodate the proposed sinusoidal framework. Our aim is to use a single generator G that can learn mappings between a group of speakers. To achieve this, we train G to convert an attribute of the source speaker domain x ∈ R^{Q×T_x} into the target speaker domain y ∈ R^{Q×T_y}, conditioned on the target domain label c ∈ {1, ..., N}, to generate a new acoustic feature sequence ŷ = G(x, c), where N is the number of domains, Q is the feature dimension, and T is the sequence length. An auxiliary classifier is also introduced, as in [34], which allows the real/fake discriminator D to learn the best decision boundary between the converted and real acoustic features. Hence, our D produces a probability D(ŷ, c′) over both sources and domain labels, while the auxiliary classifier is designed to produce the class probabilities p(c′|y) of the domain label c′ of the input source speech. Figure 4 displays the training process of the suggested approach.
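One common way to realize this conditioning, sketched below in PyTorch, is to broadcast the one-hot target label over the time-frequency plane and concatenate it with the acoustic feature map before the generator. The tensor layout and the helper name condition_on_label are our own illustrative choices, not necessarily the authors' implementation.

```python
import torch

def condition_on_label(features, label, num_domains=8):
    """Concatenate a one-hot speaker label with an acoustic feature map.

    features: (batch, 1, Q, T) tensor of acoustic features
    label:    (batch,) long tensor of target-domain indices
    Returns a (batch, 1 + num_domains, Q, T) tensor for the generator."""
    b, _, q, t = features.shape
    one_hot = torch.nn.functional.one_hot(label, num_domains).float()   # (b, N)
    label_map = one_hot.view(b, num_domains, 1, 1).expand(b, num_domains, q, t)
    return torch.cat([features, label_map], dim=1)

# y_hat = generator(condition_on_label(x, c))   # hypothetical generator call
```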
Converted speech can also be clipped (which inevitably changes the spectrum of speech signals), depending on the input gain. This will partially distort the speaker information contained in the signal. To stabilize the training procedure, we applied three preservation losses (adversarial, classification, and reconstruction) in the objective function to alleviate the issues of over-smoothing caused by statistical averaging. The smaller these losses are, the closer the converted data distribution is to a normal speech distribution.

Adversarial Loss (L adv )
This loss works to render the converted features indistinguishable from the real target features:

L_adv = E_y[ log D(y, c) ] + E_{x,c}[ log(1 − D(G(x, c), c)) ],

where E denotes the expected value, G(x, c) generates fake data conditioned on both the source speaker's data x and the target label c, whilst D aims to distinguish between real and fake data. That is, G seeks to minimize this loss, while D attempts to maximize it.

Classification Loss (L cls )
In order to synthesize an acoustic feature that belongs to the target domain, we append an auxiliary classifier C on top of D and impose the L_cls when updating both D and G. Thus, we decompose this loss into (a) the classification loss of real speech data, L^real_cls, used to optimize D:

L^real_cls = E_{y,c′}[ −log D(c′|y) ],

where the term D(c′|y) represents the probability distribution of the real speech data y over the domain labels, computed by D. By minimizing this loss, D tries to classify y into its corresponding domain c′. We assume that the input data and domain labels are provided by the training examples. Moreover, (b) the classification loss of fake speech data, L^fake_cls, used to optimize G:

L^fake_cls = E_{x,c}[ −log D(c|G(x, c)) ],

where D(c|G(x, c)) represents the probability distribution of the fake data G(x, c) over the domain labels, computed by D. Hence, the idea is to minimize L^real_cls with respect to C and L^fake_cls with respect to G in order to construct data that can be classified as the target domain c.

Reconstruction Loss (L rec )
The L_adv and L_cls encourage a converted acoustic feature to become realistic and classifiable, respectively. However, they do not guarantee that the converted feature preserves the linguistic content while changing only the speaker domain-related information. To alleviate this problem, the L_rec is used:

L_rec = E_{x,c,c′}[ ‖x − G(G(x, c), c′)‖₁ ].

This L_rec encourages G to find an optimal source and target pair that does not compromise the composition. Here, ‖·‖₁ denotes the L1 norm, G(x, c) is the generated data conditioned on x and the target domain label c, and G(G(x, c), c′) reconstructs the original speech x, conditioned on G(x, c) and the original domain label c′.

Full Objective
By combining the above three losses, a mapping from an input x to the desired output y can be learned from unpaired training examples. More precisely, the objective functions to optimize G and D by minimizing L_G and L_D are:

L_G = L_adv + λ_cls · L^fake_cls + λ_rec · L_rec,
L_D = −L_adv + λ_cls · L^real_cls,

where λ_cls ≥ 0 and λ_rec ≥ 0 are regularization parameters that weight the classification and reconstruction losses, respectively, relative to the adversarial loss. In this study, we use λ_cls = 1 and λ_rec = 10.
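A minimal PyTorch-style sketch of how these objectives could be assembled is given below. It assumes a discriminator that returns a real/fake probability together with domain-classification logits; the variable names and the cross-entropy formulation are our illustrative choices rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def compute_objectives(G, D, x, c_src, y, c_tgt, lambda_cls=1.0, lambda_rec=10.0):
    """Full objectives L_G and L_D of the adversarial training (sketch).

    D is assumed to return (real/fake probability, domain logits); in
    practice loss_D would be computed with y_fake.detach()."""
    y_fake = G(x, c_tgt)

    # adversarial loss: D maximizes it, G minimizes it
    p_real, cls_real = D(y, c_tgt)
    p_fake, cls_fake = D(y_fake, c_tgt)
    l_adv = torch.log(p_real + 1e-8).mean() + torch.log(1 - p_fake + 1e-8).mean()

    # domain-classification losses for real and fake data
    l_cls_real = F.cross_entropy(cls_real, c_tgt)
    l_cls_fake = F.cross_entropy(cls_fake, c_tgt)

    # cycle-reconstruction loss back to the source domain (L1 norm)
    x_rec = G(y_fake, c_src)
    l_rec = torch.mean(torch.abs(x - x_rec))

    loss_G = l_adv + lambda_cls * l_cls_fake + lambda_rec * l_rec
    loss_D = -l_adv + lambda_cls * l_cls_real
    return loss_G, loss_D
```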

Conversion Process
First, we feed the source speech into the CSM analyzer to extract the contF0, MVF, and spectral envelope. Then, we build an acoustic feature vector by stacking those parametric features. To have zero mean and unit variance over the training dataset, we normalize the acoustic features over all speakers. Next, both the generator and the discriminator are optimized in an iterative way during training (over a number of epochs), where one module is updated while the model parameters of the other are fixed. The contF0 and MVF are converted frame by frame using the statistical distributions (mean-variance transformations) of the source and target speakers. For example, to convert the contF0:

contF0_f^i = μ_y + (σ_y / σ_x) · (contF0_x^i − μ_x),

where contF0_f^i is the log-scaled contF0 after conversion at frame i; μ_x and σ_x are the mean and standard deviation of the source speaker's log-scaled contF0, respectively; μ_y and σ_y are the mean and standard deviation of the target speaker's log-scaled contF0, respectively; and w_i are frame weights with Σ_{i}^{N} w_i = 1. For the spectral envelope, we use a low-dimensional representation in a 36-MGC domain to reduce complexity. Thus, the MGC features of the source speaker are converted into those of the target speaker using the trained model described in Section 4. Once the training process is completed, we use the CSM synthesizer to generate speech from the converted features. The converted speech presents both the linguistic content of the source utterance and the speaker attributes of the target (identity, gender, and accent). Algorithm 1 below summarizes the process of our acoustic modeling.
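The frame-wise mean-variance transformation of the log-scaled contF0 can be sketched as follows; the function name and the small flooring constant are illustrative assumptions.

```python
import numpy as np

def convert_contf0(src_contf0, src_stats, tgt_stats):
    """Frame-wise mean-variance conversion of the log-scaled contF0.

    src_stats / tgt_stats are (mean, std) pairs computed on the
    log-scaled contF0 of the source and target speakers."""
    mu_x, sigma_x = src_stats
    mu_y, sigma_y = tgt_stats
    log_f0 = np.log(np.maximum(src_contf0, 1e-6))     # contF0 is always > 0
    converted = mu_y + (sigma_y / sigma_x) * (log_f0 - mu_x)
    return np.exp(converted)

# f0_converted = convert_contf0(f0_source, (mu_src, std_src), (mu_tgt, std_tgt))
```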

Network Architecture
We use a 2D convolutional neural network (CNN) to build the generator and discriminator. The generator network comprises two convolutional layers, six CNN residual blocks [35], and two transposed CNN layers, with a 2 × 2 stride for downsampling and upsampling. In contrast, two separate networks of five CNN layers each are used for the discriminator and classifier. Instance normalization [36] is used in the generator, as this greatly improves the stability of training, but no normalization is employed in the discriminator. All these layers are followed by a rectified linear unit (ReLU) activation function, and the output layer is followed by a sigmoid activation function. Each model is trained using the Adam optimizer [37] with β1 = 0.9 and β2 = 0.999. The batch size is set to 32.
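A compact PyTorch sketch of this kind of architecture is shown below. The kernel sizes, channel counts, and the exact residual-block design are our own assumptions, since only the layer types, strides, normalization, and activations are specified above.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    """Two down-sampling convs, six residual blocks, two transposed convs."""
    def __init__(self, in_ch=9, base=64):       # 1 feature map + 8 label maps
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
                  nn.InstanceNorm2d(base), nn.ReLU(),
                  nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
                  nn.InstanceNorm2d(base * 2), nn.ReLU()]
        layers += [ResBlock(base * 2) for _ in range(6)]
        layers += [nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
                   nn.InstanceNorm2d(base), nn.ReLU(),
                   nn.ConvTranspose2d(base, 1, 4, stride=2, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Five conv layers, no normalization, sigmoid real/fake output."""
    def __init__(self, base=64):
        super().__init__()
        chs = [1, base, base * 2, base * 4, base * 8]
        convs = []
        for i in range(4):
            convs += [nn.Conv2d(chs[i], chs[i + 1], 4, stride=2, padding=1),
                      nn.ReLU()]
        convs += [nn.Conv2d(chs[-1], 1, 3, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*convs)

    def forward(self, x):
        return self.net(x)

# optimizer settings as described in the text
# g_opt = torch.optim.Adam(Generator().parameters(), betas=(0.9, 0.999))
# d_opt = torch.optim.Adam(Discriminator().parameters(), betas=(0.9, 0.999))
```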

Algorithm 1 Acoustic modeling for voice conversion
Require: Feature extraction and initialization
1: x := source features; c := speaker label; y := target features
2: Set parameters G, D
3: Initialize batch size m, learning rate η, loss weights λ, number of total iterations n

Begin: Adversarial training model
1: for epoch = n_1, ..., n_k do
2:   for training examples in (x, c, y) do
3:     for i = 1, ..., m do
          ŷ = G(x, c)
          D(ŷ, c′)
4:     end for
5:     optimize G and D by minimizing the losses L_G and L_D
6:     update D while fixing G
7:   end for
8: end for
End

Begin: Generation of converted speech
1: for training data in (x, c′, y) do
      generate contF0′, MVF′, MGC′
      synthesize CSM(contF0′, MVF′, MGC′)
2: end for
End
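The following training skeleton mirrors Algorithm 1 using the hypothetical Generator, Discriminator, and compute_objectives sketches given earlier; the learning rate and the mini-batch format of the data loader are assumptions.

```python
import torch

def train(G, D, loader, epochs, lr=1e-4):
    """Alternating updates of D and G, as in Algorithm 1 (sketch)."""
    g_opt = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.9, 0.999))
    d_opt = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.9, 0.999))
    for epoch in range(epochs):
        for x, c_src, y, c_tgt in loader:          # non-parallel mini-batches
            # update D while keeping G fixed
            _, loss_D = compute_objectives(G, D, x, c_src, y, c_tgt)
            d_opt.zero_grad(); loss_D.backward(); d_opt.step()
            # update G while keeping D fixed
            loss_G, _ = compute_objectives(G, D, x, c_src, y, c_tgt)
            g_opt.zero_grad(); loss_G.backward(); g_opt.step()
```

In practice the discriminator update would detach the generated features, as noted in the earlier sketch, but the alternating structure is the same.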

Evaluation and Discussion
This section describes the experimental setup, statistical evaluations, and a perceptual listening test carried out to confirm the performance of the proposed framework.

Experimental Setup
Unlike [25,33], this work is conducted on the CSTR VCTK corpus [38], which contains 46 h of English speech from 109 speakers with various accents. From this database, our proposed system was evaluated on the data of 8 speakers (4 males and 4 females). The examples of each speaker are split into 90% training and 10% test sets. As the conversion setting is non-parallel, we did not use any time-alignment procedure for training, and each speaker reads a different set of sentences. All speech signals were resampled to a sampling rate of 16 kHz. The 36-dimensional MGC, 1-dimensional MVF, and 1-dimensional contF0 are extracted from the speech of the speakers, and the corresponding statistics (means and standard deviations) are computed. The acoustic features were computed using a 25 ms window and a 5 ms frameshift. We conduct 8 inter-gender (female-to-male and male-to-female) conversions and 4 intra-gender (male-to-male and female-to-female) conversions.
As there are 8 speakers involved in our experiments, c is represented as an 8-dimensional one-hot vector, and there were 12 different combinations of source and target speakers in total. Thus, there are 960 samples (10 utterances × 8 speakers × 12 tests) to be evaluated in total. An NVidia Titan X GPU is used for training. The state-of-the-art reference system in this study follows the structure of the StarGAN-VC model [25], which achieved greater naturalness of the converted speech in many-to-many non-parallel VC; we therefore compare our proposed system against it. For the comparative evaluation, we exclude any parallel data from the training and testing sets.

Statistical Evaluations
To show that our model is able to convert speaker characteristics, several performance metrics are used. Since the spectral distortion is designed to compute the distance between two power spectra, the root mean square (RMS) log spectral distance (LSD) metric is used here to carry out the evaluation:

LSD_RMS = √( (1/N) Σ_{k=1}^{N} [ 10 log₁₀( P(f_k) / P̂(f_k) ) ]² ),

where k is the frequency bin, P(f) is the spectral power magnitude of the real speech, whilst P̂(f) is the spectral power magnitude of the converted speech, and both are defined at N frequency points. The optimal value of LSD_RMS is zero, which indicates matching frequency content. Here, we average the LSD_RMS values over the whole set of tested inter-gender examples (female-to-male and male-to-female) separately:

LSD̄_RMS = (1/M) Σ_{m=1}^{M} LSD_RMS^(m),

where M is the total number of examined examples, which is equal to 10 in this study. We also inspect example spectrograms of the converted voices to see whether the conversion captures the target speaker. The results are shown in Figure 5: the proposed framework gives lower LSD_RMS values, equal to 1.7 dB (female to male) and 2 dB (male to female), and spectrograms closer to the target speech than the StarGAN model. Consequently, our proposed system introduces a smaller distortion to the sound quality and approaches a correct spectral criterion.
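A small numpy sketch of the RMS log spectral distance defined above (the epsilon guard is an implementation detail of ours):

```python
import numpy as np

def lsd_rms(power_real, power_conv, eps=1e-12):
    """RMS log spectral distance (dB) between two power spectra
    defined at the same N frequency points."""
    log_ratio = 10.0 * np.log10((power_real + eps) / (power_conv + eps))
    return np.sqrt(np.mean(log_ratio ** 2))

# average over the M evaluated utterances
# lsd_mean = np.mean([lsd_rms(P, P_hat) for P, P_hat in utterance_pairs])
```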

In addition, the empirical cumulative distribution function (ECDF) [39] of the phase distortion mean (PDM) [40] is computed and presented in Figure 6. The reason for computing this function is to see whether these conversion methods are normally distributed and to compare them with the natural target speech. PDM is estimated in this experiment at a 5 ms frameshift using COVAREP (http://covarep.github.io/covarep/ accessed on 24 December 2016), where N is the number of frames, PD is the phase difference between two consecutive frequency components, and we denote the phase by ∠. The standard deviation of the phase distortion (PDD) is supposed to represent the noisiness of the voice source, which allows for obtaining a more robust estimate of the source shape in transients. Additionally, conventional systems had issues with modeling the high-frequency voiced/voiceless content, so we wanted to show that our system is better in this sense.
As we wanted to quantify the noisiness in the higher frequency bands only, we zero out the PDD values below the MVF contour. The ECDF F_n(x) is expressed as

F_n(x) = (1/n) Σ_{i=1}^{n} I(X_i ≤ x) = #{X_i ≤ x} / n,

where X_1, ..., X_n are the PDM variables, n is the number of experimental observations, #A represents the number of elements in the sample with X ≤ x, and I is the indicator function. In the positive x-axis region of the distribution shown in Figure 6, the proposed system is better reconstructed than the StarGAN approach. These outcomes support the view that developing a sinusoidal VC system is beneficial and can substantially transform the source speaker into a particular target without parallel corpora. In the negative x-axis region, as shown in Figure 6, the proposed system still yields better voice conversion performance than the others and almost reaches the target distribution for intra-gender pairs (male/male or female/female in Figure 6). Hence, the experimental results objectively validate the success of the sinusoidal VC system, which offered better performance than the state-of-the-art StarGAN.
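For completeness, a numpy sketch of the ECDF used above; the PDM observations themselves would come from the COVAREP toolbox, so the input array here is only a placeholder.

```python
import numpy as np

def ecdf(pdm_values):
    """Empirical CDF F_n(x) of the PDM observations: returns the sorted
    values and F_n evaluated at each of them."""
    x = np.sort(np.asarray(pdm_values))
    n = x.size
    f = np.arange(1, n + 1) / n          # #(X_i <= x) / n at each sorted point
    return x, f

# x_prop, f_prop = ecdf(pdm_proposed)    # e.g., plot against the target ECDF
```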

Qualitative Evaluations
A perceptual listening test was created to assess the performance of our proposed model on a non-parallel many-to-many VC task. We conducted a speaker similarity test to judge how similar the converted speech is to the target speaker as opposed to the natural source speaker. The listeners gave a score for each stimulus, from 0 (highly similar to the source speaker) to 100 (highly similar to the sound of the target speaker). The listeners were given one reference example from both the source and target speakers in order to distinguish and identify the target speaker. Then, listeners had to evaluate the heard sentences based on the target speaker information (gender, accent, volume, etc.). Different sentences were randomly selected from each conversion pair and given in randomized order. Thus, 81 utterances were involved in the listening test (3 types × 27 sentences) and were randomly presented to the participants. Fifteen participants (seven males and eight females) took part in the experiment, and the listening test took roughly 12 min. The audio samples can be found online (https://malradhi.github.io/contSM-VC accessed on 14 June 2021). Figure 7 shows the scores of the similarity test; the error bars represent the 95% confidence interval. As the results show, the developed and reference systems accomplish the same performance with respect to the target voice: the proposed model has acceptably transformed the source voice to the target voice in both the same-gender and cross-gender cases, and there is no statistically significant difference between the proposed and StarGAN systems. The proposed framework thus validates the effectiveness of the sinusoidal model with continuous fundamental frequency in the conversion pipeline. From this perceptual test, we can conclude that our model is able to produce a voice like that of the target speaker, on par with the system that relies on the more complicated discontinuous F0.


Conclusions
This paper proposed a novel alternative framework for advancing the accuracy of non-parallel many-to-many voice conversion. The main idea was to employ a sinusoidal model with continuous parameters to generate converted speech signals with an adversarial training network. The main advantages of the sinusoidal model are the high accuracy of harmonic parameter estimation and increased fidelity in converting the source speaker to the target speaker. The empirical studies confirmed that the proposed approach has a better ability to convert the source speaker to the target one than the state-of-the-art system. The results of the listening test also revealed the effectiveness of the suggested system in producing synthetic speech of quality comparable to StarGAN.
As future directions, we will attempt to increase the usability of our sinusoidal approach by using the Griffin-Lim algorithm (GLA) [41] and the d-vector technique [42]. The GLA is a phase reconstruction method that involves an iterative procedure for estimating a signal from its short-time Fourier transform (STFT) magnitude. The GLA is based only on consistency and does not take any prior knowledge about the target signal into account. We also plan to adopt a deep vector or "d-vector" method for voice conversion. A d-vector is a fixed-dimensional representation of a speech utterance that enables more natural voice conversion. We hope that our work enables users to generalize to more VC tasks (e.g., multi-emotion and multi-pronunciation).

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable. The study did not report any data.