Voice Conversion Using a Perceptual Criterion

In voice conversion (VC), it is highly desirable to obtain transformed speech signals that are perceptually close to a target speaker's voice. To this end, a perceptually meaningful criterion, in which the human auditory system is taken into consideration when measuring the distance between the converted and target voices, was adopted in the proposed VC scheme. The conversion rules for the features associated with the spectral envelope and the pitch modification factor were jointly constructed so that the perceptual distance measure was minimized. This minimization problem was solved using a deep neural network (DNN) framework in which the input features and target features were derived from the source speech signals and a time-aligned version of the target speech signals, respectively. Validation tests were carried out on the CMU ARCTIC database to evaluate the effectiveness of the proposed method, especially in terms of perceptual quality. The experimental results showed that the proposed method yielded perceptually preferable results compared with independent conversion using the conventional mean-square error (MSE) criterion. The maximum improvement in perceptual evaluation of speech quality (PESQ) over the conventional VC method was 0.312.


Introduction
Voice conversion (VC) is a method of changing the features derived from speech signals so that one voice is made to sound like another. If the features of one speaker (the reference speaker) are modified to be close to those of another specific speaker (the target speaker), the resultant speech signals sound as if they were spoken by the target speaker. This technique is referred to as voice personality transformation [1]. Voice personality transformation has numerous applications in a variety of areas, such as personification of text-to-speech synthesis systems [2,3], speaker adaptation for automatic speech recognition [4], reducing the artifacts of abnormal speech [5], and foreign language training systems [6].
VC is closely related to speaker recognition/identification tasks [7] and is realized in practice by synthesizing speech from converted speech parameters. The feature parameters adopted in VC reflect speaker-related characteristics. Typical feature parameters that satisfy this property include Mel-frequency cepstral coefficients (MFCCs) [8,9], linear prediction coefficients (LPCs) [10-13], and line spectrum pair (LSP) coefficients [14-16]. The pitch period and the spectrum of the LP-residual (spectral fine structure) have also been adopted for VC [17]. These play important roles in modifying the source characteristics of the given voices [18].
The ultimate goal of VC is to convert input reference speech sounds so that they perceptually approximate the target speaker's voice. Since MFCCs are computed based on the human auditory system [19], perceptual aspects have been considered to some extent in VC techniques designed to minimize differences in MFCCs. In most VC schemes, however, the differences perceived by the human ear were not sufficiently addressed in constructing the conversion rules. For example, the conversion rules for the spectral envelope of most conventional VC methods either

The Structure of the Proposed VC Method
The overall procedure of the proposed VC method appears in Figure 1, where a typical conventional VC scheme is also presented for comparison. The first step of VC is analysis, which extracts a set of speech feature parameters from both the reference and target speakers. The spectral envelope parameter and the spectral fine structure were used as feature parameters, associated with the vocal tract transfer function and prosody information, respectively. The linear prediction coefficient cepstrum (LPCC) was chosen to represent the spectral envelope, and the spectral fine structure was represented by the pitch period. In practice, even if the reference and target speakers utter the same words, it is unlikely that a synchronized set of feature sequences will be produced. Dynamic time warping (DTW) [39] was therefore first applied as a preprocessing step to time-align these sequences. Time alignment using DTW produced frame-level synchronized sequences, but waveform-level synchronization between neighboring frames is not guaranteed. This could result in undesired pitch-pulse misalignment. To cope with this problem, a synchronized overlap-and-add (SOLA) method [40] was applied to the frame-level time-aligned target speech. Pairings of reference speech and time-aligned (at both the frame level and the waveform level) target speech were used to construct the conversion rules and for evaluation. An example of the final time-aligned target speech, which is subsequently used for construction of the conversion rules and for evaluation, is shown in Figure 2. It shows that the onset/termination times of the time-aligned target speech are relatively consistent with those of the reference speech.
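The frame-level alignment step above can be sketched with a textbook DTW implementation (a minimal sketch in plain NumPy; the function name `dtw_align` and the squared-Euclidean local cost are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

def dtw_align(ref, tgt):
    """Frame-align a target feature sequence to a reference sequence with
    classic dynamic time warping (squared-Euclidean local cost)."""
    n, m = len(ref), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    # Fill the accumulated-cost matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.sum((ref[i - 1] - tgt[j - 1]) ** 2)
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    # Backtrack to recover the warping path of (ref_frame, tgt_frame) pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The recovered path gives, for each reference frame, the target frame(s) whose features form the time-aligned target sequence.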
When $F_H$ and $F_E$ denote the conversion functions for the LPCC and the spectral fine structure, respectively, the optimal conversion rules for conventional VC methods, $F_H^*$ and $F_E^*$, are given by

$$F_H^* = \arg\min_{F_H} D_H\left(F_H(H_r), H_t\right), \qquad F_E^* = \arg\min_{F_E} D_E\left(F_E(E_r), E_t\right), \tag{1}$$

where $H_r = \{h_{r,n}\}_{n=1}^{N}$ and $H_t = \{h_{t,n}\}_{n=1}^{N}$ are the sets of the reference and time-aligned target LPCCs, respectively, and $N$ is the total number of parameters used for constructing the conversion rules. In a similar manner, $E_r = \{e_{r,n}\}_{n=1}^{N}$ and $E_t = \{e_{t,n}\}_{n=1}^{N}$ are the sets of the spectral fine structures for the reference and target speakers, respectively. $D_H$ and $D_E$ are the objective functions for the LPCC and the spectral fine structure, respectively. The mean squared error (MSE) was mostly adopted as the objective function in previous VC methods. Equation (1) indicates that the conversion rules for the two parameters are obtained independently by minimizing each objective function, as shown at the top of Figure 1.
In the proposed method, construction of the two conversion rules was achieved by minimizing a single distance measure,

$$\left(F_H^*, F_E^*\right) = \arg\min_{F_H, F_E} D_{PD}\left(X_t, \hat{X}_t\right), \tag{2}$$

where $D_{PD}$ is the perceptual distance measure explained in the next section, and $X_t$ and $\hat{X}_t$ denote the target and converted (synthesized) speech, respectively. As illustrated in Figure 1, the distance measure is computed using the synthesized speech signals and the target speech signals. That is a major difference from conventional VC schemes, in which the distance measures are computed independently using the corresponding feature parameters. Independent minimization over each feature parameter also yields converted speech that is close to the target speech. However, since it is the synthesized speech signals that are actually heard, it is more desirable to minimize the differences between the target speech and the synthesized (converted) speech. It is not possible to obtain $F_H^*$ and $F_E^*$ simultaneously. Hence, incremental estimation was adopted in this study. Beginning with an initial rule $F_E^{(0)}$, the conversion rules at the $i$-th stage are re-estimated in an alternating manner:

$$F_H^{(i)} = \arg\min_{F_H} D_{PD}\left(X_t, \hat{X}_t \,\middle|\, F_H, F_E^{(i-1)}\right), \tag{3}$$

$$F_E^{(i)} = \arg\min_{F_E} D_{PD}\left(X_t, \hat{X}_t \,\middle|\, F_H^{(i)}, F_E\right). \tag{4}$$

The detailed explanation of minimization (4) is given in the next section. The minimization process is repeated until a convergence threshold is reached. Assuming that each re-estimation stage yields the conversion functions that minimize the perceptual distance, the algorithm ensures a non-increasing sequence of perceptual distances,

$$D_{PD}^{(i+1)} \le D_{PD}^{(i)}, \tag{5}$$

where $D_{PD}^{(i)} = D_{PD}\left(X_t, \hat{X}_t \,\middle|\, F_H^{(i)}, F_E^{(i)}\right)$. This can be proved as follows. First, since the minimum criterion is adopted in the re-estimation stage for $H$, the resulting $D_{PD}$ is at least as small as that after the previous re-estimation stage for $E$. Therefore, the following inequality holds for every $i$:

$$D_{PD}\left(X_t, \hat{X}_t \,\middle|\, F_H^{(i+1)}, F_E^{(i)}\right) \le D_{PD}\left(X_t, \hat{X}_t \,\middle|\, F_H^{(i)}, F_E^{(i)}\right). \tag{6}$$

Next, the $F_E$ re-estimated by (4) yields the minimum $D_{PD}$. Thus, the following inequality also holds for every $i$:

$$D_{PD}\left(X_t, \hat{X}_t \,\middle|\, F_H^{(i+1)}, F_E^{(i+1)}\right) \le D_{PD}\left(X_t, \hat{X}_t \,\middle|\, F_H^{(i+1)}, F_E^{(i)}\right). \tag{7}$$

From (6) and (7), it follows that inequality (5) holds for every $i$.
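The incremental estimation described above can be sketched as a generic alternating-minimization loop. In this sketch, `update_H` and `update_E` are hypothetical callables standing in for the DNN re-estimation of F_H and the pitch-factor search for F_E, and `d_pd` evaluates the perceptual distance for a given pair of rules:

```python
def alternate_minimize(d_pd, update_H, update_E, F_H, F_E,
                       tol=1e-4, max_iter=50):
    """Alternating (incremental) minimization of a joint objective.
    Each half-step minimizes d_pd over one rule with the other fixed,
    so the objective sequence is non-increasing."""
    prev = d_pd(F_H, F_E)
    for _ in range(max_iter):
        F_H = update_H(F_E)   # re-estimate F_H with F_E fixed
        F_E = update_E(F_H)   # re-estimate F_E with F_H fixed
        cur = d_pd(F_H, F_E)
        if prev - cur < tol:  # convergence threshold reached
            break
        prev = cur
    return F_H, F_E
```

The non-increase of the objective is exactly the argument made with the two inequalities above: each half-step can only keep or lower the distance.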

Perceptual Distance
Although the MSE has previously been shown to be a reasonably successful choice, both for modifying speaker individuality and for obtaining transformed voices of high quality, there is no guarantee that the MSE reflects perceptual differences. The objective of this study is to incorporate a distance metric that sufficiently reflects perceptual differences into the conversion rules. Our assumption is that using a perceptually relevant distance metric makes the resulting converted speech sound perceptually closer to the target speech and, it is hoped, outperform the conventional MSE-based methods. There are several ways to implement a perceptually relevant distance metric, most of which exploit the properties of the human auditory system. Hence, frequency-selective emphasis, non-uniform frequency sampling, and loudness transformation were adopted to measure the perceptual distance. In the present study, the structure of the distance metric used in the PESQ, which quantitatively measures the degree of perceptual degradation, was employed to measure the distances between the converted and target speech signals. Accordingly, the traditional MSE-based objective function was modified by incorporating both a symmetrical disturbance, $D^{(s)}$, and an asymmetrical disturbance, $D^{(a)}$ [25]:

$$D_{PD} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{MSE}_n + \alpha D^{(s)} + \beta D^{(a)}, \tag{8}$$

where $\mathrm{MSE}_n$ is the MSE of the $n$-th spectrum,

$$\mathrm{MSE}_n = \frac{1}{M} \sum_{m=1}^{M} \frac{\left( |X_{t,n}(m)|^2 - |\hat{X}_{t,n}(m)|^2 \right)^2}{\sigma_m^2},$$

where $|X_{t,n}(m)|^2$ and $|\hat{X}_{t,n}(m)|^2$ are, respectively, the target and converted power spectra obtained by multiplying the spectral envelope derived from the LPCC, $h_{t,n}(m)$, and the spectral fine structure, $e_{t,n}(m)$, while $\sigma_m$ is the standard deviation of $|X_{t,n}(m)|^2$. Indices $n$ and $m$ denote frame and frequency, respectively, while $M$ is the number of frequency bins. The number of frequency bins was chosen according to the frame length and the sampling frequency, and was set to 256. In (8), $\alpha$ and $\beta$ are weighting factors for each disturbance.
The symmetrical disturbance reflects the absolute difference between the converted and target loudness spectra when auditory masking effects are accounted for. When the symmetrical disturbance is applied to VC, it can be regarded as a distance function between the reference and target speakers measured in a domain that reflects the human auditory system. There are two types of difference patterns in VC: one where the target value is greater than the reference value, and vice versa. Such difference patterns cannot be reflected in distance metrics such as the MSE and the symmetrical disturbance. In contrast, the signs of the loudness differences are considered in the asymmetrical disturbance. Negative differences (loss of a target spectral component) and positive differences (residuals of a reference spectral component) are perceived differently owing to masking effects. By using the asymmetrical disturbance, the differences between the two speakers can be described in more detail, which can lead to an improvement in VC performance. The calculation of the symmetrical and asymmetrical disturbances reflects the human auditory system and is composed of several steps, briefly described as follows [25,30]: (1) Perceptual domain transformation: The target and converted loudness spectra, which are perceptually closer to actual human listening, are obtained as

$$\mathbf{b}_{t,n} = \mathbf{H}\,\mathbf{p}_{t,n}, \qquad \hat{\mathbf{b}}_{t,n} = \mathbf{H}\,\hat{\mathbf{p}}_{t,n}, \qquad s_{t,n}(q) = f_L\left(b_{t,n}(q)\right), \quad q = 1, \ldots, Q,$$

where $Q$ is the number of Bark bands, $\mathbf{H}$ is a Bark transformation matrix that converts the power spectra into Bark spectra, and $f_L(\cdot)$ is the mapping function that converts each band of the Bark spectrum to a sone loudness scale as follows:

$$f_L\left(b(q)\right) = s_l \left( \frac{P_0(q)}{0.5} \right)^{\gamma} \left[ \left( 0.5 + 0.5\,\frac{b(q)}{P_0(q)} \right)^{\gamma} - 1 \right],$$

where $s_l$ is a loudness scaling factor, $P_0(q)$ is the absolute hearing threshold for the $q$-th Bark band, and $\gamma$ is set to 0.23 [25]. (2) Disturbance computation: A relatively small difference between the target and converted loudness spectra can be considered negligible [25,30,41].
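The power-to-loudness mapping in step (1) can be sketched as below, assuming the Zwicker-style formula used in PESQ; the function name and the array interface are illustrative, and bands below the hearing threshold are zeroed as in the standard:

```python
import numpy as np

def bark_to_loudness(bark_spec, p0, s_l=1.0, gamma=0.23):
    """Map a Bark-band power spectrum B(q) to a sone loudness scale
    (Zwicker-style mapping as used in PESQ; hedged sketch).
    p0 holds the absolute hearing threshold P0(q) per Bark band."""
    loud = s_l * (p0 / 0.5) ** gamma * \
        ((0.5 + 0.5 * bark_spec / p0) ** gamma - 1.0)
    loud[bark_spec < p0] = 0.0  # sub-threshold bands are inaudible
    return loud
```

Note the compressive exponent gamma = 0.23 reproduces the loudness growth assumed in the text.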
Accordingly, a center-clipping operator over the absolute difference between the loudness spectra was applied to compute the symmetrical disturbance vector:

$$\mathbf{d}_n^{(s)} = \max\left( \left| \hat{\mathbf{s}}_{t,n} - \mathbf{s}_{t,n} \right| - \mathbf{m}_n,\; \mathbf{0} \right),$$

where $\mathbf{m}_n = 0.25 \cdot \min(\hat{\mathbf{s}}_{t,n}, \mathbf{s}_{t,n})$ is a clipping factor and $|\cdot|$, $\min(\cdot)$, and $\max(\cdot)$ are applied element-wise, while $\mathbf{0}$ is a zero-filled vector of length $Q$. The asymmetrical disturbance vector is obtained as $\mathbf{d}_n^{(a)} = \mathbf{d}_n^{(s)} \odot \mathbf{r}_n$, where $\odot$ denotes element-wise multiplication and $\mathbf{r}_n$ is a vector of asymmetry ratios with components computed from the Bark spectra,

$$r_n(q) = \left( \frac{\hat{b}_{t,n}(q) + c}{b_{t,n}(q) + c} \right)^{\lambda}.$$

For the speech enhancement task, the constants $c$ and $\lambda$ were set to 50 and 1.2, respectively [25]. In this study, experiments were carried out to optimally determine the two constants $c$ and $\lambda$.
The experimental results showed that the same values adopted in [25] also yielded the minimum $D_{PD}$. The symmetrical and asymmetrical disturbance terms in (8) are given by the weighted sum of each disturbance vector,

$$D^{(s)} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{w}_b^{\top} \mathbf{d}_n^{(s)}, \qquad D^{(a)} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{w}_b^{\top} \mathbf{d}_n^{(a)},$$

where the components of the weight vector $\mathbf{w}_b$ are proportional to the widths of the Bark bands, as explained in [30].
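The disturbance computation in step (2) can be sketched per frame as follows. This is a simplified sketch: the additional clipping of the asymmetry ratio applied in PESQ is omitted, and the argument names are illustrative:

```python
import numpy as np

def disturbances(s_target, s_conv, b_target, b_conv, w_b,
                 offset=50.0, lam=1.2):
    """Symmetrical/asymmetrical disturbance for one frame (PESQ-style sketch).
    s_*: loudness spectra, b_*: Bark power spectra, w_b: Bark-band weights."""
    # Center-clipping: small loudness differences are treated as inaudible.
    m_n = 0.25 * np.minimum(s_conv, s_target)
    d_s = np.maximum(np.abs(s_conv - s_target) - m_n, 0.0)
    # Asymmetry ratio emphasizes residual (additive) components.
    r = ((b_conv + offset) / (b_target + offset)) ** lam
    d_a = d_s * r
    # Weighted sums over Bark bands give the per-frame disturbance terms.
    return np.dot(w_b, d_s), np.dot(w_b, d_a)
```

Summing (or averaging) these per-frame values over all frames gives the disturbance terms entering (8).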

Estimation of the Conversion Parameters
The overall procedure for constructing the conversion parameters is explained in Figure 3. Basically, the converted and target speech signals are represented in the frequency domain, since the perceptual distance is computed using the power spectra. The converted spectra are given by multiplying the spectral envelope derived from the LPCC and the pitch-scaled spectral fine structure. A supervised learning framework was adopted wherein a DNN [31] is used to estimate the conversion rule for the LPCC, $F_H$. The power spectrum necessary for calculating the perceptual distance is obtained from the converted LPCC vector $\hat{\mathbf{c}}_n$ by

$$\log \left| \hat{H}_n \right|^2 = \mathbf{G}\, \hat{\mathbf{c}}_n,$$

where $\mathbf{G}$ is the transformation matrix that transforms the LPCC vector into the (log) power spectrum. The elements of the matrix $\mathbf{G}$ are given by

$$G(m, k) = 2 \cos\left( \frac{2\pi m k}{M} \right), \qquad k = 1, \ldots, N_L,$$

where $N_L$ is the order of the LPCC. The updated estimate of the DNN weights $\mathbf{W}$ with a learning rate $\lambda_W$ is computed iteratively as

$$\mathbf{W} \leftarrow \mathbf{W} - \lambda_W \nabla_{\mathbf{W}} D_{PD}.$$

The conversion rule for the spectral fine structure, $F_E$, was realized by pitch modification, wherein the time-domain pitch-synchronous overlap-add (TD-PSOLA) [6] method was applied to the LP-residual of the reference speech. Note that pitch modification was applied only to the voiced regions; hence, the pitch locations of the unvoiced regions were not changed. Since TD-PSOLA is implemented in the time domain, the modified reference spectrum was obtained by the discrete Fourier transform (DFT) of the pitch-scaled LP-residual signal. Estimation of the conversion rule $F_E$ is then formulated as finding the optimal pitch modification factor that minimizes the overall perceptual distortion given the converted LPCCs, as shown in (4). Although there is no explicit relationship between $D_{PD}$ and the pitch modification factor, the convexity of the perceptual distance function over the pitch modification factor was clearly observed for all conversion pairs, as shown in Figure 4.
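Under the standard cepstrum-to-log-spectrum relation, the mapping through the matrix G can be sketched as below (illustrative names; the exact matrix convention in the paper may differ slightly, e.g., in how the 0th coefficient is handled):

```python
import numpy as np

def lpcc_to_power_spectrum(lpcc, n_fft=256):
    """Map an LPCC vector to a spectral-envelope power spectrum via a
    cosine transformation matrix G: the log power spectrum is the cosine
    expansion of the cepstrum (hedged sketch)."""
    n_l = len(lpcc)
    m = np.arange(n_fft)[:, None]   # frequency-bin index
    k = np.arange(n_l)[None, :]     # cepstral index
    G = 2.0 * np.cos(2.0 * np.pi * m * k / n_fft)  # G[m, k] = 2 cos(2*pi*m*k/N)
    G[:, 0] = 1.0                   # the 0th cepstral coefficient enters once
    log_spec = G @ lpcc
    return np.exp(log_spec)
```

Because this is a fixed linear map followed by an exponential, gradients of the perceptual distance can flow back through it to the DNN weights.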
Accordingly, the gradient descent algorithm was employed to find the pitch modification factor:

$$\phi_{n+1} = \phi_n - \lambda_{\phi} \nabla_{\phi} D_{PD},$$

where $\phi_n$ is the pitch modification factor estimated at the $n$-th iteration. The learning rate $\lambda_{\phi}$ was heuristically determined and set to 0.01. Note that the derivative term $\nabla_{\phi} D_{PD}$ cannot be computed analytically. The mean value theorem was employed to approximate $\nabla_{\phi} D_{PD}$ as

$$\nabla_{\phi} D_{PD} \approx \frac{ D_{PD}\left(X_t, \hat{X}_t \mid \phi_n\right) - D_{PD}\left(X_t, \hat{X}_t \mid \phi_{n-1}\right) }{ \phi_n - \phi_{n-1} },$$

where $D_{PD}(X_t, \hat{X}_t \mid \phi)$ is the perceptual distance when the pitch modification factor is given by $\phi$.
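The pitch-factor search can be sketched with a symmetric finite difference in place of the successive-iterate approximation (a hedged sketch; `d_pd` is a hypothetical callable that synthesizes with factor `phi` and evaluates the perceptual distance):

```python
def estimate_pitch_factor(d_pd, phi0=1.0, lr=0.01, delta=1e-2, n_iter=300):
    """Gradient descent on the pitch modification factor, approximating the
    analytically unavailable derivative of D_PD by a symmetric finite
    difference. Convexity of D_PD over phi (Figure 4) justifies this."""
    phi = phi0
    for _ in range(n_iter):
        grad = (d_pd(phi + delta) - d_pd(phi - delta)) / (2.0 * delta)
        phi -= lr * grad
    return phi
```

Since the perceptual distance is observed to be convex in the pitch factor, this simple descent converges to the (single) minimum regardless of the starting point within the search range.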

Experiment Setup
The evaluation was carried out using the CMU ARCTIC database [42] for US English, sampled at 16 kHz. This database was constructed to be phonetically balanced and was originally designed for unit-selection speech synthesis research. It consists of around 1150 utterances and includes US English male and female speakers as well as speakers with other accents. Among them, two male speakers, bdl and rms, and two female speakers, clb and slt, were used. Four voice conversion tasks were investigated: male-to-male (rms → bdl), male-to-female (bdl → slt), female-to-female (slt → clb), and female-to-male (clb → rms) conversion. To obtain the conversion rules, 200 utterances were used, and the remaining 100 utterances were reserved for evaluation. The order of the LPCC was 20. The speech data were analyzed pitch-synchronously at manually labelled pitch marks. For voiced regions, the frame length was set to two or three pitch periods depending on the pitch modification factor [6], whereas for unvoiced regions the frame length was constant (25 ms). A pre-emphasis factor of 0.95 was applied.
Since it is impossible to mathematically determine the optimal number of hidden layers and hidden nodes, we performed several experiments to investigate the relationship between the number of hidden layers and objective performance in terms of the overall perceptual distance. No clear relationship was found. According to the experimental results, the best performance was achieved when the DNN had three hidden layers and each hidden layer had 100 nodes. To prevent the network from converging to poor local minima, a deep generative model of the input features, built by stacking multiple restricted Boltzmann machines (RBMs) [43,44], was adopted to initialize the network. The number of RBM pre-training epochs in each layer was 20. The learning rate of the RBM training was set to 0.0005. A fixed learning rate of 0.001 was applied for fine-tuning of the baseline. The total number of epochs at the fine-tuning stage was 50. For both RBM pre-training and fine-tuning, the momentum was set to 0.05 for the first five epochs and maintained at 0.07 thereafter. Mean and variance normalization was applied to the input and target feature vectors of the DNN. The performance of the DNN was expected to improve with dropout regularization [45]; hence, dropout with a keep probability of 0.8 was employed.
For comparison, the performance of four conventional VC methods was also evaluated: the minimum mean square error (MMSE)-based joint Gaussian method (JGMM) [20], the maximum likelihood trajectory conversion method (JDGMM) [21], dynamic frequency warping with amplitude scaling (DFW) [22], and DNN-based conversion with independent pitch scaling (MLP-ind). For all these methods, pitch modification with a fixed scale factor was employed to convert the spectral fine structures. For each conversion, the pitch modification factors were determined so that the statistical properties (mean and standard deviation) of the converted pitch periods matched those of the target pitch periods [24]. Three measurements, the Mel-cepstral distance (MCD), the perceptual distance (8), and the PESQ, were employed to evaluate the performance of each method objectively. Note that all three measurements are relevant to distortions perceived by the human auditory system; PD and PESQ, however, were newly adopted in this study. Listening tests were conducted to subjectively evaluate the validity of the proposed VC method. An ABX test and a preference test were performed wherein stimuli consisting of 10 sentences were presented to 20 subjects (15 males, 5 females, ages ranging from 21 to 51 years, mean: 34.3, standard deviation: 10.8). All subjects had normal hearing ability. Although they were native Korean speakers, they had participated in many VC tests using English utterances. In the ABX test, the first and second stimuli, A and B, were either the reference speaker's or the target speaker's speech, while the last stimulus, X, was the converted speech produced by the method under test. The subjects were then asked to select either A or B as a candidate for X. The subjects were allowed to listen to each utterance as many times as they wished before making a judgment.
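One of the objective measures above, the MCD, follows a standard definition that can be sketched as below (the dB-scaled form is conventional; excluding the 0th cepstral coefficient is an assumption here, since the paper does not state it):

```python
import numpy as np

def mel_cepstral_distance(mcc_target, mcc_conv):
    """Frame-averaged Mel-cepstral distance in dB between two aligned
    (frames x coefficients) mel-cepstral sequences. The 0th coefficient
    (energy) is excluded, as is common practice."""
    diff = mcc_target[:, 1:] - mcc_conv[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower values indicate converted cepstra closer to the target; unlike PD and PESQ, this metric operates purely on cepstral features rather than on synthesized speech.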
Along with the ABX test, a preference test was conducted in which the same subjects who participated in the ABX test listened to two randomly selected converted utterances per method and conversion pair. The subjects were asked to choose the perceptually preferred stimulus. In this test, each pair of stimuli consisted of two converted utterances, one from the proposed method and the other from one of the conventional methods. Since this test was designed to evaluate overall quality rather than voice personality, the subjects were asked to pay particular attention to the naturalness and intelligibility of the converted speech signals.

Determination of the Weights for Each Disturbance
The weights (α, β) for the disturbances in (8) were first determined so that the average PESQ was maximized. This provided the information needed to calculate the perceptual distance in the subsequent experiments. The average PESQs as a function of the weight for the asymmetrical disturbance, β in (8), are plotted in Figure 5, where the weight for the symmetrical disturbance is α = 1 − β. The correlations between the two variables (average PESQ and the asymmetrical disturbance weight) were −0.9321, −0.9722, −0.9888, and 0.7172 for the conversion pairs r2b (rms → bdl), b2s (bdl → slt), s2c (slt → clb), and c2r (clb → rms), respectively. This means that, except for c2r, lowering the asymmetrical disturbance weight was helpful in increasing the PESQ. Such results differ somewhat from the case of speech enhancement, where the asymmetrical disturbance contributed to an increase in perceptual quality [25]. These results indicate that the perceptual similarity between the converted speech and the target speech was not markedly affected by the residual components of the reference speech spectra or by the loss of the target speech spectra. In speech enhancement [25], by contrast, the residuals of the unwanted components (noise spectra) and the loss of the desired components (signal spectra) strongly affected the perceptual similarity to the original speech signals. A possible reason for these results is that in speech enhancement, the unwanted components are always weakly correlated with the desired signal components, and hence the residuals of the unwanted components and the loss of the desired components seriously degrade the quality of the reproduced speech. In VC, on the other hand, the unwanted and desired components correspond to the reference and target spectra, respectively, and the degree of correlation between the two may vary according to the underlying speech signals (reference and target).
In other words, the usefulness of the asymmetrical disturbance may be determined by the combination of reference/target signals. For example, the two speech signals are perceptually more correlated for the pairs (rms, bdl), (bdl, slt), and (slt, clb) compared with the pair (clb, rms). The conversion clb → rms is a female-to-male conversion, and hence it can reasonably be assumed that the perceptual correlation between the two speakers is not as high as for the other pairs. The conversion bdl → slt is also a cross-gender conversion; however, the degree of correlation in the perceptual domain between these speakers is assumed to be higher than for the pair (clb, rms).
In what follows, the weights for the disturbances that yielded the highest average PESQ were adopted for each conversion pair.
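The weight selection above amounts to a one-dimensional grid search over β with α = 1 − β. A minimal sketch, where `avg_pesq_for_beta` is a hypothetical callable that evaluates the average PESQ of a conversion pair for a given β:

```python
def select_disturbance_weights(avg_pesq_for_beta, betas):
    """Pick (alpha, beta) with alpha = 1 - beta that maximizes the average
    PESQ over a candidate grid of beta values (hedged sketch)."""
    best_beta = max(betas, key=avg_pesq_for_beta)
    return 1.0 - best_beta, best_beta
```

Running this per conversion pair mirrors the per-pair weight choice described in the text.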

Objective Evaluation
The results are presented in Figure 6. The JGMM method showed the best performance in terms of MCD for all conversion pairs. This result was due to the mel-cepstral coefficient (MCC) minimization criterion adopted in the JGMM method. In terms of PD and PESQ, the proposed VC method was superior to the other methods for all conversion pairs. These results can be explained by the fact that the objective function of the proposed VC method was similar to that used in calculating the PESQ. The original purpose of the PESQ was to perceptually compare the overall quality of clean (untouched) speech with that of reconstructed (or distorted) speech [25,30]. The role of the PESQ in distinguishing the voices of different speakers has not been discussed to date. Our assumption was that even if two different speakers uttered the same sentence and the two voices were time-aligned using DTW and SOLA, the PESQ between the two voices would be very low. This assumption was verified by the experimental result wherein the average PESQs between the reference voices and time-aligned target voices were 0.922, 0.581, 1.084, and 0.614 for the rms → bdl, bdl → slt, slt → clb, and clb → rms conversions, respectively. Considering that the range of the PESQ is −0.5 to 4.5 [30], these values are remarkably low, and hence the PESQ is also a good indicator of differences in voice personality. The average PESQ of the proposed method was always higher than that between the reference and time-aligned target voices for all conversion pairs, as shown in Figure 6. Such improvements in PESQ mainly came from conversion toward the target speech, since no attempt to improve quality was made on the reference speech. The experimental results also showed that the correlation between the perceptual distance and the PESQ was −0.7315, whereas the correlation between the MCD and the PESQ was 0.4805.
This was graphically verified by the scatter plots presented in Figure 7, where the perceptual distance is more clearly correlated with the PESQ than the MCD is. Consequently, the distance metric adopted in this study is more useful for predicting perceptual similarity to the target speech than the previously employed distance metric.

Subjective Evaluation
Although it could be inferred from the previous section that voice personality was one of the major factors affecting the PESQ, it was worthwhile to verify whether the PESQ results were consistent with those from a subjective listening test. Figure 8 presents the results of the ABX test; among the methods, the proposed method yielded the highest score for each conversion pair. These results are consistent with those obtained from the objective evaluation in terms of PD and PESQ. This confirms that PD and PESQ predicted the perceptual quality of the converted speech well and could potentially replace subjective listening tests. The listeners indicated that the voices converted by the proposed method sounded clearer than those from the MMSE, JGMM, and JDGMM methods. A common characteristic of these three methods is that the converted features are given by a linear combination of representative vectors (e.g., the mean vectors of each Gaussian component). This results in ambiguous and unclear voices due to averaging effects. Such undesired effects have been alleviated by adopting the global variance (GV) compensation method [21]. The proposed method yielded perceptually preferred voices without GV compensation. It was not clearly verified whether the perceptually more pleasant quality of the proposed method came from the properties of the DNN-based estimator or from the adopted objective function. Considering that the MLP-dep (proposed) method yielded higher preferences than MLP-ind, it can be said that conversion rules based on the perceptual distance contributed to the improvements in perceptual quality.
Consequently, although the evaluations were carried out with a limited number of speakers, a single language, and a limited number of subjects, the results appear promising in that improvements over the conventional MSE-based VC methods, especially in perceptual quality, could be achieved by employing the perceptual distance measure.

Conclusions
A voice conversion method based on a perceptually meaningful criterion was proposed. The objective function incorporates both the conventional MSE and the perceptual disturbance terms. The conversion rules for the spectral envelope and the spectral fine structure were jointly constructed in an iterative manner so that the perceptual distance decreased incrementally. The effectiveness of the proposed method was confirmed through both objective and subjective evaluations. The experimental results also showed that the perceptual distance exhibits a strong correlation with the PESQ. Moreover, the PESQ results were confirmed to be consistent with the subjective listening test results. Currently, a simple conversion method, PSOLA with a global pitch modification factor, is adopted for the spectral fine structure. More sophisticated conversion schemes for spectral fine structures, including pitch modification as well as LP-residual conversion, will be considered in future work.