Voice Transformation Using Two-Level Dynamic Warping

Voice transformation, for example, from a male speaker to a female speaker, is achieved here using two-level dynamic warping. An outer warping process, which temporally aligns blocks of speech (dynamic time warp), invokes an inner warping process, which spectrally aligns based on magnitude spectra (dynamic frequency warp). The mapping function produced by the dynamic frequency warp is used to move spectral information from a source speaker to a target speaker. Information obtained by this process is used to train an artificial neural network to produce spectral warping output information based on spectral input data.


INTRODUCTION
Voice transformation refers to the process of changing the parameters of speech, or changing voice personality, so that speech uttered by one speaker (the source speaker) sounds as if another speaker (the target speaker) had spoken it [1]. Voice transformation has applications in text-to-speech synthesis (TTS), international dubbing, health care, multimedia, language education, music, and preprocessing for speech recognition [2,3,4]. In conventional voice transformation, acoustic parameters (such as pitch and formant information) of the speech signal are collected from the source and target speaker voices to find a transformation map that can be used to map the parameters of the source speaker onto the target speaker signal [5]. Altering formant frequencies by moving the pole locations of the related formant changes the vocal tract model; that movement plays an important role in voice transformation [6]. Extracting, modifying, and mapping pitch is also considered an important factor in voice transformation [7].
The approach taken here avoids the need to estimate acoustic parameters (e.g., pitch or a formant model) and instead deals directly with spectral information. The transformation is accomplished using a two-level dynamic warp (DW). With the two-level DW it is straightforward to map the source speech to the target speech when both are available. But if the target speech is already available saying the desired sentence, why bother with the transformation? A more challenging, but realistic, setting is when the target is not available saying the desired statement. Thus, a second phase of the research is to train an artificial neural network to produce the spectral warping function from only the source speaker information, based upon which the source speech may be warped toward the target speech.

DYNAMIC WARPING
Dynamic time warping (DTW) is a method that can be used to find an optimal alignment between sequences of feature vectors from two different sources [8]. DTW has been used for a wide range of applications, such as speech recognition [9], human motion animation [10], human activity recognition [11], and processor cache analysis [12]. We have extended DTW to a two-level mode, where the outer level seeks temporal alignment, using a norm computed by an inner frequency warping that aligns spectra. This has been used as a tool to help train caregivers working with individuals with speech disabilities [8].
As the name "dynamic time warping" suggests, DTW temporally aligns blocks of features to compensate for different speaking rates, for example aligning a male speech segment with a female speech segment by comparing how certain features evolve as a function of time. Dynamic frequency warping (DFW), or "inner warping," can likewise perform spectral alignment based on the spectral magnitudes of blocks of speech data, such as aligning the spectral features of a male speaker to the spectral features of a female speaker. In this paper we apply the combination of inner and outer warping, referring to it simply as "dynamic warping," or DW.
Our explanation of the DW starts with a brief summary of the outer warping (DTW). For two speakers i (i = 1, 2), let s_i(τ, m) denote the feature at temporal block number τ and frequency index m. The vector s_i(τ, :) is a (spectral) feature vector at block τ of speaker i. A speech signal for speaker i (such as a spoken sentence) is represented as a sequence of T_i feature vectors computed from overlapping blocks of speech information

S_i = {s_i(1, :), s_i(2, :), ..., s_i(T_i, :)},  i = 1, 2.

Let d(s_1(τ_1, :), s_2(τ_2, :)) denote a metric distance between the vectors s_1(τ_1, :) and s_2(τ_2, :) at specific block times τ_1 and τ_2. This metric is computed using DFW, as explained below. DTW is employed to find the distance between the sequence S_1 and the sequence S_2. The notation d_T(τ_1, τ_2) represents the minimal cost between the sequences S_1 and S_2, up to block numbers τ_1 and τ_2, allowing some alignment motion between blocks of the two different speakers. (The subscript T refers to time warping.) d_T(τ_1, τ_2) is recursively computed as

d_T(τ_1, τ_2) = d(s_1(τ_1, :), s_2(τ_2, :)) + min{ d_T(τ_1 - 1, τ_2), d_T(τ_1 - 1, τ_2 - 1), d_T(τ_1, τ_2 - 1) }.

At the end of that specific portion of speech (τ_1 = T_1, τ_2 = T_2), time warping produces an overall warped metric distance d_T(T_1, T_2) between the sequences S_1 and S_2, which may be represented as d_T(S_1, S_2). This is suggested in Figure 1. This DTW accounts for temporal shifts due to differences in speaking rate between the two speakers. The output of this DTW is the temporal alignment of the source signal with respect to the target signal.

978-1-7281-4300-2/19/$31.00 ©2019 IEEE. Asilomar 2019.

Fig. 1. Distance Measure for Outer Dynamic Warping (axes: Source Speaker, Target Speaker)
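The outer recursion can be sketched in Python. This is an illustrative implementation, not the authors' code: the local distance is passed in as a callable (in the paper it is itself computed by the inner DFW, but any vector metric can stand in for it here).

```python
import numpy as np

def dtw_cost(S1, S2, dist):
    """Outer DTW: minimal alignment cost between feature-vector
    sequences S1 (T1 x K) and S2 (T2 x K), using the recursion
    d_T(t1, t2) = dist(s1, s2) + min of the three predecessor costs."""
    T1, T2 = len(S1), len(S2)
    D = np.full((T1 + 1, T2 + 1), np.inf)   # borders are unreachable
    D[0, 0] = 0.0
    for t1 in range(1, T1 + 1):
        for t2 in range(1, T2 + 1):
            d = dist(S1[t1 - 1], S2[t2 - 1])
            D[t1, t2] = d + min(D[t1 - 1, t2],
                                D[t1 - 1, t2 - 1],
                                D[t1, t2 - 1])
    return D[T1, T2]   # overall warped distance d_T(S1, S2)
```

Two identical sequences align along the diagonal with zero cost, which is a quick sanity check on the recursion.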
As mentioned above, the distance d(s_1(τ_1, :), s_2(τ_2, :)) is obtained through a warping process between these feature vectors. This warping is called dynamic frequency warping (DFW), or inner warping. Frequency warping is calculated in a way similar to the time warping. Let s_i(:) = s_i(τ_i, :), i = 1, 2, denote the spectral information at time slice τ_i that is passed in from the outer warping (DTW) to the inner warping (DFW). DFW is applied to find the distance between s_1(:) and s_2(:) as

d_S(k_1, k_2) = dist(s_1(k_1), s_2(k_2)) + min{ d_S(k_1 - 1, k_2), d_S(k_1 - 1, k_2 - 1), d_S(k_1, k_2 - 1) }.

Here the subscript S refers to a frequency (spectral) warping, and dist(s_1(k_1), s_2(k_2)) represents the metric distance between elements of the spectral vectors. In our case, we use spectral feature vectors computed as the positive-frequency components of the FFT of windowed data. The metric distance looks only at spectral magnitude, so

dist(s_1(k_1), s_2(k_2)) = | |s_1(k_1)| - |s_2(k_2)| |.

At the end of the frequency warping process, the distance between spectral vectors is computed as

d(s_1(:), s_2(:)) = d_S(K, K),

where K is the number of elements in each spectral feature vector. DFW also produces sequences of indices a = (a(1), a(2), ..., a(M)) and b = (b(1), b(2), ..., b(M)), called the warping function paths, such that the new spectral vectors s_1(τ_1, a(:)) and s_2(τ_2, b(:)) are as similar as possible (e.g., peaks and valleys of s_1 align with peaks and valleys, respectively, of s_2). The source speaker s_1 spectrum is transformed to match the target speaker s_2 by creating a modified source speaker ŝ_1 according to

ŝ_1(τ_1, b(i)) = s_1(τ_1, a(i)),  i = 1, 2, ..., M.

This spectral alignment map drags spectral components of source blocks to produce transformed speech in the frequency domain. This data is inverse Fourier transformed and added in sequence to produce the transformed signal. After warping, filtering is performed to mitigate signal processing artifacts. The length M of the warping function paths may vary from one spectral feature vector to another, depending on how many turns the spectral warping paths have.
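A minimal sketch of the inner warping and the spectral mapping, assuming the same three-predecessor recursion as the outer warp plus a standard backtracking step to recover the path indices (function names are illustrative, not from the paper):

```python
import numpy as np

def dfw_path(s1, s2):
    """Inner DFW: align two magnitude spectra s1, s2 (each length K);
    return warping-path index sequences a, b and the cost d_S(K, K).
    The local metric compares spectral magnitudes only."""
    K = len(s1)
    D = np.full((K + 1, K + 1), np.inf)
    D[0, 0] = 0.0
    for k1 in range(1, K + 1):
        for k2 in range(1, K + 1):
            d = abs(abs(s1[k1 - 1]) - abs(s2[k2 - 1]))
            D[k1, k2] = d + min(D[k1 - 1, k2],
                                D[k1 - 1, k2 - 1],
                                D[k1, k2 - 1])
    # Backtrack from (K, K) to recover the path (0-based indices).
    a, b = [], []
    k1, k2 = K, K
    while k1 > 0 and k2 > 0:
        a.append(k1 - 1)
        b.append(k2 - 1)
        step = np.argmin([D[k1 - 1, k2 - 1], D[k1 - 1, k2], D[k1, k2 - 1]])
        if step == 0:
            k1, k2 = k1 - 1, k2 - 1
        elif step == 1:
            k1 -= 1
        else:
            k2 -= 1
    a.reverse()
    b.reverse()
    return np.array(a), np.array(b), D[K, K]

def warp_spectrum(s1, a, b, K):
    """Drag source spectral components to target bins: s1_hat[b[i]] = s1[a[i]].
    Where b repeats an index, the last assignment wins."""
    s1_hat = np.zeros(K, dtype=s1.dtype)
    s1_hat[b] = s1[a]
    return s1_hat
```

For identical spectra the recovered path is the diagonal and the warped spectrum equals the source, matching the intent that aligned peaks and valleys map onto each other.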
The combination of the two warping processes is illustrated in Figure 2. Starting from the speech signal at the bottom of the diagram, the speech signal is split into overlapping segments. Feature vectors (spectral information) for each segment are computed using an FFT. This spectral feature information is passed through DTW for temporal alignment, where at every stage of the temporal alignment, spectral alignment (warping) is performed using DFW.

SPECTRAL FEATURES
Experiments were carried out on the CMU ARCTIC database, which consists of seven speakers (two US females, two US males, one Canadian male, one Scottish male, and one Indian male). Each speaker recorded a set of 1132 phrases. In the experiments presented here, one US male and one US female were chosen, and the voice transformation is from male to female.
Speech data was sampled at 16000 samples/sec. The feature vectors used in the DW were the positive-frequency spectral information, calculated using the FFT. Each sentence was temporally segmented into 32-ms segments using a Hamming window with 16-ms overlap, zero-padded, then transformed using a 512-point FFT. The K = 256 positive-frequency spectral elements were used as a feature vector.
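The feature extraction just described can be sketched as follows; the parameter values come from the text, while the function name and the exact handling of the final partial segment are assumptions.

```python
import numpy as np

def spectral_features(x, fs=16000, seg_ms=32, hop_ms=16, nfft=512):
    """Return a (num_segments x 256) array of positive-frequency FFT
    magnitudes: 32-ms Hamming-windowed segments with a 16-ms hop
    (50% overlap), transformed by a 512-point FFT."""
    seg = int(fs * seg_ms / 1000)      # 512 samples per segment at 16 kHz
    hop = int(fs * hop_ms / 1000)      # 256-sample hop
    win = np.hamming(seg)
    feats = []
    for start in range(0, len(x) - seg + 1, hop):
        frame = x[start:start + seg] * win
        spec = np.fft.rfft(frame, n=nfft)        # 512-point FFT
        feats.append(np.abs(spec[:nfft // 2]))   # K = 256 positive bins
    return np.array(feats)
```

One second of 16 kHz audio yields 61 such segments, each contributing a 256-element magnitude feature vector.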

SPECTRAL WARPING USING ANN
The two-level dynamic warping procedure described above was used to obtain training data for an artificial neural network (NN). The NN input data was the spectral feature vectors from the source speaker. The NN output data was the spectral warping path information. The length of the warping paths a and b may differ from one segment of speech to another. In order for the neural network to have a constant output length (without zero-padding the output data, which would introduce warping artifacts), the a and b information is interpolated to produce modified paths ã and b̃ which have the same (maximum) length for all segments of the speech vector (Figure 5 illustrates a normal and an interpolated path). The same interpolation applied to path b yields b̃. Once these interpolated values are computed, the warped spectrum is obtained by creating an interpolated spectrum according to ŝ_1(b̃(i)) = s_1(ã(i)), i = 1, 2, ..., M.
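The fixed-length path interpolation can be sketched as below; the function name `resample_path` and the step of rounding the interpolated values back to integer bin indices are illustrative assumptions, since the paper does not spell out these details.

```python
import numpy as np

def resample_path(path, M):
    """Linearly interpolate a warping-path index sequence to a fixed
    length M, then round back to integers so the result can still be
    used to index a spectrum (as in s1_hat[b_tilde[i]] = s1[a_tilde[i]])."""
    n = len(path)
    old = np.linspace(0.0, 1.0, n)   # original path positions
    new = np.linspace(0.0, 1.0, M)   # M evenly spaced positions
    return np.rint(np.interp(new, old, path)).astype(int)
```

Because linear interpolation of a nondecreasing sequence is nondecreasing, the resampled path preserves the monotonic structure of the original warping path while giving every segment the same output length.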

ANALYSIS
After identifying the starting and ending points of the first 500 phrases of the CMU ARCTIC database (with recordings of US male and US female speakers), DW was applied to the spectral magnitude feature vectors extracted via the FFT from each 32-ms segment of speech, windowed using a Hamming window with 50% overlap. Temporal alignment (DTW) was done on the source spectral feature vectors with respect to the target signal. (The temporally aligned source spectral feature vectors were saved for later use with the NN.) DFW was applied to perform the spectral alignment and to find the sequences of path indices a and b. These indices were used to transform the source speaker s_1 spectrum to match the target speaker s_2. Different NN structures (varying the number of layers and the number of neurons) were tried for the voice transformation. After trying several architectures, we chose two hidden layers with 500 and 1000 neurons for the first and second layers, respectively. Neural network computations were performed using TensorFlow in Python.
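Since the exact TensorFlow setup is not given in the paper, the following is a framework-agnostic NumPy sketch of the chosen architecture: a 256-dimensional spectral input, hidden layers of 500 and 1000 neurons, and a regression output for the interpolated warping path. The activations, initialization, and output dimension are assumptions for illustration.

```python
import numpy as np

def init_mlp(rng, in_dim=256, h1=500, h2=1000, out_dim=256):
    """Initialize weights for the two-hidden-layer network described
    above (spectral input -> 500 -> 1000 -> warping-path output),
    using He-style scaling for the ReLU hidden layers."""
    dims = [in_dim, h1, h2, out_dim]
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Forward pass: ReLU on hidden layers, linear output
    (regression onto the interpolated warping-path values)."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers only
    return x
```

In the actual system, the parameters would be trained (e.g., by minimizing mean square error between predicted and DW-derived paths, consistent with the error reported in the results); this sketch shows only the layer structure.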

RESULTS
Figure 3 shows a typical spectrogram for voice transformation from a male speaker to a female speaker using the phrase "Author of the danger trail, Philip Steels, etc." from the ARCTIC database, using the training data (known male and female speaker information). Figure 6 shows the spectral feature information for one segment of female speech along with the time-aligned, spectrally warped segment for the male. Through the DFW, the peaks and valleys of the male spectrum are aligned to the locations of the peaks and valleys of the female spectrum. Figure 7(a) shows a spectrogram of the warped output produced by the neural network as a voice transformation from male to female. Acoustically, the transformed signal clearly resembles the previous result. Figure 7(b) shows the mean square error. Figure 8 shows how the NN has learned a warping path (the a warping data) at various segments of the speech signal after 4000 training iterations. The DW path is shown in blue and the NN-learned path in orange.

CONCLUSION
The DW method introduced here produces female-sounding speech from male speech, but with some signal processing artifacts. This is also the case when the warping information is provided by a trained NN.
Future work will focus on improving the mapping by finding and eliminating the causes of the signal processing artifacts, and on exploring different NN architectures. Another item of interest is whether this warping method can be used to produce different language accents.

Fig. 3. Spectrogram for Male, Female and Warped Male to Female

Fig. 4. Making DFW training data, and using this to train a neural network

Fig. 5. Normal and interpolated paths

Fig. 2. Inner/Outer Dynamic Warping