Article

CycleGAN-Based Singing/Humming to Instrument Conversion Technique

1
Department of Computer and Communication Engineering, National Kaohsiung University of Science and Technology, Kaohsiung 82445, Taiwan
2
Ph.D. Program in Engineering Science and Technology, College of Engineering, National Kaohsiung University of Science and Technology, Kaohsiung 82445, Taiwan
*
Author to whom correspondence should be addressed.
Electronics 2022, 11(11), 1724; https://doi.org/10.3390/electronics11111724
Submission received: 18 May 2022 / Revised: 23 May 2022 / Accepted: 23 May 2022 / Published: 30 May 2022
(This article belongs to the Special Issue Intelligent Signal Processing and Communication Systems)

Abstract

In this research, singing/humming-to-instrument conversion techniques are proposed. For humming-to-instrument conversion, two models based on cycle-consistent adversarial networks (CycleGAN) are evaluated on viola. The objective and subjective evaluations show that the converted audio is more similar to the viola than to the original humming and that the quality of the converted sound is fair to listeners. For singing-to-instrument conversion, to bridge the gap between singing and instrument sounds, a dual conversion model consisting of singing-to-humming and humming-to-instrument stages is proposed. The objective and subjective experimental results show that the dual conversion yields better converted audio quality than converting singing to instrument directly.

1. Introduction

Voice conversion (VC) is a technique for converting one voice into another voice with a different timbre under the condition of keeping linguistic information. In recent years, there have been many applications based on voice conversion, such as vocal conversion [1,2,3,4,5,6,7,8,9,10,11,12,13], singing voice conversion [14,15,16], emotion conversion [17,18], speech style conversion [19,20], conversion of whispers to normal voices [21,22], conversion of singing skills [23], voice correction [24], and so on. However, there are few studies on the conversion of human voices to musical instruments.
The structure and resonance principle of the human vocal tract are very similar to those of musical instruments. Musical instruments have resonance boxes to produce complex timbres; similarly, human beings use vocal cord vibration and the resonating cavities of the vocal tract, such as the nose, pharynx, mouth, and larynx, to produce sound. This allows humans to imitate timbres close to the sound of musical instruments, and it motivates the construction of a singing/humming-to-instrument conversion system.
The techniques of voice conversion can be divided into parallel and non-parallel methods. In a parallel voice conversion system, the linguistic contents of the source and target speech are the same, and such paired utterances are collected as training data. For example, Chen et al. [1] and Toda et al. [2] built parallel voice conversion systems based on the Gaussian mixture model. Others use neural networks [25,26] to build parallel voice conversion systems: Desai et al. [3] used an artificial neural network (ANN), Chen et al. [4] used a deep neural network (DNN), and Nakashika et al. [5] used a recurrent neural network (RNN).
On the other hand, non-parallel voice conversion is based on non-parallel source-target data and is therefore more challenging. However, it is also more practical, because a non-parallel voice conversion system does not rely on parallel data, which reduces the difficulty of data collection. Non-parallel voice conversion can be divided into feature-disentanglement and direct-transformation approaches. Feature-disentanglement methods extract features from the audio to be converted; the content of the input voice is preserved while the timbre is changed, so that the target speaker utters the same content as the source speaker. For example, Sun et al. [6] combined automatic speech recognition (ASR) with an RNN to design a non-parallel voice conversion system, Liu et al. [7] used a WaveNet vocoder for voice conversion, and Saito et al. [8] proposed non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors.
The other type of non-parallel voice conversion system is based on direct transformation, which does not need feature extraction. For example, Kaneko et al. combined a cycle-consistent adversarial network (CycleGAN) with a gated convolutional neural network (gated CNN) to design a non-parallel voice conversion system, CycleGAN-VC [9]. They also proposed CycleGAN-VC2 [10], an improved version of CycleGAN-VC with an improved objective, generator, and discriminator. Fang et al. [11] combined CycleGAN with a linear conversion of the fundamental frequency, and Wang et al. [12] added zero-centered gradient penalties to it. Aside from CycleGAN-based techniques, Serrà et al. [13] built Blow, a non-parallel raw-audio voice conversion system based on a normalizing flow.
In this research, we use a CycleGAN-based technique to build a non-parallel singing/humming-to-instrument conversion system. Two humming-to-viola conversion systems based on CycleGAN-VC and CycleGAN-VC2 are evaluated. In addition, to improve the naturalness of the converted audio in singing-to-viola conversion, a dual conversion model consisting of singing-to-humming and humming-to-viola stages is investigated. The proposed methods are compared with the conventional CycleGAN-VC and CycleGAN-VC2.
The conventional CycleGAN-VC and CycleGAN-VC2 are introduced in Section 2. The framework of the proposed methods is presented in Section 3. The training and testing datasets, the objective and subjective evaluation methods, and the evaluation results are given in Section 4. Finally, the conclusion and future perspectives are given in Section 5.

2. CycleGAN-VC and CycleGAN-VC2

The voice conversion systems CycleGAN-VC and CycleGAN-VC2 are based on CycleGAN [27] and learn an input-to-output mapping without relying on parallel data. Unlike a plain generative adversarial network (GAN), their architecture can be viewed as an auto-encoder. They use adversarial loss, cycle-consistency loss, and identity-mapping loss to learn the mapping from source to target, thereby achieving the goal of smoothly converting source acoustic features into target acoustic features.
In the training process of CycleGAN-VC, three types of losses, namely adversarial loss, cycle-consistency loss, and identity-mapping loss, are used to control the conversion process. Assume $X$ is the source and $Y$ is the target. In the forward-inverse and inverse-forward mappings, two generators and two discriminators are used. $G_{X \to Y}$ is the forward generator, which converts $X$ to $Y$; $G_{Y \to X}$ is the inverse generator, which converts $Y$ to $X$. $D_X$ and $D_Y$ are the discriminators of $X$ and $Y$, respectively. Both the forward-inverse and inverse-forward mappings contain an adversarial loss and a cycle-consistency loss, whose main task is to find the optimal pseudo pairs from the unpaired audio data. The forward and inverse mappings applied to the original $Y$ and $X$, on the other hand, use the identity-mapping loss to retain linguistic information.
CycleGAN-VC uses a one-dimensional CNN in the generator, while the discriminator is designed as a two-dimensional CNN to focus on spectral texture. Following the work of Johnson et al. [28], CycleGAN-VC adds down-sampling, residual, and up-sampling layers to the generator and discriminator, and a pixel shuffler is used for up-sampling. In addition, like CycleGAN, CycleGAN-VC uses instance normalization.
One way to deal with data with a temporal order, such as speech, is to use RNNs. However, RNNs are computationally demanding and hard to parallelize. Gated CNNs [29] can process sequential data in parallel and have been successfully used in language and speech modeling. Therefore, gated CNNs were adopted in CycleGAN-VC.
CycleGAN-VC2 is an improved version of CycleGAN-VC that adopts a two-step adversarial loss. To fix the over-smoothing problem caused by the cycle-consistency loss, additional discriminators were added, and an additional adversarial loss was imposed on the circularly converted features.
Unlike CycleGAN-VC, the CycleGAN-VC2 generator is not a purely one-dimensional CNN: it was upgraded from a 1D CNN to a 2-1-2D CNN to mitigate the degradation caused by down-sampling and up-sampling. In addition, CycleGAN-VC2 uses a 1 × 1 convolution before or after reshaping the feature map to rescale the channel dimension.
In CycleGAN-VC2, the discriminator was changed to PatchGAN [30]. PatchGAN uses a convolution in the last layer and determines authenticity on a per-patch basis, which allows the GAN to be trained stably while reducing the number of parameters.
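To make the discriminator design concrete, the following is a minimal PyTorch sketch of a PatchGAN-style discriminator with GLU activations in the spirit of the description above; the class name and layer sizes are illustrative assumptions rather than the exact CycleGAN-VC2 configuration.

```python
# PatchGAN-style discriminator sketch: the last layer is a convolution, so the
# output is a map of per-patch real/fake scores rather than a single scalar.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=3, stride=1, padding=1),
            nn.GLU(dim=1),                      # gated linear unit halves the channels
            nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(256),
            nn.GLU(dim=1),
            nn.Conv2d(128, 512, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(512),
            nn.GLU(dim=1),
            nn.Conv2d(256, 1, kernel_size=(1, 3), stride=1, padding=(0, 1)),
        )

    def forward(self, x):
        # x: (batch, 1, n_features, n_frames) -> (batch, 1, h', w') patch scores
        return self.net(x)

# usage: PatchDiscriminator()(torch.randn(2, 1, 35, 128)).shape -> (2, 1, 9, 32)
```

Judging real/fake per patch keeps the receptive field local, which is what stabilizes training while keeping the parameter count small.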

3. Proposed Methods

For singing/humming-to-instrument conversion, we present two methods based on CycleGAN-VC and CycleGAN-VC2, namely CycleGAN-IC [31] and CycleGAN-IC2, respectively, as well as a dual conversion model that concatenates singing-to-humming and humming-to-instrument conversion, namely CycleGAN-ICd.

3.1. CycleGAN-IC and CycleGAN-IC2

CycleGAN-IC and CycleGAN-IC2 are based on, and share the same generators and discriminators as, CycleGAN-VC and CycleGAN-VC2. Suppose we want to convert a source into a target and we have acoustic feature sequences $x \in \mathbb{R}^{Q \times T_x}$ and $y \in \mathbb{R}^{Q \times T_y}$ belonging to the audio source $X$ and target $Y$. $Q$ is the feature dimension, and $T_x$ and $T_y$ are the lengths of the source and target sequences, respectively. The data distributions are denoted as $x \sim P_X(x)$ and $y \sim P_Y(y)$. The objective is to learn the mapping $G_{X \to Y}$, which is a non-parallel conversion from $x \in X$ to $y \in Y$.
In CycleGAN-IC, the adversarial losses, cycle-consistency loss, and identity-mapping loss are defined as in Equations (1)–(4). Equations (1) and (2) are the adversarial losses, which aim to make $G_{X \to Y}(x)$ close to $y$ and $G_{Y \to X}(y)$ close to $x$.
$\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P_Y(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P_X(x)}[\log(1 - D_Y(G_{X \to Y}(x)))]$. (1)
$\mathcal{L}_{adv}(G_{Y \to X}, D_X) = \mathbb{E}_{x \sim P_X(x)}[\log D_X(x)] + \mathbb{E}_{y \sim P_Y(y)}[\log(1 - D_X(G_{Y \to X}(y)))]$. (2)
In Equation (1), the discriminator $D_Y$ tries not to be deceived by maximizing the loss, and $G_{X \to Y}$ tries to generate indistinguishable audio $G_{X \to Y}(x)$ by minimizing the loss. Likewise, in Equation (2), the discriminator $D_X$ tries not to be deceived by maximizing the loss, and $G_{Y \to X}$ tries to generate indistinguishable audio $G_{Y \to X}(y)$ by minimizing the loss.
Equation (3) is the cycle-consistency loss $\mathcal{L}_{cyc}$, where $G_{X \to Y}$ is the forward generator and $G_{Y \to X}$ is the inverse generator. The purpose of the cycle-consistency loss is to measure the difference between the input audio and the audio after the forward-inverse or inverse-forward mapping, and to further regularize the mapping.
$\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P_X(x)}[\| G_{Y \to X}(G_{X \to Y}(x)) - x \|_1] + \mathbb{E}_{y \sim P_Y(y)}[\| G_{X \to Y}(G_{Y \to X}(y)) - y \|_1]$. (3)
Equation (4) is the identity-mapping loss $\mathcal{L}_{id}$ of $G_{X \to Y}$ and $G_{Y \to X}$.
$\mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P_Y(y)}[\| G_{X \to Y}(y) - y \|_1] + \mathbb{E}_{x \sim P_X(x)}[\| G_{Y \to X}(x) - x \|_1]$. (4)
The identity-mapping loss is the sum of the difference between $y$ and $G_{X \to Y}(y)$ and the difference between $x$ and $G_{Y \to X}(x)$. It encourages the generators to preserve the composition between the input and output.
The full objective loss of CycleGAN-IC is shown in Equation (5) as follows.
$\mathcal{L}_{IC\text{-}full} = \mathcal{L}_{adv}(G_{X \to Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \to X}, D_X) + \lambda_{cyc} \mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) + \lambda_{id} \mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X})$. (5)
In Equation (5), the full loss of CycleGAN-IC, $\mathcal{L}_{IC\text{-}full}$, is the overall sum of the adversarial losses, the cycle-consistency loss, and the identity-mapping loss. The weights of the cycle-consistency loss $\mathcal{L}_{cyc}$ and the identity-mapping loss $\mathcal{L}_{id}$ are controlled by $\lambda_{cyc}$ and $\lambda_{id}$, respectively. CycleGAN-IC uses the adversarial loss once in each cycle, which is called a one-step adversarial loss. The optimum solution of the mappings $X$ to $Y$ and $Y$ to $X$ equals
$\arg \min_{G_{X \to Y}, G_{Y \to X}} \max_{D_X, D_Y} \mathcal{L}_{IC\text{-}full}$. (6)
On the other hand, CycleGAN-IC2 uses adversarial losses twice in each cycle, the two-step adversarial losses defined in Equations (7) and (8). $D_{X'}$ and $D_{Y'}$ are additional discriminators.
$\mathcal{L}_{adv2}(G_{X \to Y}, G_{Y \to X}, D_{X'}) = \mathbb{E}_{x \sim P_X(x)}[\log D_{X'}(x)] + \mathbb{E}_{x \sim P_X(x)}[\log(1 - D_{X'}(G_{Y \to X}(G_{X \to Y}(x))))]$. (7)
$\mathcal{L}_{adv2}(G_{Y \to X}, G_{X \to Y}, D_{Y'}) = \mathbb{E}_{y \sim P_Y(y)}[\log D_{Y'}(y)] + \mathbb{E}_{y \sim P_Y(y)}[\log(1 - D_{Y'}(G_{X \to Y}(G_{Y \to X}(y))))]$. (8)
Finally, the full objective loss of CycleGAN-IC2 is shown in Equation (9) as follows.
$\mathcal{L}_{IC2\text{-}full} = \mathcal{L}_{adv}(G_{X \to Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \to X}, D_X) + \mathcal{L}_{adv2}(G_{X \to Y}, G_{Y \to X}, D_{X'}) + \mathcal{L}_{adv2}(G_{Y \to X}, G_{X \to Y}, D_{Y'}) + \lambda_{cyc} \mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) + \lambda_{id} \mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X})$. (9)
In the conventional CycleGAN-VC and CycleGAN-VC2, $\lambda_{cyc} = 10$ and $\lambda_{id} = 5$, and $\lambda_{id}$ is set to 0 after 10,000 training steps. Singing/humming-to-instrument conversion is different from human voice conversion. The cycle-consistency loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to find $(x, y)$ pairs with the same contextual information, so if its weight is too large, the generators may fail to preserve the composition between the input and output. Therefore, unlike the original design of CycleGAN-VC and CycleGAN-VC2, which uses a larger weight for the cycle-consistency loss, in CycleGAN-IC and CycleGAN-IC2 the weight of the cycle-consistency loss is set lower than the weight of the identity loss, so that the identity loss dominates the loss convergence and the composition and timbre of the instrument can be preserved in the human voice to instrument conversion. In this research, we use $\lambda_{cyc} = 1$ and $\lambda_{id} = 5$ in CycleGAN-IC and CycleGAN-IC2, and the value of $\lambda_{id}$ does not change during training.
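As an illustration, the following is a minimal sketch of the CycleGAN-IC loss terms of Equations (1)–(5); the callables g_xy, g_yx, d_x, and d_y are hypothetical placeholders for the generators and discriminators, and the discriminators are assumed to output probabilities in (0, 1).

```python
# Sketch of the CycleGAN-IC objective: adversarial, cycle-consistency, and
# identity-mapping terms weighted as in the text (lambda_cyc = 1, lambda_id = 5).
import torch

LAMBDA_CYC = 1.0   # weight of the cycle-consistency loss
LAMBDA_ID = 5.0    # weight of the identity-mapping loss

def cyclegan_ic_full_loss(x, y, g_xy, g_yx, d_x, d_y, eps=1e-8):
    fake_y, fake_x = g_xy(x), g_yx(y)

    # Equations (1) and (2): adversarial losses
    adv_xy = torch.mean(torch.log(d_y(y) + eps)) + \
             torch.mean(torch.log(1.0 - d_y(fake_y) + eps))
    adv_yx = torch.mean(torch.log(d_x(x) + eps)) + \
             torch.mean(torch.log(1.0 - d_x(fake_x) + eps))

    # Equation (3): cycle-consistency loss (L1 after a full forward-inverse cycle)
    cyc = torch.mean(torch.abs(g_yx(fake_y) - x)) + \
          torch.mean(torch.abs(g_xy(fake_x) - y))

    # Equation (4): identity-mapping loss (each generator fed its own target domain)
    idt = torch.mean(torch.abs(g_xy(y) - y)) + \
          torch.mean(torch.abs(g_yx(x) - x))

    # Equation (5): full objective; per Equation (6), the generators minimize it
    # while the discriminators maximize the adversarial terms.
    return adv_xy + adv_yx + LAMBDA_CYC * cyc + LAMBDA_ID * idt
```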
RNNs are generally used for sequential signals such as speech, but they are computationally demanding. Considering that singing, humming, and instrument sounds are also sequential, gated CNNs containing the gated linear unit (GLU) activation are also used in CycleGAN-IC and CycleGAN-IC2. Assume $O_l$ is the output of the $l$-th layer, $U_l$ and $V_l$ are learned convolution parameters, and $b_l$ and $c_l$ are learned biases; then
$O_{l+1} = (O_l * U_l + b_l) \otimes \sigma(O_l * V_l + c_l)$, (10)
where $*$ denotes convolution, $\otimes$ is the element-wise product, and $\sigma$ is the sigmoid function. The element-wise product, also called the Hadamard product, of $m \times n$ matrices $A$ and $B$ is defined as $[A \otimes B]_{ij} = [A]_{ij}[B]_{ij}$ for $1 \le i \le m$, $1 \le j \le n$. The sigmoid function is formulated as $\sigma(x) = 1/(1 + e^{-x})$.
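A small sketch of the gated convolution of Equation (10) follows, assuming 1D convolutional layers play the roles of the learned parameters $U_l$ and $V_l$ (with biases $b_l$ and $c_l$); the channel sizes in the usage comment are illustrative only.

```python
# Gated linear unit over a 1D convolution: Equation (10).
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_u = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)  # O_l * U_l + b_l
        self.conv_v = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)  # O_l * V_l + c_l

    def forward(self, o_l):
        # element-wise product of the linear path and the sigmoid gate
        return self.conv_u(o_l) * torch.sigmoid(self.conv_v(o_l))

# usage: GatedConv1d(256, 512)(torch.randn(2, 256, 128)).shape -> (2, 512, 128)
```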

3.2. Dual Conversion Model

Since singing contains lyrics in addition to melody, it is very different from the sound of the viola. As a result, in direct singing-to-instrument conversion, the converted audio retains features of singing and fails to resemble the instrument sound. To fix this problem, a dual conversion model, CycleGAN-ICd, is proposed; it is shown in Figure 1. We propose converting human singing into human humming first and then converting the humming into a musical instrument sound. Since humming does not contain lyrics, it is more similar to the instrument sound. The dual conversion gives better converted quality than direct conversion from singing to instrument.
In addition to the generators $G_{X \to Y}$ and $G_{Y \to X}$ and the discriminators $D_X$ and $D_Y$ of CycleGAN-IC2, CycleGAN-ICd adds one generator $G_{Y \to Z}$ and one discriminator $D_Z$. The architecture of the generators and discriminators of CycleGAN-ICd is the same as that of CycleGAN-IC2. As shown in the left part of Figure 1, singing is first converted to humming, with the full objective loss of Equation (9). The weights of the cycle-consistency loss $\mathcal{L}_{cyc}$ and the identity-mapping loss $\mathcal{L}_{id}$ are controlled by $\lambda_{cyc}$ and $\lambda_{id}$, respectively; in this research, we used $\lambda_{cyc} = 1$ and $\lambda_{id} = 5$, and the value of $\lambda_{id}$ remains unchanged during training. After the singing-to-humming conversion, humming-to-instrument conversion is processed. As shown in the right part of Figure 1, both the original and the converted humming are input to the generator $G_{Y \to Z}$ to create converted viola that fools the discriminator $D_Z$. Two additional adversarial losses, $\mathcal{L}_{adv3}$ and $\mathcal{L}_{adv3c}$, relating to the original and converted humming, and an identity-mapping loss $\mathcal{L}_{id3}$ are added as in Equations (11)–(13). The full objective loss of the humming-to-instrument conversion in the dual model is the sum of $\mathcal{L}_{adv3}$, $\mathcal{L}_{adv3c}$, and $\mathcal{L}_{id3}$.
$\mathcal{L}_{adv3}(G_{Y \to Z}, D_Z) = \mathbb{E}_{z \sim P_{Data}(z)}[\log D_Z(z)] + \mathbb{E}_{y \sim P_{Data}(y)}[\log(1 - D_Z(G_{Y \to Z}(y)))]$, (11)
$\mathcal{L}_{adv3c}(G_{Y \to Z}, D_Z) = \mathbb{E}_{z \sim P_{Data}(z)}[\log D_Z(z)] + \mathbb{E}_{x, y \sim P_{Data}(x, y)}[\log(1 - D_Z(G_{Y \to Z}(G_{X \to Y}(x))))]$, (12)
$\mathcal{L}_{id3}(G_{Y \to Z}) = \mathbb{E}_{z \sim P_{Data}(z)}[\| G_{Y \to Z}(z) - z \|_1]$. (13)
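The following sketch expresses Equations (11)–(13) with hypothetical callables: g_xy converts singing to humming, g_yz converts humming to viola, and d_z is the viola discriminator, again assumed to output probabilities.

```python
# Losses of the humming-to-instrument stage of the dual model (CycleGAN-ICd).
import torch

def cyclegan_icd_stage2_loss(x_sing, y_hum, z_viola, g_xy, g_yz, d_z, eps=1e-8):
    # Equation (11): adversarial loss for original humming converted to viola
    adv3 = torch.mean(torch.log(d_z(z_viola) + eps)) + \
           torch.mean(torch.log(1.0 - d_z(g_yz(y_hum)) + eps))

    # Equation (12): adversarial loss for humming converted from singing
    adv3c = torch.mean(torch.log(d_z(z_viola) + eps)) + \
            torch.mean(torch.log(1.0 - d_z(g_yz(g_xy(x_sing))) + eps))

    # Equation (13): identity-mapping loss on the viola domain
    id3 = torch.mean(torch.abs(g_yz(z_viola) - z_viola))

    # full objective of this stage: the sum of the three terms
    return adv3 + adv3c + id3
```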

3.3. Theoretical Analysis of Convergence and Complexity

Game theory studies the predicted and actual strategic behavior of rational agents in a game and builds mathematical models for finding optimal strategies. Each player aims to maximize his or her utility by choosing the action that maximizes the payoff, taking the other players' actions into account. Among the various game types, zero-sum games, such as poker, Go, and chess, attract a lot of attention. Zero-sum games are games in which the players' decisions cannot change the sum of the players' utilities; the total benefit of all the players is always zero, for every strategy profile. In such games, how to find the optimal equilibrium among multiple equilibria and avoid undesirable equilibria is a critical issue in game theory.
For GAN-based technology such as the proposed methods, the generator and the discriminator are two players playing against each other in a repeated zero-sum game. The generator model is parameterized by $\phi$, and the discriminator is parameterized by $\theta$. In this study, deep neural networks are used for the generator model G and the discriminator model D. The cost functions of the proposed CycleGAN-IC, CycleGAN-IC2, and CycleGAN-ICd are given by Equation (5), Equation (9), and the sum of Equations (11)–(13), respectively.
The alternating gradient updates (AGD) procedure is generally used to reach an equilibrium in GAN-based problems. Convergence can be guaranteed under the assumption that one player, the discriminator, plays optimally at each step, as in the Wasserstein GAN [32], which continuously estimates the Earth-Mover (EM) distance (or Wasserstein-1 distance) by optimizing the discriminator. The EM distance is defined as
$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}[\| x - y \|]$, (14)
where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ is the set of all joint distributions $\gamma(x, y)$ whose marginals are the real distribution $\mathbb{P}_r$ and the generated distribution $\mathbb{P}_g$. The EM distance is continuous and differentiable, so the Wasserstein GAN critic can be trained to optimality. However, the assumption is strong and unrealistic, and the Wasserstein GAN becomes unstable at times.
Convergence can also be analyzed by treating GAN training dynamics as a repeated game in which both players use no-regret algorithms [33]. Game theory then yields a proof of asymptotic convergence for the convex-concave case, as follows. Sion's minimax theorem [33] states that if $\Phi \subseteq \mathbb{R}^m$ and $\Theta \subseteq \mathbb{R}^n$ are compact and convex, and the function $J: \Phi \times \Theta \to \mathbb{R}$ is convex in its first argument and concave in its second, then
$\min_{\phi \in \Phi} \max_{\theta \in \Theta} J(\phi, \theta) = \max_{\theta \in \Theta} \min_{\phi \in \Phi} J(\phi, \theta)$, (15)
and an equilibrium exists. Sion's minimax theorem can be proven [34] using Helly's theorem, a statement in combinatorial geometry about the intersections of convex sets, and the KKM theorem of Knaster, Kuratowski, and Mazurkiewicz, a result in mathematical fixed-point theory. Therefore, if the generator and discriminator parameter sets are compact and convex, and the cost function is convex in the first argument and concave in the second, an equilibrium exists according to Sion's minimax theorem. Such an equilibrium can be reached, in game-theoretic terms, if both players update their parameters with no-regret algorithms. The definition of a no-regret algorithm is as follows [33]. Given a sequence of convex loss functions $L_1, L_2, \ldots : K \to \mathbb{R}$, an algorithm that selects a sequence of points $k_t$, each depending only on $L_1, \ldots, L_{t-1}$, is no-regret if $R(T)/T = o(1)$, where
$R(T) \triangleq \sum_{t=1}^{T} L_t(k_t) - \min_{k \in K} \sum_{t=1}^{T} L_t(k)$. (16)
However, this convergence result does not hold in the practical non-convex case, which is the general case for deep learning. In practice, the generator and discriminator are deep neural networks, and the cost function is not necessarily convex-concave. In non-convex games, AGD can keep cycling or converge to a local equilibrium. The notion of local regret [35] was introduced to show that the game converges to a local equilibrium in the non-convex case if a smoothed variant of online gradient descent (OGD) is used and mild assumptions hold. The $\omega$-local regret of an online algorithm, for some fixed $\eta > 0$, is defined as
$\mathfrak{R}_{\omega}(T) \stackrel{\mathrm{def}}{=} \sum_{t=1}^{T} \| \nabla_{K, \eta} F_{t, \omega}(x_t) \|^2$, (17)
where $\omega$ is the window size, $t$ is the iteration index, $T$ is the total number of rounds, and $K$ is a convex set. From [35], for any $T \ge 1$, $1 \le \omega \le T$, and $\eta \le 1$, there exists a distribution $D$ on 0-smooth, 1-bounded cost functions on $K = [-1, 1]$ such that, for any online algorithm,
$\mathbb{E}_{D}[\mathfrak{R}_{\omega}(T)] \ge \frac{1}{4\omega} \cdot \frac{T}{2\omega}$. (18)
Hence, in online learning, time smoothing captures non-convex optimization. The convergence of the proposed methods is further illustrated by the convergence curves observed in the experiments in Section 4.
The computational complexities of CycleGAN-IC and CycleGAN-IC2 are the same as CycleGAN-VC and CycleGAN-VC2 because CycleGAN-IC and CycleGAN-IC2 share the same architecture of the generators and discriminators with CycleGAN-VC and CycleGAN-VC2. The computational complexity of CycleGAN-ICd is 1.5 times that of CycleGAN-IC2 because CycleGAN-ICd has three generators and three discriminators and shares the same architecture of the generators and discriminators with CycleGAN-IC2, while CycleGAN-IC2 has two generators and two discriminators. FLOPs (floating-point operations) [36] of the generators and discriminators are used for computational complexity analysis.
For each convolutional kernel, FLOPs equal
$2HW(C_{in}K^{2} + 1)C_{out}$. (19)
$H$, $W$, $C_{in}$, $K$, and $C_{out}$ are the height and width of the output feature map, the number of input channels, the kernel width, and the number of output channels, respectively. The parameters of the generator and discriminator architectures are presented in Table 1. For the detailed operations in the table, please refer to [10].
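For reference, a small helper that evaluates Equation (19) for a square K × K kernel is shown below; the example layer and feature-map sizes are hypothetical and are not the measured complexity of the proposed models.

```python
# FLOPs of one convolution layer: 2 * H * W * (C_in * K^2 + 1) * C_out.
def conv_flops(h, w, c_in, k, c_out):
    return 2 * h * w * (c_in * k * k + 1) * c_out

# e.g., a 5 x 5 convolution with 128 input and 256 output channels producing an
# 18 x 64 output feature map costs about 1.9 GFLOPs:
print(conv_flops(18, 64, 128, 5, 256))
```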

4. Experiment

Two experiments, including humming to viola and singing to viola, were conducted. For the first experiment, CycleGAN-IC and CycleGAN-IC2 were applied and compared with the conventional CycleGAN-VC and CycleGAN-VC2. For the second experiment, CycleGAN-ICd was used. Objective and subjective measures were evaluated on the converted results.

4.1. Corpus

The MIR-QBSH database [37] was used to perform the experiments on singing/humming-to-viola conversion. MIR-QBSH collects MIDI (Musical Instrument Digital Interface) files and vocals, including humming and singing. It contains 48 MIDI songs ranging from 11 s to 201 s and 4431 8-s clips of humming and singing. Four subsets, MIR-QBSH-1, MIR-QBSH-2, MIR-QBSH-3, and MIR-QBSH-4, recorded by four male singers, were used in our experiments. The first singer recorded 17 humming and 5 singing 8-s clips, with a total length of 2 min and 56 s; the second singer recorded 20 singing 8-s clips, with a total length of 2 min and 40 s; the third singer recorded 20 humming 8-s clips, with a total length of 2 min and 40 s; and the fourth singer recorded 20 singing 8-s clips, with a total length of 2 min and 40 s. In the humming-to-viola experiment, MIR-QBSH-1 was used as the training set and MIR-QBSH-3 as the testing set. In the singing-to-viola experiment, for CycleGAN-IC2, MIR-QBSH-2 was used as the training set and MIR-QBSH-4 as the testing set; for CycleGAN-ICd, both MIR-QBSH-1 and MIR-QBSH-2 were used as training sets and MIR-QBSH-4 as the testing set. The corresponding viola audio was created from the MIDI files by the SONAR synthesizer [38] developed by Cakewalk. The sampling rate of all the audio was 16 kHz.

4.2. Objective and Subjective Measures

In this research, both objective and subjective measures were applied. The Root-Mean-Square Error (RMSE) of the Mel-Frequency Cepstral Coefficients (MFCCs) and the Mel-Cepstral Distortion (MCD) were used as objective measures to compare the converted audio with the original viola and with the original humming or singing. In calculating the RMSE of the MFCCs, the number of MFCCs was 34, the analysis frame length was 1024 samples, and the frame shift was 256 samples. Since the RMSE must be calculated over sequences of the same length, the original viola was cut to fit the length of the converted audio. The smaller the RMSE value, the more similar the converted audio is to the original audio.
MCD was calculated with dynamic time warping (DTW), which aligns the two signals in time before computing the difference. The smaller the MCD value, the more similar the converted audio is to the original audio. In the MCD calculation, the all-pass constant was set to 0.35, the Fast Fourier Transform (FFT) size was 512, and the mel-cepstrum order was 34.
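A rough sketch of these two objective measures, assuming librosa is available, is given below; using MFCCs in place of true mel-cepstra for the MCD and the standard 10√2/ln 10 scaling are simplifying assumptions, so this approximates the procedure rather than reproducing the exact toolchain.

```python
# MFCC-based RMSE and a DTW-aligned mel-cepstral distortion approximation.
import numpy as np
import librosa

def mfcc_features(path, n_mfcc=34, n_fft=1024, hop_length=256, sr=16000):
    wav, _ = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)

def rmse(a, b):
    # truncate both feature matrices to the same number of frames, as in the text
    n = min(a.shape[1], b.shape[1])
    return np.sqrt(np.mean((a[:, :n] - b[:, :n]) ** 2))

def mcd(a, b):
    # align the two feature sequences with DTW before measuring the distortion
    _, wp = librosa.sequence.dtw(X=a, Y=b, metric='euclidean')
    diff = a[:, wp[:, 0]] - b[:, wp[:, 1]]
    return (10.0 / np.log(10)) * np.sqrt(2.0) * np.mean(
        np.sqrt(np.sum(diff ** 2, axis=0)))

# usage (hypothetical file names):
# conv, ref = mfcc_features("converted_viola.wav"), mfcc_features("original_viola.wav")
# print(rmse(conv, ref), mcd(conv, ref))
```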
In addition to the objective measures, the mean opinion score (MOS) [39] and the comparison mean opinion score (CMOS) [39] were used as subjective measures. The MOS scores range from 1 to 5, meaning 'Bad', 'Poor', 'Fair', 'Good', and 'Excellent'. CMOS compares two sounds with scores from −3 (much worse) to 3 (much better). To simplify the evaluation process, in this research we used three score levels: −1 (worse), 0 (equal), and 1 (better).

4.3. Experimental Results of Humming to Viola

In this subsection, we present the objective and subjective experiment results of humming-to-viola conversion by CycleGAN-IC and CycleGAN-IC2, and compare them with the conventional CycleGAN-VC and CycleGAN-VC2. MIR-QBSH-1 and MIR-QBSH-3 were used in the experiment of humming to viola as training and testing sets. The convergence of the generator and discriminator losses of CycleGAN-VC, CycleGAN-VC2, CycleGAN-IC, and CycleGAN-IC2 are shown in Figure 2.
The RMSE and MCD between the converted and original viola in the testing set are shown in the box plots of Figure 3. The average RMSEs and MCDs between the converted and original viola are shown in Table 2 and Table 3, respectively, along with the average RMSEs and MCDs between the converted viola and the original humming. For CycleGAN-IC, CycleGAN-IC2, CycleGAN-VC, and CycleGAN-VC2, the RMSE and MCD with respect to the original viola were smaller than those with respect to the original humming, meaning the converted viola was more similar to the original viola than to the original humming; the conversion therefore goes in the right direction. From Table 2 and Table 3, the average RMSE and MCD of CycleGAN-IC2 were the smallest among the compared methods. Hence, the audio converted by CycleGAN-IC2 was the most similar to the original viola.
In addition to the objective evaluation, MOS and CMOS subjective evaluations were performed. For each humming-to-viola method, ten converted viola clips were rated by 10 listeners, four men and six women, three of whom had music backgrounds. The listeners were asked to rate the audio based on its clarity, timbre, and similarity to a viola.
The percentage distributions of the MOS scores of CycleGAN-VC, CycleGAN-VC2, CycleGAN-IC, and CycleGAN-IC2 are presented in Table 4. From Table 4, the viola converted by CycleGAN-IC2 obtained the best scores, CycleGAN-VC2 and CycleGAN-IC were the second and third best, respectively, and CycleGAN-VC was the worst. The converted viola from CycleGAN-VC, CycleGAN-VC2, CycleGAN-IC, and CycleGAN-IC2 received average MOS scores of 2.42, 2.77, 2.54, and 3.01, respectively.
The voting results of the CMOS are shown in Table 5. In each pair comparison, the listeners voted for the preferred audio. For example, in the CycleGAN-IC vs. CycleGAN-IC2 pair, 13% preferred CycleGAN-IC, 60% preferred CycleGAN-IC2, and 27% voted equal. From the table, the viola converted by CycleGAN-IC2 was the most preferred, the viola converted by CycleGAN-VC2 was the second most preferred, and the violas converted by CycleGAN-IC and CycleGAN-VC ranked third and last, respectively. This is consistent with the RMSE, MCD, and MOS results. From the RMSE, MCD, MOS, and CMOS results, the conversion from humming to viola by CycleGAN-IC2 was successful.
MIR-QBSH-1 includes both singing and humming clips and was used as the training set in this experiment because, initially, we wished to train a single model that could convert both singing and humming to viola. However, the quality of the viola converted from singing by CycleGAN-VC, CycleGAN-VC2, CycleGAN-IC, and CycleGAN-IC2 was not satisfactory. We therefore proposed CycleGAN-ICd and performed a singing-to-viola experiment, described in the next subsection.

4.4. Experimental Results of Singing to Viola

In this section, we use CycleGAN-ICd and CycleGAN-IC2 for singing-to-viola conversion. Since CycleGAN-ICd uses dual conversion of singing to humming and humming to viola, both singing and humming data are applied. The objective and subjective evaluations are performed and compared.
MIR-QBSH-4, including twenty singing clips, was used in the RMSE and MCD evaluation. The box plots of the RMSE and MCD of the converted and original viola are presented in Figure 4. The average RMSEs and MCDs are shown in Table 6 and Table 7. From Figure 4, Table 6 and Table 7, CycleGAN-ICd performed better than CycleGAN-IC2 in objective evaluation.
In addition to the objective evaluation, subjective evaluations were also performed. Ten listeners and ten converted audio clips were used in the MOS and CMOS tests. From Table 8, the percentage distribution of the MOS, the quality of the audio converted by CycleGAN-IC2 and CycleGAN-ICd was considered Fair by most of the voters. The audio converted by CycleGAN-IC2 received an average MOS score of 2.97, while that converted by CycleGAN-ICd received an average MOS score of 3.12. The conversion quality of CycleGAN-ICd was thus better than that of CycleGAN-IC2.
In addition to MOS, the CMOS results are shown in Table 9. In CycleGAN-IC2 vs. CycleGAN-ICd pair, 23% preferred CycleGAN-IC2, 56% preferred CycleGAN-ICd, and 21% voted equal. From the MOS and CMOS results, listeners confirmed that the converted audio by CycleGAN-ICd sounded closer to the original viola than CycleGAN-IC2. Therefore, the dual conversion by singing to humming and humming to viola had subjectively better converted audio quality than conversion by singing to viola directly.

5. Conclusions

In this research, we built CycleGAN-based singing/humming-to-instrument conversion techniques. CycleGAN-IC and CycleGAN-IC2 are based on CycleGAN-VC and CycleGAN-VC2. In the humming-to-viola experiment, the objective and subjective evaluations show that CycleGAN-IC2 performs better than CycleGAN-IC and the conventional CycleGAN-VC and CycleGAN-VC2; the converted audio is closer to the viola than to the humming, and the quality of the converted sound is fair to listeners.
Moreover, we proposed the dual model CycleGAN-ICd, which accomplishes singing-to-instrument conversion by converting singing to humming and then humming to instrument. The objective and subjective tests of singing to viola show that CycleGAN-ICd outperformed CycleGAN-IC2, and the converted sound received an average MOS score of 3.12.
In the future, in addition to further improving the conversion quality, we will apply this research to other musical instrument conversions.

Author Contributions

Conceptualization, W.-H.L. and Z.-Y.X.; methodology, W.-H.L. and Z.-Y.X.; software, Z.-Y.X. and S.-L.W.; validation, W.-H.L. and S.-L.W.; formal analysis, W.-H.L. and S.-L.W.; investigation, W.-H.L. and S.-L.W.; resources, W.-H.L.; data curation, W.-H.L. and S.-L.W.; writing—original draft preparation, W.-H.L. and S.-L.W.; writing—review and editing, W.-H.L. and S.-L.W.; visualization, Z.-Y.X.; supervision, W.-H.L.; project administration, W.-H.L.; funding acquisition, W.-H.L. All authors have read and agreed to the published version of the manuscript.

Funding

Part of this work was supported by the Ministry of Science and Technology, Taiwan, under Contract MOST 110-2221-E-992-078-.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset MIR-QBSH used during the current study is available in the repository, http://mirlab.org/dataset/public/, accessed on 29 April 2015.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, Y.; Chu, M.; Chang, E.; Liu, J.; Liu, R. Voice Conversion with Smoothed GMM and MAP Adaptation. In Proceedings of the European Conference on Speech Communication and Technology, Geneva, Switzerland, 1–4 September 2003; pp. 2413–2416. [Google Scholar]
  2. Toda, T.; Black, A.W.; Tokuda, K. Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2222–2235. [Google Scholar] [CrossRef]
  3. Desai, S.; Black, A.W.; Yegnanarayana, B.; Prahallad, K. Spectral Mapping Using Artificial Neural Networks for Voice Conversion. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 954–964. [Google Scholar] [CrossRef]
  4. Chen, L.-H.; Ling, Z.-H.; Liu, L.-J.; Dai, L.-R. Voice Conversion Using Deep Neural Networks With Layer-Wise Generative Training. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1859–1872. [Google Scholar] [CrossRef]
  5. Nakashika, T.; Takiguchi, T.; Ariki, Y. Voice Conversion Using RNN Pre-Trained by Recurrent Temporal Restricted Boltzmann Machines. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 580–587. [Google Scholar] [CrossRef]
  6. Sun, L.; Li, K.; Wang, H.; Kang, S.; Meng, H. Phonetic Posteriorgrams for Many-to-One Voice Conversion without Parallel Data Training. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar]
  7. Liu, L.-J.; Ling, Z.-H.; Jiang, Y.; Zhou, M.; Dai, L.-R. WaveNet Vocoder with Limited Training Data for Voice Conversion. In Proceedings of the Interspeech 2018, ISCA, Hyderabad, India, 2–6 September 2018; pp. 1983–1987. [Google Scholar]
  8. Saito, Y.; Ijima, Y.; Nishida, K.; Takamichi, S. Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5274–5278. [Google Scholar]
  9. Kaneko, T.; Kameoka, H. CycleGAN-VC: Non-Parallel Voice Conversion Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 2100–2104. [Google Scholar]
  10. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC2: Improved CycleGAN-Based Non-Parallel Voice Conversion. arXiv 2019, arXiv:1904.04631. [Google Scholar]
  11. Fang, F.; Yamagishi, J.; Echizen, I.; Lorenzo-Trueba, J. High-Quality Nonparallel Voice Conversion Based on Cycle-Consistent Adversarial Network. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5279–5283. [Google Scholar]
  12. Wang, C.; Yu, Y.-B. CycleGAN-VC-GP: Improved CycleGAN-Based Non-Parallel Voice Conversion. In Proceedings of the 2020 IEEE 20th International Conference on Communication Technology (ICCT), Nanning, China, 28–31 October 2020; pp. 1281–1284. [Google Scholar]
  13. Serrà, J.; Pascual, S.; Segura, C. Blow: A Single-Scale Hyperconditioned Flow for Non-Parallel Raw-Audio Voice Conversion. arXiv 2019, arXiv:1906.00794. [Google Scholar]
  14. Deng, C.; Yu, C.; Lu, H.; Weng, C.; Yu, D. Pitchnet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7749–7753. [Google Scholar]
  15. Nachmani, E.; Wolf, L. Unsupervised Singing Voice Conversion. arXiv 2019, arXiv:1904.06590. [Google Scholar]
  16. Lu, J.; Zhou, K.; Sisman, B.; Li, H. VAW-GAN for Singing Voice Conversion with Non-Parallel Training Data. arXiv 2020, arXiv:2008.03992. [Google Scholar]
  17. Zhou, K.; Sisman, B.; Liu, R.; Li, H. Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 920–924. [Google Scholar]
  18. Gao, J.; Chakraborty, D.; Tembine, H.; Olaleye, O. Nonparallel Emotional Speech Conversion. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2858–2862. [Google Scholar] [CrossRef]
  19. AlBadawy, E.A.; Lyu, S. Voice Conversion Using Speech-to-Speech Neuro-Style Transfer. In Proceedings of the Interspeech 2020, ISCA, Shanghai, China, 25–29 October 2020; pp. 4726–4730. [Google Scholar]
  20. Seshadri, S.; Juvela, L.; Räsänen, O.; Alku, P. Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning. IEEE Access 2019, 7, 17230–17246. [Google Scholar] [CrossRef]
  21. Lian, H.; Hu, Y.; Yu, W.; Zhou, J.; Zheng, W. Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention. IEEE Access 2019, 7, 130495–130504. [Google Scholar] [CrossRef]
  22. Lian, H.; Hu, Y.; Zhou, J.; Wang, H.; Tao, L. Whisper to Normal Speech Based on Deep Neural Networks with MCC and F0 Features. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), Shanghai, China, 19–21 November 2018; pp. 1–5. [Google Scholar]
  23. O’Connor, B.; Dixon, S.; Fazekas, G. Zero-Shot Singing Technique Conversion. arXiv 2021, arXiv:2111.08839. [Google Scholar]
  24. Biadsy, F.; Weiss, R.J.; Moreno, P.J.; Kanevsky, D.; Jia, Y. Parrotron: An End-to-End Speech-to-Speech Conversion Model and Its Applications to Hearing-Impaired Speech and Speech Separation. arXiv 2019, arXiv:1904.04169. [Google Scholar]
  25. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning. MIT Press. 2016. Available online: http://www.deeplearningbook.org (accessed on 4 May 2022).
  26. Haykin, S. Neural Networks and Learning Machines; Pearson Prentice Hall: Hoboken, NJ, USA, 2009. [Google Scholar]
  27. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. arXiv 2018, arXiv:1703.10593. [Google Scholar]
  28. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv 2016, arXiv:1603.08155. [Google Scholar]
  29. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. arXiv 2017, arXiv:1612.08083. [Google Scholar]
  30. Li, C.; Wand, M. Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks. arXiv 2016, arXiv:1604.04382. [Google Scholar]
  31. Lai, W.-H.; Wang, S.-L.; Xu, Z.-Y. Humming-to-Instrument Conversion Based on CycleGAN. In Proceedings of the 2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Hualien City, Taiwan, 16–19 November 2021; pp. 1–2. [Google Scholar]
  32. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875. [Google Scholar]
  33. Kodali, N.; Abernethy, J.; Hays, J.; Kira, Z. On Convergence and Stability of GANs. arXiv 2017, arXiv:1705.07215. [Google Scholar]
  34. Kindler, J. A Simple Proof of Sion’s Minimax Theorem. Am. Math. Mon. 2005, 112, 356–358. [Google Scholar] [CrossRef]
  35. Hazan, E.; Singh, K.; Zhang, C. Efficient regret minimization in non-convex games. arXiv 2017, arXiv:1708.00075. [Google Scholar]
  36. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  37. Jang, J.-S.R. MIR-QBSH Corpus. Available online: http://mirlab.org/dataset/public/MIR-QBSH-corpus.rar (accessed on 29 April 2015).
  38. Cakewalk Inc. Cakewalk—SONAR Family—SONAR Platinum, SONAR Studio and SONAR Artist. Available online: https://www.cakewalk.com/products/SONAR (accessed on 12 June 2021).
  39. P.800: Methods for Subjective Determination of Transmission Quality. Available online: https://www.itu.int/rec/T-REC-P.800-199608-I (accessed on 9 January 2021).
Figure 1. Dual conversion model for singing to instrument.
Figure 2. The convergence of the generator (left) and discriminator (right) losses of (a) CycleGAN-VC, (b) CycleGAN-VC2, (c) CycleGAN-IC, and (d) CycleGAN-IC2.
Figure 3. The box plots of (a) RMSE and (b) MCD for testing set MIR-QBSH-3 in humming to viola.
Figure 4. The box plots of (a) RMSE and (b) MCD for testing set MIR-QBSH-4 in singing to viola.
Table 1. (a) The generator architecture. (b) The discriminator architecture.

(a)

State                 | Amount | Operation      | Architecture
Input                 |        |                | height = 35, width = length T, number of channels = 1
                      |        | Conv           | kernel size = 5 × 15, number of channels = 128, stride size = 1 × 1
                      |        | GLU            | -
down-sample (2D)      | 1      | Conv           | kernel size = 5 × 5, number of channels = 256, stride size = 2 × 2
                      |        | Instance norm  | -
                      |        | GLU            | -
                      |        | Conv           | kernel size = 5 × 5, number of channels = 512, stride size = 2 × 2
                      |        | Instance norm  | -
                      |        | GLU            | -
2D → 1D               | 1      | Reshape        | height = 1, width = length T/4, number of channels = 2304
                      |        | 1 × 1 Conv     | kernel size = 1 × 1, number of channels = 256, stride size = 1 × 1
                      |        | Instance norm  | -
residual blocks (1D)  | 6      | Conv           | kernel size = 1 × 3, number of channels = 512, stride size = 1 × 1
                      |        | Instance norm  | -
                      |        | GLU            | -
                      |        | Conv           | kernel size = 1 × 3, number of channels = 256, stride size = 1 × 1
                      |        | Instance norm  | -
                      |        | Sum            | -
1D → 2D               | 1      | 1 × 1 Conv     | kernel size = 1 × 1, number of channels = 2304, stride size = 1 × 1
                      |        | Instance norm  | -
                      |        | Reshape        | height = 9, width = length T/4, number of channels = 256
up-sample (2D)        | 1      | Conv           | kernel size = 5 × 5, number of channels = 1024, stride size = 1 × 1
                      |        | Pixel Shuffler | -
                      |        | Instance norm  | -
                      |        | GLU            | -
                      |        | Conv           | kernel size = 5 × 5, number of channels = 512, stride size = 1 × 1
                      |        | Pixel Shuffler | -
                      |        | Instance norm  | -
                      |        | GLU            | -
                      |        | Conv           | kernel size = 5 × 15, number of channels = 35, stride size = 1 × 1
Output                |        |                | height = 35, width = length T, number of channels = 1

(b)

State                 | Amount | Operation      | Architecture
Input                 |        |                | height = 35, width = 128, number of channels = 1
                      |        | Conv           | kernel size = 3 × 3, number of channels = 128, stride size = 1 × 1
                      |        | GLU            | -
down-sample (2D)      | 1      | Conv           | kernel size = 3 × 3, number of channels = 256, stride size = 2 × 2
                      |        | Instance norm  | -
                      |        | GLU            | -
                      |        | Conv           | kernel size = 3 × 3, number of channels = 512, stride size = 2 × 2
                      |        | Instance norm  | -
                      |        | GLU            | -
                      |        | Conv           | kernel size = 3 × 3, number of channels = 1024, stride size = 2 × 2
                      |        | Instance norm  | -
                      |        | GLU            | -
                      |        | Conv           | kernel size = 3 × 3, number of channels = 1024, stride size = 1 × 1
                      |        | Instance norm  | -
                      |        | GLU            | -
                      |        | Conv           | kernel size = 1 × 3, number of channels = 1, stride size = 1 × 1
                      |        | Real/Fake      | -
Table 2. The average RMSE of the converted viola with the original viola/humming.

             | vs. Original Viola | vs. Original Humming
CycleGAN-VC  | 1.3193             | 1.7613
CycleGAN-VC2 | 1.2566             | 1.7171
CycleGAN-IC  | 1.3085             | 1.7206
CycleGAN-IC2 | 1.2264             | 1.6883
Table 3. The average MCD of the converted viola with the original viola/humming.

             | vs. Original Viola | vs. Original Humming
CycleGAN-VC  | 4.7665             | 7.4907
CycleGAN-VC2 | 4.3093             | 7.6802
CycleGAN-IC  | 4.5214             | 7.8334
CycleGAN-IC2 | 4.2884             | 7.7419
Table 4. The percentage distribution of MOS in humming to viola.

             | Bad | Poor | Fair | Good | Excellent
CycleGAN-VC  | 0%  | 58%  | 42%  | 0%   | 0%
CycleGAN-VC2 | 0%  | 25%  | 73%  | 2%   | 0%
CycleGAN-IC  | 0%  | 46%  | 54%  | 0%   | 0%
CycleGAN-IC2 | 0%  | 10%  | 79%  | 11%  | 0%
Table 5. The voting results of CMOS in humming to viola.

                           | CycleGAN-VC | CycleGAN-VC2 | CycleGAN-IC | CycleGAN-IC2 | Equal
CycleGAN-IC: CycleGAN-VC   | 12%         | -            | 18%         | -            | 70%
CycleGAN-IC: CycleGAN-VC2  | -           | 39%          | 24%         | -            | 37%
CycleGAN-IC2: CycleGAN-VC  | 12%         | -            | -           | 60%          | 28%
CycleGAN-IC2: CycleGAN-VC2 | -           | 24%          | -           | 50%          | 26%
CycleGAN-IC: CycleGAN-IC2  | -           | -            | 13%         | 60%          | 27%
Table 6. The average RMSE for MIR-QBSH-4 in singing to viola.

CycleGAN-IC2 | 1.1697
CycleGAN-ICd | 1.0473
Table 7. The average MCD for MIR-QBSH-4 in singing to viola.

CycleGAN-IC2 | 4.0608
CycleGAN-ICd | 3.9803
Table 8. The percentage distribution of MOS in singing to viola.

             | Bad | Poor | Fair | Good | Excellent
CycleGAN-IC2 | 0%  | 17%  | 73%  | 11%  | 0%
CycleGAN-ICd | 0%  | 10%  | 68%  | 22%  | 0%
Table 9. The voting results of CMOS in singing to viola.

                           | CycleGAN-IC2 | CycleGAN-ICd | Equal
CycleGAN-IC2: CycleGAN-ICd | 23%          | 56%          | 21%