Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment
Abstract
1. Introduction
2. Related Work
2.1. Singing Voice Synthesis
2.2. Audio Codec
2.3. Deep Generative Models
3. Dataset
3.1. Overview
3.2. Symbol
3.3. Alignment
4. Transinger
4.1. Posterior Encoder
4.2. Prior Encoder
4.3. Prior–Posterior Alignment
- Mean Squared Error (MSE) penalizes the mean squared Euclidean distance between the prior and posterior quantizations. By penalizing large deviations, it enforces coarse-grained structural consistency in the latent space, effectively aligning the overall acoustic characteristics (e.g., duration, energy envelope, spectral balance) between the music score and the actual audio. This establishes a foundational match before finer distinctions are addressed.
- Cosine Similarity measures the angular difference between latent vectors, thereby promoting alignment in vector direction. In the context of phonetic alignment, this encourages consistency in semantic identity—particularly important for capturing phoneme categories (e.g., vowels vs. consonants) and for supporting IPA-based decomposition strategies, where both symbol and modifier components must align directionally for accurate sub-phoneme modeling.
- L1 loss computes the absolute element-wise differences between the prior and posterior quantizations. Compared to MSE, it is less sensitive to outliers and extreme deviations. In our singing voice synthesis setting, this loss helps stabilize the alignment of phonetic attributes across latent dimensions, particularly under cross-lingual variation and expressive singing conditions. Its element-wise formulation supports more consistent matching of localized phonetic traits (e.g., voicing or articulation) without being dominated by occasional large errors. While it does not explicitly enforce phoneme correctness, it contributes to a more stable learning process in challenging alignment scenarios. A sketch combining the three terms is given after this list.
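As an illustration only, the following minimal PyTorch sketch shows one way the three alignment terms described above could be combined. The loss weights, tensor shapes, and equal weighting are assumptions for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


def alignment_loss(z_prior, z_post, w_mse=1.0, w_cos=1.0, w_l1=1.0):
    """Combine MSE, cosine-similarity, and L1 terms between the prior
    (score-side) and posterior (audio-side) latent quantizations.

    z_prior, z_post: tensors of shape (batch, frames, dim).
    The weights w_* are illustrative placeholders.
    """
    # Coarse structural consistency: penalize squared Euclidean deviation.
    l_mse = F.mse_loss(z_prior, z_post)
    # Directional (semantic) consistency: 1 - cosine similarity per frame.
    l_cos = (1.0 - F.cosine_similarity(z_prior, z_post, dim=-1)).mean()
    # Robust element-wise consistency, less sensitive to outliers.
    l_l1 = F.l1_loss(z_prior, z_post)
    return w_mse * l_mse + w_cos * l_cos + w_l1 * l_l1


# Example call with random latents of hypothetical size.
loss = alignment_loss(torch.randn(2, 100, 192), torch.randn(2, 100, 192))
```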
4.4. Generator Network
4.5. Discriminator Network
5. Experiments and Results
5.1. Cross-Language Generalization
- Model 1: Transinger, our proposed multilingual singing voice synthesis model.
- Model 2: a variant of VISinger2 [1] in which the prior encoder was adapted to support IPA-based phoneme encoding.
- Corpus 1 for monolingual training: This dataset contains only Chinese-language singing recordings, totaling 15.7 h at a 48 kHz sampling rate.
- Corpus 2 for multilingual training: This dataset includes all languages in our dataset, totaling 40.9 h at a 48 kHz sampling rate.
- Word Error Rate (WER): We transcribe the synthesized audio with Whisper [48] and compute the Word Error Rate (WER) against the reference text. This metric reflects the pronunciation accuracy and intelligibility of the singing voice. Words that are written differently but share the same pronunciation are counted as correct.
- F0 Root Mean Square Error (F0_RMSE): This metric quantifies the deviation between the predicted and reference fundamental frequency (F0) values, providing insight into the pitch accuracy and overall prosody of the synthesis (see the sketch after this list, which illustrates F0_RMSE together with VS_E).
- Mel Frequency Cepstral Distortion (MCD): MCD quantifies the difference between the mel cepstral coefficients of the synthesized audio and a reference, measuring how closely the synthesized speech matches the spectral characteristics of the reference.
- Voiced/Silence Error Rate (VS_E): This measures the error rate in determining whether a frame is voiced (i.e., contains vocalization) or silent. It is critical for evaluating the temporal consistency and proper activation of the singing voice.
- Emotional Similarity (ESIM): We use emotion2vec [49] as the evaluation metric for emotional similarity. It first encodes speech signals into emotion embeddings, then computes the cosine similarity between the embeddings of the reference and generated speech. A higher similarity indicates better alignment in emotional expression.
- Dynamic Range (DR): DR refers to the difference in loudness between the quietest and the loudest parts of an audio signal. A higher dynamic range indicates greater expressiveness and more pronounced amplitude variation.
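For concreteness, the sketch below shows how two of these objective metrics, F0_RMSE and VS_E, can be computed from frame-aligned F0 contours. The voicing convention (0 Hz marks unvoiced or silent frames) and the restriction of the RMSE to mutually voiced frames are assumptions of this sketch; the paper does not spell out its exact procedure.

```python
import numpy as np


def f0_rmse_and_vse(f0_ref, f0_syn):
    """Sketch of F0-RMSE and voiced/silence error rate.

    f0_ref, f0_syn: frame-aligned F0 contours in Hz, with 0 marking
    unvoiced/silent frames (e.g., from an external pitch tracker).
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)

    voiced_ref = f0_ref > 0
    voiced_syn = f0_syn > 0

    # VS_E: fraction of frames whose voiced/silent decision differs.
    vs_e = np.mean(voiced_ref != voiced_syn)

    # F0 RMSE over frames that both contours consider voiced.
    both = voiced_ref & voiced_syn
    f0_rmse = (np.sqrt(np.mean((f0_ref[both] - f0_syn[both]) ** 2))
               if both.any() else np.nan)
    return f0_rmse, vs_e
```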
5.2. Zero-Shot Language Generalization
- IPA-Merged Encoding: Based on the International Phonetic Alphabet (IPA), this method combines letters and diacritics into a single encoding unit, treating each phoneme as an indivisible entity.
- IPA-Disentangled Encoding: This approach embeds IPA letters and diacritics separately and then fuses the resulting features (a minimal sketch of this encoding follows the metric list below).
- Phoneme combinations not present in the training set, where the constituent letters and diacritics exist in different languages within the training samples.
- Phoneme combinations absent from the training set, with one constituent (either letter or diacritic) present in the dataset and the other absent.
- Phoneme combinations entirely unseen by the model, including diacritics never encountered during training.
- Fundamental Frequency Root Mean Square Error (F0-RMSE): Measures the difference between estimated and true fundamental frequency values.
- Mel Spectral Distortion (MSD): Assesses the degree of difference between Mel spectrograms.
- Voiced/Silence Error Rate (VS_E): Calculates the proportion of errors in classifying voiced and silent segments.
- Speaker Similarity Assessment (SA): Quantifies the similarity between synthesized speech and the target speaker.
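To make the contrast between the two encodings concrete, here is a minimal, hypothetical sketch of the IPA-disentangled variant: letters and diacritics are indexed into separate embedding tables and fused into a single phoneme representation. The vocabulary sizes, dimensions, and concatenation-plus-linear fusion rule are illustrative assumptions rather than Transinger's actual configuration.

```python
import torch
import torch.nn as nn


class DisentangledIPAEmbedding(nn.Module):
    """Embed IPA letters and diacritics separately, then fuse them."""

    def __init__(self, n_letters=128, n_diacritics=64, dim=192):
        super().__init__()
        self.letter_emb = nn.Embedding(n_letters, dim, padding_idx=0)
        self.diacritic_emb = nn.Embedding(n_diacritics, dim, padding_idx=0)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, letter_ids, diacritic_ids):
        # letter_ids, diacritic_ids: (batch, phonemes); index 0 means
        # "no diacritic" / padding.
        letters = self.letter_emb(letter_ids)
        diacritics = self.diacritic_emb(diacritic_ids)
        # Fuse the two sub-phoneme views into one phoneme vector.
        return self.fuse(torch.cat([letters, diacritics], dim=-1))
```

A merged encoding would instead assign one embedding index to each full letter-plus-diacritic combination, which is why combinations unseen during training cannot be represented, whereas the disentangled variant can still compose them from known parts.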
6. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhang, Y.; Xue, H.; Li, H.; Xie, L.; Guo, T.; Zhang, R.; Gong, C. VISinger2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023; pp. 4444–4448. [Google Scholar] [CrossRef]
- Cho, Y.-P.; Yang, F.-R.; Chang, Y.-C.; Cheng, C.-T.; Wang, X.-H.; Liu, Y.-W. A Survey on Recent Deep Learning-driven Singing Voice Synthesis Systems. In Proceedings of the 2021 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), Taichung, Taiwan, 15–17 November 2021; IEEE Computer Society: Los Alamitos, CA, USA, 2021; pp. 319–323. [Google Scholar] [CrossRef]
- Zhang, Y.; Cong, J.; Xue, H.; Xie, L.; Zhu, P.; Bi, M. VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis. In Proceedings of the ICASSP 2022—IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7237–7241. [Google Scholar]
- Kenmochi, H.; Ohshita, H. VOCALOID—Commercial Singing Synthesizer Based on Sample Concatenation. In Proceedings of the Interspeech 2007, Antwerp, Belgium, 27–31 August 2007; pp. 4009–4010. [Google Scholar]
- Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; von Platen, P.; Saraf, Y.; Pino, J.M.; et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. arXiv 2021, arXiv:2111.09296. [Google Scholar]
- Conneau, A.; Baevski, A.; Collobert, R.; Mohamed, A.; Auli, M. Unsupervised Cross-lingual Representation Learning for Speech Recognition. arXiv 2020, arXiv:2006.13979. [Google Scholar]
- Zhu, J.; Yang, C.; Samir, F.; Islam, J. The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 750–772. [Google Scholar] [CrossRef]
- Zhou, H.; Lin, Y.; Shi, Y.; Sun, P.; Li, M. Bisinger: Bilingual Singing Voice Synthesis. In Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16–20 December 2023. [Google Scholar] [CrossRef]
- Wang, X.; Zeng, C.; Chen, J.; Wang, C. Crosssinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers. In Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16–20 December 2023. [Google Scholar] [CrossRef]
- Hwang, J.; Noh, H.; Hong, Y.; Oh, I. X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning. In Proceedings of the Interspeech 2024, Kos, Greece, 1–5 September 2024; pp. 1885–1889. [Google Scholar] [CrossRef]
- Zhang, Y.; Pan, C.; Guo, W.; Li, R.; Zhu, Z.; Wang, J.; Xu, W.; Lu, J.; Hong, Z.; Wang, C.; et al. GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks. In Proceedings of the Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2024. Available online: https://openreview.net/forum?id=z64azPC6Nl (accessed on 25 January 2025).
- Gulati, A.; Qin, J.; Chiu, C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- Saino, K.; Zen, H.; Nankaku, Y.; Lee, A.; Tokuda, K. An HMM-based singing voice synthesis system. In Proceedings of the Interspeech 2006—ICSLP, Pittsburgh, PA, USA, 17–21 September 2006; pp. 2274–2277. [Google Scholar] [CrossRef]
- Nakamura, K.; Oura, K.; Nankaku, Y.; Tokuda, K. HMM-Based singing voice synthesis and its application to Japanese and English. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 4–9 May 2014; pp. 265–269. [Google Scholar] [CrossRef]
- Kim, J.; Choi, H.; Park, J.; Hahn, M.; Kim, S.; Kim, J.-J. Korean Singing Voice Synthesis System based on an LSTM Recurrent Neural Network. arXiv. 2018. Available online: https://api.semanticscholar.org/CorpusID:53053431 (accessed on 30 January 2025).
- Hono, Y.; Hashimoto, K.; Oura, K.; Nankaku, Y.; Tokuda, K. Singing Voice Synthesis Based on Generative Adversarial Networks. In Proceedings of the ICASSP 2019—IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6955–6959. [Google Scholar]
- Chen, J.; Tan, X.; Luan, J.; Qin, T.; Liu, T.-Y. HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis. arXiv 2020, arXiv:2009.01776. [Google Scholar]
- Liu, J.; Li, C.; Ren, Y.; Chen, F.; Zhao, Z. DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence. 2021. Available online: https://api.semanticscholar.org/CorpusID:235262772 (accessed on 18 February 2025).
- Valin, J.-M.; Vos, K.; Terriberry, T.B. Definition of the Opus Audio Codec. RFC 2012, 6716, 1–326. [Google Scholar]
- Bessette, B.; Salami, R.; Lefebvre, R.; Jelinek, M.; Rotola-Pukkila, J.; Vainio, J.; Mikkola, H.; Järvinen, K. The adaptive multirate wideband speech codec (AMR-WB). IEEE Trans. Speech Audio Process. 2002, 10, 620–636. [Google Scholar] [CrossRef]
- van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural Discrete Representation Learning. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Zeghidour, N.; Luebs, A.; Omran, A.; Skoglund, J.; Tagliasacchi, M. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 495–507. [Google Scholar] [CrossRef]
- Défossez, A.; Copet, J.; Synnaeve, G.; Adi, Y. High Fidelity Neural Audio Compression. arXiv 2022, arXiv:2210.13438. [Google Scholar]
- Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces. In Proceedings of the 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Maui, HI, USA, 3–6 June 1991; pp. 586–591. [Google Scholar] [CrossRef]
- Hyvärinen, A.; Karhunen, J.; Oja, E. Independent Component Analysis, 1st ed.; John Wiley & Sons: Hoboken, NJ, USA, 2001; ISBN 9780471405405. [Google Scholar]
- Starner, T.; Pentland, A. Real-time American Sign Language recognition from video using hidden Markov models. In Proceedings of the International Symposium on Computer Vision (ISCV), Coral Gables, FL, USA, 21–23 November 1995; pp. 265–270. [Google Scholar] [CrossRef]
- Ranzato, M.; Mnih, V.; Hinton, G.E. Generating more realistic images using gated MRF’s. In Proceedings of the Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, Vancouver, BC, Canada, 6–9 December 2010; pp. 2002–2010. [Google Scholar]
- Hinton, G.E. Reducing the Dimensionality of Data with Neural Networks. Available online: https://api.semanticscholar.org/CorpusID:14820265 (accessed on 27 February 2025).
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative Adversarial Nets. In Proceedings of Neural Information Processing Systems. 2014. Available online: https://api.semanticscholar.org/CorpusID:261560300 (accessed on 17 February 2025).
- Sohn, K.; Lee, H.; Yan, X. Learning Structured Output Representation using Deep Conditional Generative Models. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
- Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar] [CrossRef]
- Berthelot, D.; Schumm, T.; Metz, L. BEGAN: Boundary Equilibrium Generative Adversarial Networks. arXiv 2017, arXiv:1703.10717. [Google Scholar]
- Vahdat, A.; Kautz, J. NVAE: A Deep Hierarchical Variational Autoencoder. arXiv 2020, arXiv:2007.03898. [Google Scholar]
- Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned similarity metric. arXiv 2015, arXiv:1512.09300. [Google Scholar]
- Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial Autoencoders. arXiv 2016, arXiv:1511.05644. [Google Scholar]
- Bao, J.; Chen, D.; Wen, F.; Li, H.; Hua, G. CVAE-GAN: Fine-Grained Image Generation through Asymmetric Training. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2764–2773. [Google Scholar]
- Razavi, A.; van den Oord, A.; Vinyals, O. Generating Diverse High-Fidelity Images with VQ-VAE-2. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685. [Google Scholar] [CrossRef]
- Shi, J.; Lin, Y.; Bai, X.; Zhang, K.; Wu, Y.; Tang, Y.; Yu, Y.; Jin, Q.; Watanabe, S. Singing Voice Data Scaling-Up: An Introduction to ACE-Opencpop and ACE-KiSing. Interspeech. 2024. Available online: https://api.semanticscholar.org/CorpusID:270225976 (accessed on 21 February 2025).
- Wang, Y.; Wang, X.; Zhu, P.; Wu, J.; Li, H.; Xue, H.; Zhang, Y.; Xie, L.; Bi, M. Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis. arXiv 2022, arXiv:2201.07429. [Google Scholar]
- Kim, J.-Y. K-Pop Acts Use More English Lyrics Now, Especially Girl Groups: Study. Korea JoongAng Daily, 2023. Available online: https://koreajoongangdaily.joins.com/news/2023-10-31/entertainment/kpop/Kpop-acts-use-more-English-lyrics-now-especially-girl-groups-Study/1902506 (accessed on 23 February 2025).
- McAuliffe, M.; Socolof, M.; Mihuc, S.; Wagner, M.; Sonderegger, M. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017. [Google Scholar]
- Xu, Y.; Chen, H.; Yu, J.; Tan, W.; Gu, R.; Lei, S.; Lin, Z.; Wu, Z. MuCodec: Ultra Low-Bitrate Music Codec. arXiv 2024, arXiv:2409.13216. [Google Scholar]
- Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. arXiv 2020, arXiv:2010.05646. [Google Scholar]
- Jang, W.; Lim, D.C.Y.; Yoon, J.; Kim, B.; Kim, J. UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. In Proceedings of the Interspeech, Brno, Czechia, 30 August–3 September 2021. [Google Scholar]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
- Ma, Z.; Zheng, Z.; Ye, J.; Li, J.; Gao, Z.; Zhang, S.; Chen, X. emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. In Findings of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Bangkok, Thailand, 2024. [Google Scholar]
Data Scope | Model | Language | WER ↓ | Pronunciation ↑ | F0_RMSE ↓ | MCD ↓ | VS_E ↓ | ESIM ↑ | DR ↑ | Expressiveness ↑
---|---|---|---|---|---|---|---|---|---|---
Corpus 2 | Model 1 | CN | 22.77 (±2.11)% | 3.69 (±0.05) | 5.01 | 6.593 | 9.75% | 3.76 | 77.47 | 3.37 (±0.06)
Corpus 2 | Model 1 | ALL | 23.98 (±1.94)% | 3.43 (±0.10) | 5.31 | 7.002 | 9.68% | 3.44 | 78.87 | 3.33 (±0.09)
Corpus 2 | Model 2 | CN | 26.73 (±2.21)% * | 3.59 (±0.06) * | 6.70 | 6.700 | 10.43% | 3.15 | 80.86 | 3.08 (±0.07) ***
Corpus 2 | Model 2 | ALL | 28.11 (±2.03)% ** | 3.41 (±0.11) ns | 6.88 | 7.016 | 10.52% | 2.67 | 81.69 | 3.07 (±0.09) ***
Corpus 1 | Model 1 | CN | 35.64 (±2.37)% *** | 3.61 (±0.07) * | 4.454 | 6.526 | 5.56% | 3.36 | 78.62 | 3.26 (±0.09) *
Corpus 1 | Model 2 | CN | 38.61 (±2.40)% ns | 3.46 (±0.08) ‡ | 4.812 | 6.651 | 5.92% | 3.22 | 80.87 | 3.02 (±0.11) ‡
Ground Truth | / | CN | 7.61 (±1.40)% *** | 4.77 (±0.04) *** | / | / | / | / | 82.66 | 4.68 (±0.08) ***
Ground Truth | / | ALL | 8.35 (±1.31)% *** | 4.76 (±0.05) *** | / | / | / | / | 83.48 | 4.71 (±0.07) ***
Data Format | PER | PRR | F0-RMSE | MSD | VS_E | SA
---|---|---|---|---|---|---
Merged L&D | 7.4398 | 75.1204 | | | |
Split (L + D) | 6.4204 | 59.2089 | | | |