FakeMusicCaps: A Dataset for Detection and Attribution of Synthetic Music Generated via Text-to-Music Models
Abstract
1. Introduction
- In this paper, we release FakeMusicCaps, the first dataset specifically designed for both detection and attribution of fake music. The dataset is created using only open-source text-to-music models, making the generation process fully transparent.
- Using simple network architectures, we analyze the detection and, for the first time, the attribution of fake music generated via TTM models. We consider both closed-set and open-set classification scenarios, including music generated via Suno in the open-set case (a minimal sketch of the corresponding decision rule follows this list).
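To make the closed-set vs. open-set distinction concrete, the following is a minimal sketch of one common open-set decision rule, thresholding the maximum softmax probability in the spirit of the Hendrycks and Gimpel baseline cited later in this paper. The classifier, class list, and threshold value are illustrative placeholders, not necessarily the exact configuration used here.

```python
import torch
import torch.nn.functional as F

def attribute(classifier: torch.nn.Module,
              waveform: torch.Tensor,
              class_names: list[str],
              tau: float = 0.5) -> str:
    """Attribute a clip to a known TTM generator, or reject it as unknown.

    `classifier`, `class_names`, and `tau` are illustrative placeholders.
    """
    classifier.eval()
    with torch.no_grad():
        logits = classifier(waveform.unsqueeze(0))    # shape (1, K)
        probs = F.softmax(logits, dim=-1).squeeze(0)  # shape (K,)
    confidence, idx = probs.max(dim=-1)
    # Open-set rule: reject low-confidence clips as coming from an unknown generator.
    if confidence.item() < tau:
        return "unknown"
    # Closed-set rule: always return the most likely known generator.
    return class_names[idx.item()]
```

In the closed-set scenario the rejection branch is effectively disabled (every clip is assigned to one of the known generators), while in the open-set scenario the threshold is typically tuned on held-out data.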
2. Problem Formulation
3. FakeMusicCaps Dataset
3.1. Considered Architectures
- TTM02-MusicLDM [10] is a latent diffusion model operating on compressed audio representations extracted via HiFi-GAN [48]. It adapts AudioLDM to the musical domain by introducing beat-synchronous audio mixup and beat-synchronous latent mixup strategies to augment the training data. Text conditioning is provided via CLAP [49], which the authors fine-tune on 20,000 h of music. The MusicLDM model is then trained on the Audiostock dataset [49], containing 455.6 h of music.
- TTM03-AudioLDM2 [9] is a latent diffusion model in which the audio is compressed via a Variational AutoEncoder (VAE) and HiFi-GAN, similarly to the AudioLDM pipeline. The major difference with respect to the previous version is that the diffusion model is conditioned through AudioMAE [50], which enables the adoption of a “Language of Audio” to generate a wide variety of audio types. To build FakeMusicCaps, we use the audioldm2-music checkpoint, which is specifically trained for text-to-music generation (a minimal usage sketch follows this list).
- TTM05-Mustango [12] is a diffusion-based TTM model that, through a Music-domain-knowledge-informed UNet (MuNet), injects music concepts such as chords, beats, key, and tempo into the generated music during the reverse diffusion process. Through data augmentation, the authors generate the MusicBench dataset, composed of 53,168 tracks, to train the model. The model generates audio at a 16 kHz sampling rate.
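As a concrete illustration of how such models can be queried, the sketch below assumes the Hugging Face diffusers library, its AudioLDM2Pipeline, and the publicly available cvssp/audioldm2-music checkpoint. The prompt, clip duration, and number of inference steps are illustrative assumptions and may differ from the exact settings used to build FakeMusicCaps.

```python
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

# Sketch only: prompt, duration, and step count are illustrative assumptions.
pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2-music", torch_dtype=torch.float16
).to("cuda")

caption = "A mellow acoustic guitar melody accompanied by soft percussion"
audio = pipe(
    caption,
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]

# AudioLDM2 renders waveforms at a 16 kHz sampling rate.
scipy.io.wavfile.write("ttm03_example.wav", rate=16000, data=audio)
```

The same caption-driven procedure applies, with the appropriate model-specific pipeline, to the other open-source TTM systems listed above.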
3.2. Generation Strategy
4. Experimental Analysis
4.1. Dataset
4.2. Baselines
4.3. Training
4.4. Classification Techniques
5. Results
5.1. Closed-Set Performances
5.2. Open Set Performances
5.3. Impact of Window Size
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Briot, J.P.; Hadjeres, G.; Pachet, F.D. Deep Learning Techniques For Music Generation; Springer: Berlin/Heidelberg, Germany, 2020; Volume 1. [Google Scholar]
- Kumar, R.; Seetharaman, P.; Luebs, A.; Kumar, I.; Kumar, K. High-fidelity audio compression with improved RVQGAN. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
- Défossez, A.; Copet, J.; Synnaeve, G.; Adi, Y. High Fidelity Neural Audio Compression. arXiv 2022, arXiv:2210.13438. [Google Scholar]
- Agostinelli, A.; Denk, T.I.; Borsos, Z.; Engel, J.; Verzetti, M.; Caillon, A.; Huang, Q.; Jansen, A.; Roberts, A.; Tagliasacchi, M.; et al. MusicLM: Generating music from text. arXiv 2023, arXiv:2301.11325. [Google Scholar]
- Copet, J.; Kreuk, F.; Gat, I.; Remez, T.; Kant, D.; Synnaeve, G.; Adi, Y.; Défossez, A. Simple and controllable music generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
- Ziv, A.; Gat, I.; Lan, G.L.; Remez, T.; Kreuk, F.; Défossez, A.; Copet, J.; Synnaeve, G.; Adi, Y. Masked audio generation using a single non-autoregressive transformer. arXiv 2024, arXiv:2401.04577. [Google Scholar]
- Tal, O.; Ziv, A.; Gat, I.; Kreuk, F.; Adi, Y. Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation. arXiv 2024, arXiv:2406.10970. [Google Scholar]
- Liu, H.; Chen, Z.; Yuan, Y.; Mei, X.; Liu, X.; Mandic, D.; Wang, W.; Plumbley, M.D. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Liu, H.; Yuan, Y.; Liu, X.; Mei, X.; Kong, Q.; Tian, Q.; Wang, Y.; Wang, W.; Wang, Y.; Plumbley, M.D. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2871–2883. [Google Scholar] [CrossRef]
- Chen, K.; Wu, Y.; Liu, H.; Nezhurina, M.; Berg-Kirkpatrick, T.; Dubnov, S. MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1206–1210. [Google Scholar]
- Huang, Q.; Park, D.S.; Wang, T.; Denk, T.I.; Ly, A.; Chen, N.; Zhang, Z.; Zhang, Z.; Yu, J.; Frank, C.; et al. Noise2music: Text-conditioned music generation with diffusion models. arXiv 2023, arXiv:2302.03917. [Google Scholar]
- Melechovsky, J.; Guo, Z.; Ghosal, D.; Majumder, N.; Herremans, D.; Poria, S. Mustango: Toward Controllable Text-to-Music Generation. In Proceedings of the NAACL, Mexico City, Mexico, 16–21 June 2024; Association for Computational Linguistics: Vienna, Austria, 2024; pp. 8293–8316. [Google Scholar] [CrossRef]
- Ronchini, F.; Comanducci, L.; Perego, G.; Antonacci, F. PAGURI: A user experience study of creative interaction with text-to-music models. arXiv 2024, arXiv:2407.04333. [Google Scholar]
- Suno. Available online: https://suno.com/ (accessed on 12 September 2024).
- Udio | AI Music Generator—Official Website. Available online: https://www.udio.com/ (accessed on 12 September 2024).
- Feffer, M.; Lipton, Z.C.; Donahue, C. DeepDrake ft. BTS-GAN and TayloRVC: An Exploratory Analysis of Musical Deepfakes and Hosting Platforms. In Proceedings of the HCMIR@ISMIR, Milan, Italy, 5–9 November 2023. [Google Scholar]
- Sha, Z.; Li, Z.; Yu, N.; Zhang, Y. De-fake: Detection and attribution of fake images generated by text-to-image generation models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, Copenhagen, Denmark, 26–30 November 2023; pp. 3418–3432. [Google Scholar]
- Yu, N.; Davis, L.; Fritz, M. Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Corvi, R.; Cozzolino, D.; Zingarini, G.; Poggi, G.; Nagano, K.; Verdoliva, L. On the detection of synthetic images generated by diffusion models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Abady, L.; Wang, J.; Tondi, B.; Barni, M. A siamese-based verification system for open-set architecture attribution of synthetic images. Pattern Recognit. Lett. 2024, 180, 75–81. [Google Scholar] [CrossRef]
- Wißmann, A.; Zeiler, S.; Nickel, R.M.; Kolossa, D. Whodunit: Detection and Attribution of Synthetic Images by Leveraging Model-specific Fingerprints. In Proceedings of the ACM International Workshop on Multimedia AI against Disinformation (MAD), Phuket, Thailand, 10–14 June 2024. [Google Scholar]
- Mandelli, S.; Bestagini, P.; Verdoliva, L.; Tubaro, S. Facing Device Attribution Problem for Stabilized Video Sequences. IEEE Trans. Inf. Forensics Secur. 2019, 15, 14–27. [Google Scholar] [CrossRef]
- Wu, H.; Tseng, Y.; Lee, H.y. CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems. In Proceedings of the Interspeech, Kos Island, Greece, 1–5 September 2024. [Google Scholar]
- Salvi, D.; Bestagini, P.; Tubaro, S. Exploring the Synthetic Speech Attribution Problem Through Data-Driven Detectors. In Proceedings of the IEEE International Workshop on Information Forensics and Security (WIFS), Shanghai, China, 12–16 December 2022. [Google Scholar]
- Bhagtani, K.; Bartusiak, E.R.; Yadav, A.K.S.; Bestagini, P.; Delp, E.J. Synthesized Speech Attribution Using The Patchout Spectrogram Attribution Transformer. In Proceedings of the ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), Chicago, IL, USA, 28–30 June 2023. [Google Scholar]
- Zang, Y.; Zhang, Y.; Heydari, M.; Duan, Z. Singfake: Singing voice deepfake detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12156–12160. [Google Scholar]
- Xie, Y.; Zhou, J.; Lu, X.; Jiang, Z.; Yang, Y.; Cheng, H.; Ye, L. FSD: An initial Chinese dataset for fake song detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 4605–4609. [Google Scholar]
- Chen, X.; Wu, H.; Jang, J.S.R.; Lee, H.y. Singing Voice Graph Modeling for SingFake Detection. In Proceedings of the Interspeech, Kos Island, Greece, 1–5 September 2024. [Google Scholar]
- Desblancs, D.; Meseguer-Brocal, G.; Hennequin, R.; Moussallam, M. From Real to Cloned Singer Identification. In Proceedings of the 25th International Society for Music Information Retrieval Conference, San Francisco, CA, USA, 10–14 November 2024. [Google Scholar]
- Guragain, A.; Liu, T.; Pan, Z.; Sailor, H.B.; Wang, Q. Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024. In Proceedings of the 2024 IEEE Spoken Language Technology Workshop, Macao, China, 2–5 December 2024. [Google Scholar]
- Yizhi, L.; Yuan, R.; Zhang, G.; Ma, Y.; Chen, X.; Yin, H.; Xiao, C.; Lin, C.; Ragni, A.; Benetos, E.; et al. MERT: Acoustic music understanding model with large-scale self-supervised training. In Proceedings of The Twelfth International Conference on Learning Representations, Singapore, 24–28 April 2023. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Li, Y.; Milling, M.; Specia, L.; Schuller, B.W. From Audio Deepfake Detection to AI-Generated Music Detection–A Pathway and Overview. arXiv 2024, arXiv:2412.00571. [Google Scholar]
- Afchar, D.; Meseguer-Brocal, G.; Hennequin, R. AI-Generated Music Detection and its Challenges. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kothaguda, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Wei, Z.; Ye, D.; Deng, J.; Lin, Y. From voices to beats: Enhancing music deepfake detection by identifying forgeries in background. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kothaguda, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Tak, H.; Patino, J.; Todisco, M.; Nautsch, A.; Evans, N.; Larcher, A. End-to-end anti-spoofing with rawnet2. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
- Jung, J.w.; Heo, H.S.; Tak, H.; Shim, H.j.; Chung, J.S.; Lee, B.J.; Yu, H.J.; Evans, N. AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022. [Google Scholar]
- Gemmeke, J.F.; Ellis, D.P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar]
- Manco, I.; Weck, B.; Doh, S.; Won, M.; Zhang, Y.; Bogdanov, D.; Wu, Y.; Chen, K.; Tovstogan, P.; Benetos, E.; et al. The Song Describer Dataset: A Corpus of Audio Captions for Music-and-Language Evaluation. In Proceedings of the Machine Learning for Audio Workshop at NeurIPS, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Roy, A.; Liu, R.; Lu, T.; Herremans, D. JamendoMaxCaps: A Large-Scale Music-Caption Dataset with Imputed Metadata. arXiv 2025, arXiv:2502.07461. [Google Scholar]
- Chu, Y.; Xu, J.; Yang, Q.; Wei, H.; Wei, X.; Guo, Z.; Leng, Y.; Lv, Y.; He, J.; Lin, J.; et al. Qwen2-audio technical report. arXiv 2024, arXiv:2407.10759. [Google Scholar]
- Evans, Z.; Parker, J.D.; Carr, C.; Zukowski, Z.; Taylor, J.; Pons, J. Stable Audio Open. arXiv 2024, arXiv:2407.14358. [Google Scholar]
- Civit, M.; Drai-Zerbib, V.; Lizcano, D.; Escalona, M.J. SunoCaps: A novel dataset of text-prompt based AI-generated music with emotion annotations. Data Brief 2024, 55, 110743. [Google Scholar] [CrossRef] [PubMed]
- Rahman, M.A.; Hakim, Z.I.A.; Sarker, N.H.; Paul, B.; Fattah, S.A. SONICS: Synthetic Or Not—Identifying Counterfeit Songs. In Proceedings of the Thirteenth International Conference on Learning Representations, Las Vegas, NV, USA, 11–13 August 2025. [Google Scholar]
- Li, Y.; Sun, Q.; Li, H.; Specia, L.; Schuller, B.W. Detecting Machine-Generated Music with Explainability–A Challenge and Early Benchmarks. arXiv 2024, arXiv:2412.13421. [Google Scholar]
- Li, Y.; Li, H.; Specia, L.; Schuller, B.W. M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases. arXiv 2024, arXiv:2412.06001. [Google Scholar]
- Kim, C.D.; Kim, B.; Lee, H.; Kim, G. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 119–132. [Google Scholar]
- Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033. [Google Scholar]
- Wu, Y.; Chen, K.; Zhang, T.; Hui, Y.; Berg-Kirkpatrick, T.; Dubnov, S. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Ialysos, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Huang, P.Y.; Xu, H.; Li, J.; Baevski, A.; Auli, M.; Galuba, W.; Metze, F.; Feichtenhofer, C. Masked autoencoders that listen. Adv. Neural Inf. Process. Syst. 2022, 35, 28708–28720. [Google Scholar]
- Evans, Z.; Parker, J.D.; Carr, C.; Zukowski, Z.; Taylor, J.; Pons, J. Long-form music generation with latent diffusion. arXiv 2024, arXiv:2404.10301. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Dai, W.; Dai, C.; Qu, S.; Li, J.; Das, S. Very deep convolutional neural networks for raw waveforms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 421–425. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Hendrycks, D.; Gimpel, K. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Kelleher, J.D.; Mac Namee, B.; D’arcy, A. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies; MIT Press: Cambridge, MA, USA, 2020. [Google Scholar]
- Sridhar, S.; Cartwright, M. Multi-Label Open-Set Audio Classification. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), Tampere, Finland, 20–22 September 2023; pp. 171–175. [Google Scholar]
- You, J.; Wu, W.; Lee, J. Open set classification of sound event. Sci. Rep. 2024, 14, 1282. [Google Scholar] [CrossRef] [PubMed]
| Model | Balanced Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| M5 | 0.90 | 0.90 | 0.90 | 0.90 |
| RawNet2 | 0.88 | 0.89 | 0.88 | 0.88 |
| ResNet18 + Spec | | | | |
| Model | Balanced Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| M5 | 0.76 | 0.76 | 0.76 | 0.75 |
| RawNet2 | 0.75 | 0.75 | 0.75 | 0.74 |
| ResNet18 + Spec | 0.85 | 0.78 | 0.85 | 0.80 |
| Model | Balanced Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| M5 | 0.42 | 0.67 | 0.42 | 0.48 |
| RawNet2 | 0.47 | 0.80 | 0.47 | 0.52 |
| ResNet18 + Spec | 0.48 | 0.80 | 0.48 | 0.56 |
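For reference, the metrics reported in the tables above can be reproduced with scikit-learn as sketched below; macro averaging over the generator classes is assumed here, and the label arrays are synthetic examples rather than actual dataset outputs. Note that balanced accuracy equals macro-averaged recall, which is why those two columns coincide in the tables.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, precision_score,
                             recall_score, f1_score)

# Illustrative metric computation, assuming macro averaging over classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # ground-truth generator labels
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3])   # classifier decisions

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("precision (macro):", precision_score(y_true, y_pred, average="macro"))
print("recall (macro):   ", recall_score(y_true, y_pred, average="macro"))
print("F1 score (macro): ", f1_score(y_true, y_pred, average="macro"))
```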
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).