FakeMusicCaps: A Dataset for Detection and Attribution of Synthetic Music Generated via Text-to-Music Models
Abstract
1. Introduction
- In this paper, we release FakeMusicCaps, the first dataset specifically designed for both detection and attribution of fake music. The dataset is created using only open-source text-to-music models, making the generation process fully transparent.
- Using simple network architectures, we analyze the detection and, for the first time, the attribution of fake music generated via TTM models. We consider both closed-set and open-set classification scenarios, including music generated via Suno in the open-set case (a minimal sketch of the corresponding decision rule follows this list).
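To make the closed-set vs. open-set distinction concrete, the following is a minimal sketch of one common open-set decision rule, thresholding the maximum softmax probability in the spirit of the Hendrycks and Gimpel baseline cited later in this paper. The classifier, class list, and threshold value are illustrative placeholders, not necessarily the exact configuration used here.

```python
import torch
import torch.nn.functional as F

def attribute(classifier: torch.nn.Module,
              waveform: torch.Tensor,
              class_names: list[str],
              tau: float = 0.5) -> str:
    """Attribute a clip to a known TTM generator, or reject it as unknown.

    `classifier`, `class_names`, and `tau` are illustrative placeholders.
    """
    classifier.eval()
    with torch.no_grad():
        logits = classifier(waveform.unsqueeze(0))    # shape (1, K)
        probs = F.softmax(logits, dim=-1).squeeze(0)  # shape (K,)
    confidence, idx = probs.max(dim=-1)
    # Open-set rule: reject low-confidence clips as coming from an unknown generator.
    if confidence.item() < tau:
        return "unknown"
    # Closed-set rule: always return the most likely known generator.
    return class_names[idx.item()]
```

In the closed-set scenario the rejection branch is effectively disabled (every clip is assigned to one of the known generators), while in the open-set scenario the threshold is typically tuned on held-out data.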
2. Problem Formulation
3. FakeMusicCaps Dataset
3.1. Considered Architectures
- TTM02-MusicLDM [10] is a latent diffusion model operating on compressed audio representations extracted via HiFi-GAN [48]. It adapts AudioLDM to the musical domain by introducing beat-synchronous audio mixup and beat-synchronous latent mixup strategies to augment the training data. Text conditioning is provided via CLAP [49], which the authors fine-tune on 20,000 h of music. The MusicLDM model is then trained on the Audiostock dataset [49], containing 455.6 h of music.
- TTM03-AudioLDM2 [9] is a latent diffusion model in which the audio is compressed via a Variational AutoEncoder (VAE) and HiFi-GAN, similarly to the AudioLDM pipeline. The major difference with respect to the previous version is that the diffusion model is conditioned through AudioMAE [50], which enables the adoption of a “Language of Audio” to generate a wide variety of audio types. To build FakeMusicCaps, we use the audioldm2-music checkpoint, which is specifically trained for text-to-music generation (a minimal usage sketch follows this list).
- TTM05-Mustango [12] is a diffusion-based TTM model that, through a Music-domain-knowledge-informed UNet (MuNet), injects music concepts such as chords, beats, key, and tempo into the generated music during the reverse diffusion process. Through data augmentation, the authors generate the MusicBench dataset, composed of 53,168 tracks, to train the model. The model generates audio at a 16 kHz sampling rate.
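As a concrete illustration of how such models can be queried, the sketch below assumes the Hugging Face diffusers library, its AudioLDM2Pipeline, and the publicly available cvssp/audioldm2-music checkpoint. The prompt, clip duration, and number of inference steps are illustrative assumptions and may differ from the exact settings used to build FakeMusicCaps.

```python
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

# Sketch only: prompt, duration, and step count are illustrative assumptions.
pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2-music", torch_dtype=torch.float16
).to("cuda")

caption = "A mellow acoustic guitar melody accompanied by soft percussion"
audio = pipe(
    caption,
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]

# AudioLDM2 renders waveforms at a 16 kHz sampling rate.
scipy.io.wavfile.write("ttm03_example.wav", rate=16000, data=audio)
```

The same caption-driven procedure applies, with the appropriate model-specific pipeline, to the other open-source TTM systems listed above.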
3.2. Generation Strategy
4. Experimental Analysis
4.1. Dataset
4.2. Baselines
4.3. Training
4.4. Classification Techniques
5. Results
5.1. Closed-Set Performances
5.2. Open Set Performances
5.3. Impact of Window Size
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Briot, J.P.; Hadjeres, G.; Pachet, F.D. Deep Learning Techniques For Music Generation; Springer: Berlin/Heidelberg, Germany, 2020; Volume 1. [Google Scholar]
- Kumar, R.; Seetharaman, P.; Luebs, A.; Kumar, I.; Kumar, K. High-fidelity audio compression with improved RVQGAN. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
- Défossez, A.; Copet, J.; Synnaeve, G.; Adi, Y. High Fidelity Neural Audio Compression. arXiv 2022, arXiv:2210.13438. [Google Scholar]
- Agostinelli, A.; Denk, T.I.; Borsos, Z.; Engel, J.; Verzetti, M.; Caillon, A.; Huang, Q.; Jansen, A.; Roberts, A.; Tagliasacchi, M.; et al. MusicLM: Generating music from text. arXiv 2023, arXiv:2301.11325. [Google Scholar]
- Copet, J.; Kreuk, F.; Gat, I.; Remez, T.; Kant, D.; Synnaeve, G.; Adi, Y.; Défossez, A. Simple and controllable music generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
- Ziv, A.; Gat, I.; Lan, G.L.; Remez, T.; Kreuk, F.; Défossez, A.; Copet, J.; Synnaeve, G.; Adi, Y. Masked audio generation using a single non-autoregressive transformer. arXiv 2024, arXiv:2401.04577. [Google Scholar]
- Tal, O.; Ziv, A.; Gat, I.; Kreuk, F.; Adi, Y. Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation. arXiv 2024, arXiv:2406.10970. [Google Scholar]
- Liu, H.; Chen, Z.; Yuan, Y.; Mei, X.; Liu, X.; Mandic, D.; Wang, W.; Plumbley, M.D. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Liu, H.; Yuan, Y.; Liu, X.; Mei, X.; Kong, Q.; Tian, Q.; Wang, Y.; Wang, W.; Wang, Y.; Plumbley, M.D. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2871–2883. [Google Scholar] [CrossRef]
- Chen, K.; Wu, Y.; Liu, H.; Nezhurina, M.; Berg-Kirkpatrick, T.; Dubnov, S. MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1206–1210. [Google Scholar]
- Huang, Q.; Park, D.S.; Wang, T.; Denk, T.I.; Ly, A.; Chen, N.; Zhang, Z.; Zhang, Z.; Yu, J.; Frank, C.; et al. Noise2music: Text-conditioned music generation with diffusion models. arXiv 2023, arXiv:2302.03917. [Google Scholar]
- Melechovsky, J.; Guo, Z.; Ghosal, D.; Majumder, N.; Herremans, D.; Poria, S. Mustango: Toward Controllable Text-to-Music Generation. In Proceedings of the NAACL, Mexico City, Mexico, 16–21 June 2024; Association for Computational Linguistics: Vienna, Austria, 2024; pp. 8293–8316. [Google Scholar] [CrossRef]
- Ronchini, F.; Comanducci, L.; Perego, G.; Antonacci, F. PAGURI: A user experience study of creative interaction with text-to-music models. arXiv 2024, arXiv:2407.04333. [Google Scholar]
- Suno. Available online: https://suno.com/ (accessed on 12 September 2024).
- Udio | AI Music Generator—Official Website. Available online: https://www.udio.com/ (accessed on 12 September 2024).
- Feffer, M.; Lipton, Z.C.; Donahue, C. DeepDrake ft. BTS-GAN and TayloRVC: An Exploratory Analysis of Musical Deepfakes and Hosting Platforms. In Proceedings of the HCMIR@ISMIR, Milan, Italy, 5–9 November 2023. [Google Scholar]
- Sha, Z.; Li, Z.; Yu, N.; Zhang, Y. De-fake: Detection and attribution of fake images generated by text-to-image generation models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, Copenhagen, Denmark, 26–30 November 2023; pp. 3418–3432. [Google Scholar]
- Yu, N.; Davis, L.; Fritz, M. Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Corvi, R.; Cozzolino, D.; Zingarini, G.; Poggi, G.; Nagano, K.; Verdoliva, L. On the detection of synthetic images generated by diffusion models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Abady, L.; Wang, J.; Tondi, B.; Barni, M. A siamese-based verification system for open-set architecture attribution of synthetic images. Pattern Recognit. Lett. 2024, 180, 75–81. [Google Scholar] [CrossRef]
- Wißmann, A.; Zeiler, S.; Nickel, R.M.; Kolossa, D. Whodunit: Detection and Attribution of Synthetic Images by Leveraging Model-specific Fingerprints. In Proceedings of the ACM International Workshop on Multimedia AI against Disinformation (MAD), Phuket, Thailand, 10–14 June 2024. [Google Scholar]
- Mandelli, S.; Bestagini, P.; Verdoliva, L.; Tubaro, S. Facing Device Attribution Problem for Stabilized Video Sequences. IEEE Trans. Inf. Forensics Secur. 2019, 15, 14–27. [Google Scholar] [CrossRef]
- Wu, H.; Tseng, Y.; Lee, H.y. CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems. In Proceedings of the Interspeech, Kos Island, Greece, 1–5 September 2024. [Google Scholar]
- Salvi, D.; Bestagini, P.; Tubaro, S. Exploring the Synthetic Speech Attribution Problem Through Data-Driven Detectors. In Proceedings of the IEEE International Workshop on Information Forensics and Security (WIFS), Shanghai, China, 12–16 December 2022. [Google Scholar]
- Bhagtani, K.; Bartusiak, E.R.; Yadav, A.K.S.; Bestagini, P.; Delp, E.J. Synthesized Speech Attribution Using The Patchout Spectrogram Attribution Transformer. In Proceedings of the ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), Chicago, IL, USA, 28–30 June 2023. [Google Scholar]
- Zang, Y.; Zhang, Y.; Heydari, M.; Duan, Z. Singfake: Singing voice deepfake detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12156–12160. [Google Scholar]
- Xie, Y.; Zhou, J.; Lu, X.; Jiang, Z.; Yang, Y.; Cheng, H.; Ye, L. FSD: An initial Chinese dataset for fake song detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 4605–4609. [Google Scholar]
- Chen, X.; Wu, H.; Jang, J.S.R.; Lee, H.y. Singing Voice Graph Modeling for SingFake Detection. In Proceedings of the Interspeech, Kos Island, Greece, 1–5 September 2024. [Google Scholar]
- Desblancs, D.; Meseguer-Brocal, G.; Hennequin, R.; Moussallam, M. From Real to Cloned Singer Identification. In Proceedings of the 25th International Society for Music Information Retrieval Conference, San Francisco, CA, USA, 10–14 November 2024. [Google Scholar]
- Guragain, A.; Liu, T.; Pan, Z.; Sailor, H.B.; Wang, Q. Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024. In Proceedings of the 2024 IEEE Spoken Language Technology Workshop, Macao, China, 2–5 December 2024. [Google Scholar]
- Yizhi, L.; Yuan, R.; Zhang, G.; Ma, Y.; Chen, X.; Yin, H.; Xiao, C.; Lin, C.; Ragni, A.; Benetos, E.; et al. MERT: Acoustic music understanding model with large-scale self-supervised training. In Proceedings of The Twelfth International Conference on Learning Representations, Singapore, 24–28 April 2023. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Li, Y.; Milling, M.; Specia, L.; Schuller, B.W. From Audio Deepfake Detection to AI-Generated Music Detection–A Pathway and Overview. arXiv 2024, arXiv:2412.00571. [Google Scholar]
- Afchar, D.; Meseguer-Brocal, G.; Hennequin, R. AI-Generated Music Detection and its Challenges. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kothaguda, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Wei, Z.; Ye, D.; Deng, J.; Lin, Y. From voices to beats: Enhancing music deepfake detection by identifying forgeries in background. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kothaguda, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Tak, H.; Patino, J.; Todisco, M.; Nautsch, A.; Evans, N.; Larcher, A. End-to-end anti-spoofing with rawnet2. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
- Jung, J.w.; Heo, H.S.; Tak, H.; Shim, H.j.; Chung, J.S.; Lee, B.J.; Yu, H.J.; Evans, N. AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022. [Google Scholar]
- Gemmeke, J.F.; Ellis, D.P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar]
- Manco, I.; Weck, B.; Doh, S.; Won, M.; Zhang, Y.; Bogdanov, D.; Wu, Y.; Chen, K.; Tovstogan, P.; Benetos, E.; et al. The Song Describer Dataset: A Corpus of Audio Captions for Music-and-Language Evaluation. In Proceedings of the Machine Learning for Audio Workshop at NeurIPS, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Roy, A.; Liu, R.; Lu, T.; Herremans, D. JamendoMaxCaps: A Large-Scale Music-Caption Dataset with Imputed Metadata. arXiv 2025, arXiv:2502.07461. [Google Scholar]
- Chu, Y.; Xu, J.; Yang, Q.; Wei, H.; Wei, X.; Guo, Z.; Leng, Y.; Lv, Y.; He, J.; Lin, J.; et al. Qwen2-audio technical report. arXiv 2024, arXiv:2407.10759. [Google Scholar]
- Evans, Z.; Parker, J.D.; Carr, C.; Zukowski, Z.; Taylor, J.; Pons, J. Stable Audio Open. arXiv 2024, arXiv:2407.14358. [Google Scholar]
- Civit, M.; Drai-Zerbib, V.; Lizcano, D.; Escalona, M.J. SunoCaps: A novel dataset of text-prompt based AI-generated music with emotion annotations. Data Brief 2024, 55, 110743. [Google Scholar] [CrossRef] [PubMed]
- Rahman, M.A.; Hakim, Z.I.A.; Sarker, N.H.; Paul, B.; Fattah, S.A. SONICS: Synthetic Or Not—Identifying Counterfeit Songs. In Proceedings of the Thirteenth International Conference on Learning Representations, Las Vegas, NV, USA, 11–13 August 2025. [Google Scholar]
- Li, Y.; Sun, Q.; Li, H.; Specia, L.; Schuller, B.W. Detecting Machine-Generated Music with Explainability–A Challenge and Early Benchmarks. arXiv 2024, arXiv:2412.13421. [Google Scholar]
- Li, Y.; Li, H.; Specia, L.; Schuller, B.W. M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases. arXiv 2024, arXiv:2412.06001. [Google Scholar]
- Kim, C.D.; Kim, B.; Lee, H.; Kim, G. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 119–132. [Google Scholar]
- Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033. [Google Scholar]
- Wu, Y.; Chen, K.; Zhang, T.; Hui, Y.; Berg-Kirkpatrick, T.; Dubnov, S. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Ialysos, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Huang, P.Y.; Xu, H.; Li, J.; Baevski, A.; Auli, M.; Galuba, W.; Metze, F.; Feichtenhofer, C. Masked autoencoders that listen. Adv. Neural Inf. Process. Syst. 2022, 35, 28708–28720. [Google Scholar]
- Evans, Z.; Parker, J.D.; Carr, C.; Zukowski, Z.; Taylor, J.; Pons, J. Long-form music generation with latent diffusion. arXiv 2024, arXiv:2404.10301. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Dai, W.; Dai, C.; Qu, S.; Li, J.; Das, S. Very deep convolutional neural networks for raw waveforms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 421–425. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Hendrycks, D.; Gimpel, K. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Kelleher, J.D.; Mac Namee, B.; D’arcy, A. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies; MIT Press: Cambridge, MA, USA, 2020. [Google Scholar]
- Sridhar, S.; Cartwright, M. Multi-Label Open-Set Audio Classification. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), Tampere, Finland, 20–22 September 2023; pp. 171–175. [Google Scholar]
- You, J.; Wu, W.; Lee, J. Open set classification of sound event. Sci. Rep. 2024, 14, 1282. [Google Scholar] [CrossRef] [PubMed]
| Model | Balanced Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| M5 | 0.90 | 0.90 | 0.90 | 0.90 |
| RawNet2 | 0.88 | 0.89 | 0.88 | 0.88 |
| ResNet18 + Spec | | | | |
| Model | Balanced Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| M5 | 0.76 | 0.76 | 0.76 | 0.75 |
| RawNet2 | 0.75 | 0.75 | 0.75 | 0.74 |
| ResNet18 + Spec | 0.85 | 0.78 | 0.85 | 0.80 |
| Model | Balanced Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| M5 | 0.42 | 0.67 | 0.42 | 0.48 |
| RawNet2 | 0.47 | 0.80 | 0.47 | 0.52 |
| ResNet18 + Spec | 0.48 | 0.80 | 0.48 | 0.56 |
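For reference, the metrics reported in the tables above can be reproduced with scikit-learn as sketched below; macro averaging over the generator classes is assumed here, and the label arrays are synthetic examples rather than actual dataset outputs. Note that balanced accuracy equals macro-averaged recall, which is why those two columns coincide in the tables.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, precision_score,
                             recall_score, f1_score)

# Illustrative metric computation, assuming macro averaging over classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # ground-truth generator labels
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3])   # classifier decisions

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("precision (macro):", precision_score(y_true, y_pred, average="macro"))
print("recall (macro):   ", recall_score(y_true, y_pred, average="macro"))
print("F1 score (macro): ", f1_score(y_true, y_pred, average="macro"))
```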
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).