Spectrogram Dataset of Korean Smartphone Audio Files Forged Using the “Mix Paste” Command

Abstract: This study focuses on the field of voice forgery detection, which is increasing in importance owing to the introduction of advanced voice editing technologies and the proliferation of smartphones. This study introduces a unique dataset that was built specifically to identify forgeries created using the “Mix Paste” technique. This editing technique can overlay audio segments from similar or different environments without creating a new timeframe, making it nearly infeasible to detect forgeries using traditional methods. The dataset consists of 4665 and 45,672 spectrogram images from 1555 original audio files and 15,224 forged audio files, respectively. The original audio was recorded using iPhone and Samsung Galaxy smartphones to ensure a realistic sampling environment. The forged files were created from these recordings and subsequently converted into spectrograms. The dataset also provides the metadata of the original voice files, offering additional context and information that can be used for analysis and detection. This dataset not only fills a gap in existing research but also provides valuable support for developing more efficient deep learning models for voice forgery detection. By addressing the “Mix Paste” technique, the dataset caters to a critical need in voice authentication and forensics, potentially contributing to enhancing security in society. Dataset: https://drive.google.com/drive/folders/10cBCvQTF-XqCfdQuUU4y_ssrbi3hUJkw (accessed on 19 November 2023). Dataset License


Summary
With the rapid advancement and penetration of smartphones, voice recording using these devices has become common. Additionally, the importance of audio authentication and voice recording forensics is increasing owing to developments in voice file-editing software [1]. Recently, voice processing using Deep Voice 3 software and deepfakes has introduced significant challenges related to the integrity and authenticity of digital evidence [2,3]. These forged audio files cannot be detected by the human ear, necessitating the development of powerful voice forgery detection technology [4].
Editing techniques for voice files can vary from audio enhancement to pitch manipulation; however, the basic editing operations used in voice file forgery, namely deletion, insertion, and copy-move, are standard. Among these, copy-move forgeries are difficult to detect because the forged segments originate from the same audio file [5]. Recently, with the popularity of audio editing software such as Adobe Audition CC, audio file content can be easily edited in various ways [4,6]. In addition to basic editing functions, this software provides the "Mix Paste" command, which selects the desired part of the current audio file, copies it, and combines it with existing audio by pasting. Using this command, it is possible to overlap audio or fill an empty space without creating a new timeframe [7], making it difficult for existing detection methods to detect forged content. Moreover, using this command, voice clips created with Deep Voice software can be easily synthesized into voice files recorded in a physical environment. Recently, research on detecting voice-forged files created through editing techniques such as splicing and copy-move [5,8-10] and voice-forged files created through Deep Voice software and audio deepfakes [11-13] has been actively conducted using deep learning models, which mainly convert voice signals into spectrograms and use them as a dataset.
The purpose of this study was to construct a spectrogram dataset of forged Korean voice files recorded on smartphones to develop a deep learning model for detecting voice files that were forged using the "Mix Paste" command. This dataset consists of spectrogram images converted from original audio files and spectrogram images converted from forged audio files edited with "Mix Paste." Furthermore, this dataset provides metadata about the original voice file and the location of the section where this command was applied. The original recordings used here were obtained with the consent of the speakers. However, the dataset was constructed from high-resolution spectrogram images to support enhanced privacy protection and to provide easier and faster access for developing deep learning models for voice forgery detection. In other words, the training time for developing deep learning models can be reduced, and these models can even be run on low-performance personal computers.
To build this dataset, four speakers were recorded using iPhone and Samsung Galaxy smartphones, and forged files were created from these recordings. Additionally, the metadata of the original voice files were extracted, and the forged section of each forged voice file was recorded. After each forged voice file was encoded into the same file format as the original file, the original and forged voice files were converted into spectrograms and saved as image files.
Currently, there are datasets for detecting audio deepfakes. However, datasets for detecting audio files that have been forged or altered by editing are rare. Existing audio deepfake detection datasets include Automatic Speaker Verification spoof (ASVspoof) 2021 [14], WaveFake [15], and In-the-Wild Audio Deepfake Data [16]. ASVspoof 2021 is a representative audio deepfake dataset composed of three tasks: "logical access", "physical access", and "speech deepfake" [17]. Additionally, there is a Chinese dataset called the Yuan Ze Mandarin Dataset, which focuses on detecting forgeries produced through editing. This dataset was constructed by manually applying deletion and splicing to identical sentences using Audacity, an audio editing application [18].
To the best of our knowledge, there are no voice datasets that have been edited using "Mix Paste." However, this dataset has limitations in that it is a Korean dataset, and the raw data were recorded by only four speakers. Nevertheless, compared to classical datasets edited from a speech corpus, this dataset is a real forged dataset that was directly recorded and edited by humans, re-encoded in a similar way to the original, and provides information such as the forged sections and the bona fide sections. In these respects, this dataset is valuable.
The proposed dataset can be used to derive insights through data analysis that will be useful in detecting voice forgeries and developing detection algorithms. Additionally, this dataset can be used in various machine-learning tasks, such as classification to determine the type of recording device that was used. Furthermore, we believe that releasing this dataset will contribute to the advancement of deep learning technologies used for detecting forgeries in voice files.

Data Description
The constructed spectrogram dataset contained 1555 original voice files and 15,224 forged voice files that had been edited with "Mix Paste." The original and forged voice files were converted into log, linear, and Mel spectrograms; there were 4665 and 45,672 spectrogram images from the 1555 original voice files and the 15,224 forged voice files, respectively. The spectrogram images were saved at a high resolution of 4608 × 3456 (linear/log) and 4320 × 2880 (Mel).
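The image counts follow directly from the file counts, since each audio file yields one log, one linear, and one Mel spectrogram. A minimal sanity check, with the helper name `expected_image_count` chosen here purely for illustration, might look like this:

```python
# Each audio file is rendered as three spectrogram types (from the text above).
SPECTROGRAM_TYPES = ("log", "linear", "mel")


def expected_image_count(num_audio_files: int) -> int:
    """Number of spectrogram images expected for a given number of audio files."""
    return num_audio_files * len(SPECTROGRAM_TYPES)


original_images = expected_image_count(1555)   # original voice files
forged_images = expected_image_count(15224)    # forged voice files
print(original_images, forged_images)
```

A check of this kind can be run after downloading the dataset to confirm that no images are missing from a local copy.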

Original Audio
The original audio was recorded as 933 and 622 files on iPhone and Samsung Galaxy smartphones, respectively, with a total recording time of 122,378 s. The 1555 original files had minimum, maximum, and average lengths of 80, 100, and 90 s, respectively. Each voice file contained 10-15 utterances, with an average of 13 utterances. Additionally, metadata extracted from the original voice files were provided to increase the usability of this dataset.

Forged Audio
A total of 15,224 voice files were forged with "Mix Paste" from the original voice recordings, with minimum, maximum, and average lengths of 66, 93, and 78 s, respectively. Information on the section to which this command had been applied and the section of the source that had been used was also provided. Table 1 shows the specifications of the dataset.

Methods
The process that was employed to construct the forged voice dataset can be divided into four steps: (1) voice recording, (2) editing using "Mix Paste", (3) encoding, and (4) data preprocessing. Through these steps, forged voice files were created, and the original and forged audio were converted into spectrogram images and saved. Figure 1 illustrates the dataset construction process that we used.

Voice Recording
To build this dataset, four speakers (two men and two women) were recorded using the following smartphones: Apple iPhone 14 Pro Max, Apple iPhone 13 Mini, Samsung Galaxy Note 20, and Samsung Galaxy S23+. The recording software was the software built into each smartphone: Voice Memo on the iPhone and Voice Recorder on the Galaxy. Files recorded in this software have the same sampling rate of 48,000 Hz. Figure 2 shows the proportion of each recording device. Various feature points were extracted from different Korean pronunciations. Figure 3 shows the distribution of the plain, aspirated, sibilant, and tense Korean consonants in the recording script.

Voice File Editing
The "Mix Paste" editing function is available in Adobe Audition, which is the most widely used voice editing tool in Korea. One of the speech sections was selected and copied from the original voice file. It was then pasted, with the "Overlap" option set, into an empty space without speech (Figure 4). We marked the original and pasted regions with "markers" and then saved the file in WAV format. At this time, the sample positions of each marker's start time and duration were stored in the WAV file by Adobe's Extensible Metadata Platform (XMP). These sample positions were extracted from the metadata of the WAV files, and the start time and duration of each of the original and pasted areas were saved in a CSV file. Each forged voice file was created by pasting one of the speech sections of the original voice file, and several edited files were created from each original voice file, so the number of forged voice files exceeds that of the original voice files.
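The effect of an overlapping paste can be illustrated with a small sketch. This is not Adobe Audition's implementation; it assumes samples are floats in [-1, 1], that the overlay is a plain sample-wise sum with clipping, and that marker positions are given in samples. The function names `mix_paste` and `marker_to_label` are hypothetical.

```python
from typing import Dict, List


def mix_paste(samples: List[float], src_start: int, length: int,
              dst_start: int) -> List[float]:
    """Overlay samples[src_start:src_start+length] onto the region at dst_start.

    The output has the same number of samples as the input, i.e. no new
    timeframe is created (assumed behavior of an overlapping paste).
    """
    if src_start < 0 or dst_start < 0 or length <= 0:
        raise ValueError("positions must be non-negative and length positive")
    if src_start + length > len(samples) or dst_start + length > len(samples):
        raise ValueError("source or destination region exceeds the signal")
    out = list(samples)
    for i in range(length):
        mixed = out[dst_start + i] + samples[src_start + i]
        out[dst_start + i] = max(-1.0, min(1.0, mixed))  # clip to valid range
    return out


def marker_to_label(src_start: int, dst_start: int, length: int,
                    sample_rate: int) -> Dict[str, float]:
    """Convert marker sample positions into start times and a duration in seconds."""
    return {
        "real_start": src_start / sample_rate,
        "fake_start": dst_start / sample_rate,
        "duration": length / sample_rate,
    }


# Toy signal: three "speech" samples followed by three silent samples.
signal = [0.5, -0.5, 0.25, 0.0, 0.0, 0.0]
forged = mix_paste(signal, src_start=0, length=3, dst_start=3)
label = marker_to_label(48_000, 144_000, 96_000, sample_rate=48_000)
print(len(forged) == len(signal), label)
```

Because the pasted segment is summed into a silent region, the forged signal keeps its original length, which is the property that makes this kind of edit hard to spot from duration alone.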

Encoding Edited Voice Files
The third step was encoding the edited data. The file format, metadata, and structure of the edited data needed to match those of the original [19]. Therefore, the WAV format needed to be encoded in M4A format. The encoder needed to encode the data of the file such that it would contain a file structure and metadata that closely resembled those of the original file. Consider a case where a file recorded on an iPhone was encoded by selecting "Good Quality" in the Advanced Audio Coding encoder provided by iTunes; it had a sample rate of 48,000 Hz, which was the same as that of the original. With the exception of some metadata, the encoded file could be made to almost resemble the original. Moreover, because the metadata and file structure could be changed to resemble those of the original file using a hex editor, forged content could not be detected simply on the basis of metadata or file structure. Figure 5 illustrates the settings employed for encoding using iTunes.

Samsung Galaxy smartphones do not require separate encoders. Using a convolutional neural network model, we confirmed that, irrespective of the encoder used for the original file, the spectrogram was unaffected. Therefore, for files recorded on Galaxy smartphones, encoding was performed at the same sample rate as that of the original, namely 44,100 Hz for the Galaxy Note 20 and 48,000 Hz for the Galaxy S23+, using an online encoding site [20].

Data Preprocessing
Finally, the original and forged voice files were converted to spectrogram images using the "librosa" module [21] in Python and saved as PNG files. The iPhone utilized for this data collection has a frequency bandwidth of approximately 24,000 Hz and a cutoff frequency of approximately 16,000 Hz. On the other hand, Samsung Galaxy smartphones have a frequency bandwidth of approximately 23,000 Hz and a cutoff frequency of approximately 20,000 Hz. By calculating the ratio of this frequency information to the height of the spectrogram image and the ratio of the time information of the bona fide and forged sections to the width of the spectrogram image, the bona fide and forged sections could be accurately labeled as bounding boxes (Figure 6). As the forged segments were obtained from the same audio file, similar to what is carried out with the copy-move technique, the bona fide and forged segments coexist in a single file.
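The ratio-based labeling described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes a linear-frequency spectrogram that fills the whole image with no margins, with time running left to right and frequency running from 0 Hz at the bottom to the full bandwidth at the top. The function names and the rendering parameters (`dpi`, default STFT settings) are assumptions; the librosa and matplotlib imports are kept inside the rendering function so that the bounding-box helper runs without them.

```python
from typing import Tuple


def section_to_bbox(start_s: float, duration_s: float, total_s: float,
                    cutoff_hz: float, bandwidth_hz: float,
                    width_px: int, height_px: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels, with the origin at the top-left.

    Time is scaled by width / total duration; frequency is scaled by
    height / bandwidth, and the box spans 0 Hz up to the cutoff frequency.
    """
    if total_s <= 0 or bandwidth_hz <= 0:
        raise ValueError("total duration and bandwidth must be positive")
    left = round(start_s * width_px / total_s)
    right = round((start_s + duration_s) * width_px / total_s)
    top = height_px - round(cutoff_hz * height_px / bandwidth_hz)
    return left, top, right, height_px


def render_linear_spectrogram(audio_path: str, png_path: str,
                              width_px: int = 4608, height_px: int = 3456,
                              dpi: int = 384) -> None:
    """Save a margin-free linear spectrogram as a PNG (sketch, assumed settings)."""
    import numpy as np
    import librosa
    import librosa.display
    import matplotlib
    matplotlib.use("Agg")  # render off-screen
    import matplotlib.pyplot as plt

    y, sr = librosa.load(audio_path, sr=None)  # keep the native sample rate
    db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    fig = plt.figure(figsize=(width_px / dpi, height_px / dpi), dpi=dpi)
    ax = fig.add_axes([0, 0, 1, 1])  # axes cover the entire image
    librosa.display.specshow(db, sr=sr, x_axis="time", y_axis="linear", ax=ax)
    ax.set_axis_off()
    fig.savefig(png_path, dpi=dpi)
    plt.close(fig)


# Hypothetical forged section: starts at 30 s, lasts 3 s, in a 90 s iPhone file.
bbox = section_to_bbox(30.0, 3.0, 90.0, cutoff_hz=16_000, bandwidth_hz=24_000,
                       width_px=4608, height_px=3456)
print(bbox)
```

If the published images include axes or padding, the same ratios would need to be applied to the plotted area rather than to the full image size.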

Dataset Verification
To demonstrate the applicability of the proposed dataset, we performed validation by building a deep learning model for speaker identification during speech recognition. Speaker identification is commonly performed using a convolutional neural network (CNN) applied to Mel spectrograms [22]. To do this effectively, we applied transfer learning with the pre-trained VGG19 CNN [23]. Figure 7 shows the VGG19-based transfer learning model.

The proposed dataset was recorded by four speakers, and this dataset included 1555 Mel spectrogram images from the original audio. Table 2 shows the composition of the dataset according to classes. An experiment was performed by dividing these Mel spectrogram images into a 70% training set, a 10% validation set, and a 20% test set. The transfer learning model was evaluated using accuracy, precision, recall, and F1 scores. Table 3 shows the classification evaluation metrics obtained on the test dataset. Various further experiments may be necessary, but as shown in Table 3, the overall performance meets expectations. Therefore, we consider that the proposed dataset satisfies the qualitative aspects.

The "Metadata" folder contained the metadata of the audio file and the audio file structure in JSON format, as well as the gender of the speaker, type of recording device, and operating system information collected during the recording process. The file name of each JSON file was the same as that of the original spectrogram image in the spectrogram folder. The label.csv file contained information regarding the forged section of each forged audio file. The "fake_start" and "duration" columns indicated the start time and length of the forged section. The bona fide section that was used as the source of the forgery was marked with a "real_start" column. Because the duration of both sections is the same, only one "duration" column is provided. Furthermore, the forged section was converted into text using Whisper, which was developed by OpenAI to handle speech-to-text (STT) tasks [24]; the "text" column indicated the STT results.
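As an illustration of how the label file might be consumed, the sketch below parses rows with the "real_start", "fake_start", "duration", and "text" columns named above. The `file` column, the sample row, and the assumption that times are expressed in seconds are hypothetical; the actual layout of label.csv should be checked against the downloaded dataset.

```python
import csv
import io
from typing import Dict, List

# Hypothetical excerpt; the "file" column and all values are illustrative only.
SAMPLE_LABEL_CSV = """file,real_start,fake_start,duration,text
forged_0001,12.50,41.00,2.75,hypothetical transcript
"""


def load_labels(csv_text: str) -> List[Dict[str, object]]:
    """Parse label rows into (start, end) intervals for both sections."""
    labels = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        real_start = float(row["real_start"])
        fake_start = float(row["fake_start"])
        duration = float(row["duration"])  # shared by both sections
        labels.append({
            "file": row["file"],
            "bona_fide": (real_start, real_start + duration),
            "forged": (fake_start, fake_start + duration),
            "text": row["text"],
        })
    return labels


labels = load_labels(SAMPLE_LABEL_CSV)
print(labels[0]["bona_fide"], labels[0]["forged"])
```

Because a single duration applies to both sections, the end of each interval is derived rather than stored, mirroring the single "duration" column in the label file.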

Figure 2. Proportions of the recording devices used in the study.

Figure 3. Distribution of the plain, aspirated, sibilant, and tense consonants in the recording script used in the study.

Figure 5. Settings for encoding in iTunes.

Figure 6. Bona fide bounding box and forged bounding box on linear spectrogram.

Figure 7. VGG19-based transfer learning model.

Figure 8. Structure of the proposed dataset.

Table 1. Specifications of the dataset.

Table 2. Composition of the dataset used in the experiment.

Table 3. Evaluation metrics for speaker identification.
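For readers who wish to reproduce the evaluation protocol used in the dataset verification (a 70%/10%/20% split scored with accuracy, precision, recall, and F1), a minimal sketch is given below. The averaging scheme is not stated in the text, so macro averaging over speakers is an assumption here, as are the function names, the random seed, and the toy labels.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices 0..n-1 and split them 70% / 10% / 20%."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 70 // 100
    n_val = n * 10 // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def macro_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / len(classes),
        "recall": sum(recalls) / len(classes),
        "f1": sum(f1s) / len(classes),
    }


train, val, test = split_indices(1555)
# Toy speaker labels, unrelated to the reported results.
metrics = macro_metrics(["A", "A", "B", "B"], ["A", "B", "B", "B"])
print(len(train), len(val), len(test), metrics)
```

In practice, a stratified split per speaker may be preferable so that each subset preserves the class proportions of the full set.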