Melody Extraction and Encoding Method for Generating Healthcare Music Automatically

Abstract: The strong relationship between music and health has helped prove that soft and peaceful classical music can significantly reduce people’s stress; however, it is difficult to identify and collect examples of such music to build a library. Therefore, a system is required that can automatically generate similar classical music selections from a small amount of input music. Melody is the main element that reflects the rhythms and emotions of musical works; therefore, most automatic music generation research is based on melody. Given that melody varies frequently within musical bars, the latter are used as the basic units of composition. As such, there is a requirement for melody extraction techniques and bar-based encoding methods for automatic generation of bar-based music using melodies. This paper proposes a method that handles melody track extraction and bar encoding. First, the melody track is extracted using a pitch-based term frequency–inverse document frequency (TFIDF) algorithm and a feature-based filter. Subsequently, four specific features of the notes within a bar are encoded into a fixed-size matrix during bar encoding. We conduct experiments to determine the accuracy of track extraction based on verification data obtained with the TFIDF algorithm and the filter; an accuracy of 94.7% was calculated based on whether the extracted track was a melody track. The estimated value demonstrates that the proposed method can accurately extract melody tracks. This paper discusses methods for automatically extracting melody tracks from MIDI files and encoding based on bars. The possibility of generating music through deep learning neural networks is facilitated by the methods we examine within this work. To help the neural networks generate higher quality music, which is good for human health, the data preprocessing methods contained herein should be improved in future works.


Introduction
Many studies that have analyzed the relationship between music and health found that music positively affects human health [1][2][3][4][5]. Music has been shown to reduce human stress levels and has a considerable positive effect on psychological state [6][7][8][9]. Although serene classical music is known to be beneficial during patient treatment, it is time-consuming to collect specific classical music pieces that have been shown to positively affect health; therefore, there is a need for a system that automatically generates beneficial musical pieces. Recently, many researchers have begun to explore the possibilities of deep learning techniques in the field of artistic creation. Deep learning has achieved notable results in search engines, data mining, machine translation, natural language processing, computer vision, and other related fields. Deep learning algorithms support the interpretation of data such as text, images, and sounds by learning the rules in the sample data. The goal is to enable machines to analyze text [10], images [11], and sounds [12] as humans do. The achievements
Additionally, the generator can be used to modulate the instruments and style of music. Instruments whose sounds are beneficial to health can be soloed or featured during playback of the generated music. Audience-based evaluation can be leveraged to assess the quality of the generated music [27]. However, in order to evaluate automatically, we utilized the pitch structure to determine whether the generated music and the music within the training dataset are sufficiently similar.
Our contributions are as follows:
• Automatic melody extraction: In automatic music generation research based on MIDI files, a melody track is required as the learning data when generating the melody. This research utilizes the TFIDF algorithm and feature-based filters to automatically extract melody tracks from MIDI files.
• Bar encoding method for more features: Bar encoding encodes the start time, duration, pitch, and intensity of the notes contained in the bar into the bar matrix. The bar matrix can accurately represent the detailed features of each note.
• Improved preprocessing for music generation based on deep learning: Automatic melody extraction and bar encoding can enable the deep learning neural network to learn the composition method better.
The remainder of this paper is organized as follows. Section 2 introduces studies on deep learning-based music generation. Section 3 describes melody track extraction and the bar encoding methods. Section 4 describes the content and results of the experiments. Section 5 contains our concluding remarks.

Related Literature
Typical data representations used during the design and training of automatic music generation algorithms are signal, transformed signal (e.g., spectrum, via Fourier transformation), piano roll, MIDI, text, etc. Most recent research on deep learning methods for music generation uses only the melody track contained within MIDI files as training data [25,28]. However, there are other, non-melody tracks within MIDI files. Traditional melody extraction research includes Skyline [29], Best Channel [30], Top Channel [31], Entropy Channel [32], Entropy Part [33], and Revised Skyline [34]. The most common method of extracting melody tracks from MIDI files is the Skyline algorithm [26], which combines all notes into a single track and selects the highest pitch sounds (typically a representative feature of the melody portions of a musical piece). However, there are three disadvantages with the Skyline algorithm. First, the hold time of a note can change if a new note with a higher pitch appears, meaning that the playing of the first note ceases and the new note is played instead. Consequently, the extracted melody track may be different from the true melody of the musical composition. Second, notes of low intensity in a melody track may be erased. Skyline automatically finds and fills all time periods with the highest previously played pitch if no sounds are played for a certain period, thereby further altering the melody. Third, there are many compositions in which the melody contains pitches lower than those of other tracks. Therefore, the Skyline algorithm cannot derive melody tracks accurately.
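The first and third disadvantages can be illustrated with a small sketch. The code below is a simplified Skyline variant written for illustration only, not the implementation from [26] or [29]: it keeps the highest pitch at each onset and cuts a kept note short when the next kept note begins. The note layout `(start_tick, duration_ticks, pitch)` and the function name `skyline` are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical note layout: (start_tick, duration_ticks, pitch).
Note = Tuple[int, int, int]


def skyline(notes: List[Note]) -> List[Note]:
    """Simplified Skyline: highest pitch per onset, truncated at the next onset."""
    top: Dict[int, Note] = {}
    for start, dur, pitch in notes:
        # Keep only the highest pitch among notes that start together.
        if start not in top or pitch > top[start][2]:
            top[start] = (start, dur, pitch)
    kept = [top[s] for s in sorted(top)]
    out: List[Note] = []
    for idx, (start, dur, pitch) in enumerate(kept):
        if idx + 1 < len(kept):
            # A kept note stops sounding once the next kept note begins.
            dur = min(dur, kept[idx + 1][0] - start)
        out.append((start, dur, pitch))
    return out


# Melody in a lower register, accompaniment above it.
melody = [(0, 480, 60), (480, 480, 62)]
accomp = [(0, 240, 72), (240, 240, 76), (480, 480, 74)]
result = skyline(melody + accomp)

# A held note is shortened when a later note starts.
shortened = skyline([(0, 480, 60), (240, 240, 72)])
```

In this toy input, `result` contains only the accompaniment notes, and `shortened` shows the first note losing half of its hold time, which matches the behavior criticized above.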
Madsen et al. [35] proposed a method for measuring musical complexity and applied it to a melody extraction model. The music complexity model did not utilize musical expertise. Tang et al. [36] assumed that there is only one melody track in music, and [37] proved that this assumption is universal. Five methods were proposed for extracting the melody track. The first three methods, AvgVel, PMRatio, and SilenceRatio, rank the tracks by their probability of being the melody by analyzing the track features. The other two methods, TrackName and Range, directly divide the tracks into melody tracks and non-melody tracks. Of the 16,012 collected MIDI files, 247 melody tracks were extracted by TrackName. TrackName is the most efficient of the five methods, but only a small number of MIDI files can be sorted by TrackName. Rizo et al. [38] proposed a feature-based melody filtering method for determining a melody track by analyzing features important to the melody track, differentiating it from other tracks. Their method uses track information, pitch, intensity, lasting time, and total number of notes as track features [39]. However, this method cannot be applied to large volumes of data as the labeling of features is performed manually. Herein, we apply the TFIDF algorithm to the traditional feature-based filtering method. By first using the pitch frequency-based TFIDF algorithm, we can remove most of the non-melody tracks both quickly and automatically.
Most methods of melody extraction only focus on the temporal continuity of the frequency, and seldom consider the temporal correlations between the frequency and amplitude, resulting in the poor extractions of the melody by many algorithms. Consequently, we propose a new melody extraction method [40] based on the frequency-amplitude correlations.
Within the literature on automatic music generation using recurrent neural networks, different data representation and encoding methods are used. Table 1 describes the differences between the data representation and encoding methods used by previous studies and by the proposed method. Mogren [25] utilizes MIDI files as training data. The pitch, start time, hold time, and intensity of the note are all accurately represented. As the representation contains a note end event, it addresses this oversight of the Piano-roll representation. However, a note-based music generation method is poorly suited to generating music [42], and its performance is limited compared with bar encoding. In MidiNet [28], MIDI files and a bar-based encoding method are utilized; however, it is not trivial to distinguish between a long note and two short repeating notes (i.e., consecutive notes with the same pitch). MuseGAN [41] utilizes data represented in the Piano-roll [43] format as its training data. Piano-roll is one of the most commonly used data representations and can represent both the pitch and hold time of a sound; however, it has several limitations. First, note termination information does not exist. Therefore, it is impossible to distinguish between a long note and two short and equal notes. Second, Piano-roll cannot represent the intensity of a note.
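The first Piano-roll limitation can be checked with a minimal sketch. The binary roll below is an illustrative assumption about the format (one row per pitch, one column per time step, 1 while a pitch sounds); the helper name `piano_roll` and the note layout `(start, duration, pitch)` are hypothetical.

```python
from typing import List, Tuple


def piano_roll(notes: List[Tuple[int, int, int]], n_steps: int,
               n_pitches: int = 128) -> List[List[int]]:
    """Binary piano roll: roll[pitch][step] is 1 while the pitch sounds."""
    roll = [[0] * n_steps for _ in range(n_pitches)]
    for start, dur, pitch in notes:
        for step in range(start, min(start + dur, n_steps)):
            roll[pitch][step] = 1
    return roll


one_long = [(0, 8, 60)]                # one note held for 8 steps
two_short = [(0, 4, 60), (4, 4, 60)]   # the same pitch struck twice
same = piano_roll(one_long, 8) == piano_roll(two_short, 8)
```

Here `same` evaluates to `True`: without note termination information the two inputs produce identical rolls, and the binary cells carry no intensity.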
In order to solve the above problems, this paper proposes two methods. First, the melody tracks extracted by TrackName are used as the comparison dataset. The difference between the non-melody track and the comparison dataset is calculated by the TFIDF algorithm. Then, the track with the minimum TFIDF value is extracted as the melody track. Second, in order to represent more features of each note, a bar-based encoding method is proposed to represent the start time, lasting time, pitch, and intensity of the note.

Music Generation Process of the Proposed Method
Our proposed music generation process method is shown in Figure 1. First, during the melody track extraction phase, the TFIDF algorithm and feature-based filtering are used to extract multiple melody tracks from the collected MIDI files. Second, during the bar encoding phase, the extracted melody tracks are encoded based on bars. As the proposed method generates MIDI files based on bars, melody tracks are segmented into multiple bars. Each divided bar is encoded based on the pitch, start time, length, and intensity of the notes within the bar. Next, data are integrated into a single melody matrix. Third, during the MIDI file generation phase, the melody matrix sets are used to train a deep learning-based music generation model. Accordingly, new MIDI files are generated.

TFIDF-Based Melody Track Extraction
The melody, chord, and drum accompaniment are stored in different tracks within the MIDI file. This section describes the methods used to extract the melody track from the MIDI file.
Typically, the TrackNameEvent attribute in the track metatags identifies the melody track. If the value is "Melody," the track is a melody track. If there is no TrackNameEvent attribute in the track, or if its value is not "Melody," it is difficult to determine whether the track is a melody track or not. Herein, we assume that there is at most one melody track in a MIDI file. The proposed method uses the TFIDF algorithm and feature-based filtering to extract the melody tracks from the MIDI files, as shown in Figure 2.
For example, $T = \{60, 70, 65, 79, 61, 62, 61, 64\}$ is a set of pitches included in one of the tracks in a MIDI file. When using the TFIDF algorithm, the TF (term frequency) value of each note in each track must be calculated. The TF value represents the number of times a note appears in the current track. For example, if pitch 60 appears once in the first track, then the TF value corresponding to this pitch is 1. The IDF (inverse document frequency) value of each note in each track then needs to be calculated. The IDF value represents the reciprocal of the number of tracks containing this note within the collected melody tracks. If the number of tracks containing this note is 1000, then the IDF value of the note is 0.001. The TFIDF value of each note in the track is determined by multiplying the TF and IDF values. Based on the example note discussed, the TFIDF value of this note is $1 \times 0.001 = 0.001$. The average of the TFIDF values of each note within the track is the TFIDF value of the track. If the current track and melody track set are similar, the TFIDF average value will be low. Therefore, the track with the lowest average value of TFIDF will be extracted as the melody track.
First, the collected MIDI file set, $M$, is classified. The total MIDI file set contains files that are both labeled and unlabeled as melodies via the TrackNameEvent attribute; the unclassified files are represented as $M^{U}$ and the classified MIDIs are referred to as $M^{M}$. The $i$th MIDI files, $m_{i}$, $m^{U}_{i}$, and $m^{M}_{i}$, obey the relationships $m_{i} \in M$, $m^{U}_{i} \in M^{U}$, and $m^{M}_{i} \in M^{M}$, respectively. If "Melody" appears in the TrackNameEvent of the tracks included in the $i$th MIDI file $m_{i}$, it is added to the MIDI file set $M^{M}$. If TrackNameEvent does not exist or if "Melody" does not appear, it is added to the unconfirmed MIDI file set, $M^{U}$.
Second, the melody tracks of the MIDI files are separated. All tracks included in each unconfirmed MIDI file $m^{U}_{i}$ in the unconfirmed MIDI file set $M^{U}$ are defined as the unconfirmed track set $T^{U}_{i}$. Accordingly, $t^{U}_{i,j}$, which is the $j$th track in the unconfirmed MIDI file $m^{U}_{i}$, is added to the unconfirmed track set $T^{U}_{i}$, so that $t^{U}_{i,j} \in T^{U}_{i}$. In the MIDI file set $M^{M}$ that includes melody, the melody track set $T^{M}$ is defined as the only melody track set. The melody tracks are extracted from the MIDI file set $M^{M}$ that includes melodies, and they are added to the melody track set $T^{M}$. Correspondingly, $t^{M}_{i} \in T^{M}$.
Third, the TFIDF value is calculated. The TFIDF algorithm is a method of calculating the difference between an element and a set [26]. It is used to (a) calculate the difference between track $t^{U}_{i,j}$ and the tracks in the melody track set $T^{M}$ and (b) to use the calculated TFIDF value to represent the track. Pitch is the degree of highness or lowness of a note. Additionally, $\theta_{MIN}$ is the minimum value in the allowable pitch range and $\theta_{MAX}$ is the maximum value in the allowable pitch range. Pitch $p$ is a pitch in the allowable pitch range and it is an integer between $\theta_{MIN}$ and $\theta_{MAX}$. Equivalently, $n^{T^{M}}_{p}$ is the number of tracks in melody track set $T^{M}$ in which pitch $p$ appears at least once. In turn, $s^{U}_{i,j}$ is the total number of notes in the individual track $t^{U}_{i,j}$. Additionally, $n^{U}_{i,j,p}$ is the frequency of pitch $p$ in the separated track $t^{U}_{i,j}$. The number of tracks in the melody track set $T^{M}$ is defined as the track quantity $\alpha$, while $f^{U}_{i,j,p}$ is the TFIDF value of pitch $p$ in the separated track $t^{U}_{i,j}$; $f^{U}_{i,j}$ is the average of the TFIDF values over the pitches $p$ and serves as the TFIDF value of the separated track $t^{U}_{i,j}$. In the proposed method, the TFIDF value of track $t^{U}_{i,j}$ is calculated based on the pitch frequency, as shown in Algorithm 1.
Algorithm 1. Term frequency-inverse document frequency (TFIDF) value calculation algorithm.
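The body of Algorithm 1 is not reproduced in this text, so the following is only a sketch that follows the worked example above: TF is the raw count of a pitch in the track, IDF is the reciprocal of the number of melody tracks containing that pitch, and the track value is the average over the distinct pitches of the track. Any normalization by $s^{U}_{i,j}$ or $\alpha$ is omitted, and the handling of pitches that never occur in $T^{M}$ (`unseen_idf`) is an assumption, as are all function names.

```python
from collections import Counter
from typing import Dict, List


def pitch_idf(melody_tracks: List[List[int]]) -> Dict[int, float]:
    """IDF per pitch: reciprocal of the number of melody tracks containing it."""
    containing: Counter = Counter()
    for track in melody_tracks:
        containing.update(set(track))  # count each track once per pitch
    return {p: 1.0 / n for p, n in containing.items()}


def track_tfidf(track: List[int], idf: Dict[int, float],
                unseen_idf: float = 1.0) -> float:
    """Average TFIDF over the distinct pitches of one track.

    unseen_idf is a hypothetical choice for pitches absent from every
    melody track; the source does not say how this case is handled.
    """
    tf = Counter(track)  # raw count of each pitch in the track
    values = [count * idf.get(p, unseen_idf) for p, count in tf.items()]
    return sum(values) / len(values) if values else float("inf")


def pick_candidate(tracks: List[List[int]], idf: Dict[int, float]) -> int:
    """Index of the track with the lowest average TFIDF value."""
    return min(range(len(tracks)), key=lambda j: track_tfidf(tracks[j], idf))


# Worked example from the text: pitch 60 occurs in 1000 melody tracks.
idf_1000 = pitch_idf([[60]] * 1000)
example_value = track_tfidf([60], idf_1000)

# Toy file with two unconfirmed tracks.
idf = pitch_idf([[60, 62, 64], [60, 62, 65], [60, 64, 67]])
unconfirmed = [[60, 62, 64, 62], [36, 36, 38, 36]]
best = pick_candidate(unconfirmed, idf)
```

With these toy inputs, `example_value` reproduces the 0.001 of the worked example, and `best` selects the first track, whose pitches overlap with the melody track set.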
Fourth, the shallow structure description [38] of each unconfirmed MIDI file melody track is used to perform filtering. The shallow structure description includes the track's maximum pitch, minimum pitch, average pitch, maximum lasting time, minimum lasting time, average lasting time, maximum intensity, minimum intensity, and average intensity. The proposed method filters the tracks using the average pitch, average lasting time, and average intensity that were derived from the dataset analysis. For the $k$th note in the melody track $t^{M}_{i}$, the pitch is defined as  (1).
The average pitch, lasting time, and intensity of each melody track are then used to determine the filtering values for each feature. In this respect, $p_{MIN}$ is defined as the pitch filtering value and is set to the minimum of the average pitches calculated in the melody tracks; $h_{MIN}$ and $h_{MAX}$ are defined as the lasting time filtering values, where $h_{MIN}$ is set to the minimum and $h_{MAX}$ to the maximum of the average lasting times calculated in the melody tracks. The $e_{MIN}$ value is defined as the filtering value for the intensity of the notes and is set to the minimum of the average intensities calculated in the melody tracks. Track $t^{U}_{i,j}$, which is found based on the lowest TFIDF value $f^{U}_{i,j}$, undergoes a three-stage filtering process. The average pitch of $t^{U}_{i,j}$ is checked to ascertain if it is less than $p_{MIN}$, and the average note lasting time of $t^{U}_{i,j}$ is checked to assess whether it is less than $h_{MIN}$ or greater than $h_{MAX}$. The average note intensity of track $t^{U}_{i,j}$ is checked to assess if it is less than $e_{MIN}$. If the average pitch, average note lasting time, or average note intensity of $t^{U}_{i,j}$ falls outside these limits, the track is deleted and the track with the next lowest TFIDF value is selected to perform the filtering process.
Fifth, the melody tracks to be used in the unconfirmed MIDI file set are determined. The smaller the TFIDF value of a track included in an unconfirmed MIDI file, the more similar the track is to the melody track set $T^{M}$. The preprocessed unconfirmed track set $T^{\prime U}$ is defined as the set of preprocessed unconfirmed tracks $t^{\prime U}_{i}$ taken from the unconfirmed MIDI files. Accordingly, $t^{\prime U}_{i} \in T^{\prime U}$. For each unconfirmed MIDI file, the track with the lowest TFIDF value is set as the preprocessed unconfirmed track $t^{\prime U}_{i}$ and added to $T^{\prime U}$.
Sixth, the melody tracks to be used during music generation are created. The melody track set $T$ is defined as the melody tracks used in music generation. The melody track set $T$ is created by combining the preprocessed unconfirmed track set $T^{\prime U}$ and the melody track set $T^{M}$. The melody-containing track set $T$ is ultimately used to generate music.
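The filtering and selection steps can be sketched as follows, under the reading that a candidate is rejected when any of the three averages falls outside the limits derived from the labeled melody tracks. The `Note` and `Limits` types, the function names, and the toy values are all hypothetical.

```python
from statistics import mean
from typing import List, NamedTuple, Optional, Sequence


class Note(NamedTuple):
    pitch: int
    lasting: int     # lasting time in ticks
    intensity: int


class Limits(NamedTuple):
    p_min: float
    h_min: float
    h_max: float
    e_min: float


def derive_limits(melody_tracks: Sequence[Sequence[Note]]) -> Limits:
    """Filtering values taken from the per-track averages of labeled melodies."""
    avg_p = [mean(n.pitch for n in t) for t in melody_tracks]
    avg_h = [mean(n.lasting for n in t) for t in melody_tracks]
    avg_e = [mean(n.intensity for n in t) for t in melody_tracks]
    return Limits(min(avg_p), min(avg_h), max(avg_h), min(avg_e))


def passes(track: Sequence[Note], lim: Limits) -> bool:
    """Three checks: average pitch, average lasting time, average intensity."""
    return (mean(n.pitch for n in track) >= lim.p_min
            and lim.h_min <= mean(n.lasting for n in track) <= lim.h_max
            and mean(n.intensity for n in track) >= lim.e_min)


def select_melody(tracks: Sequence[Sequence[Note]],
                  tfidf_values: List[float], lim: Limits) -> Optional[int]:
    """Try tracks from lowest to highest TFIDF; return the first that passes."""
    for j in sorted(range(len(tracks)), key=tfidf_values.__getitem__):
        if passes(tracks[j], lim):
            return j
    return None  # no track in this file survives the filter


labeled = [[Note(72, 240, 90), Note(74, 240, 80)],
           [Note(67, 480, 70), Note(69, 480, 74)]]
limits = derive_limits(labeled)

bass = [Note(36, 480, 100), Note(38, 480, 100)]
lead = [Note(70, 240, 80), Note(72, 480, 80)]
chosen = select_melody([bass, lead], [0.2, 0.5], limits)
```

In this toy case the bass track has the lowest TFIDF value but fails the pitch check, so the next candidate is selected.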

Bar Encoding
Encoding is used to show each bar contained in a melody track as a matrix. Traditional music generation studies have operated on insufficient musically relevant features as they perform note-based encoding. To resolve this, we encode tracks within our process based on bars, which are groups of notes. The proposed bar encoding method is shown in Figure 3.
First, the melody track is used to separate the bars. Additionally, the lasting time information of each track in the MIDI file is available. The melody track lasting time $l_{i}$ is the lasting time of the $i$th melody track, $t_{i}$, in the melody track set $T$ during the bar division process. The lasting time of each bar in the same melody track is the same, and it is defined as $l_{i,B}$, which is calculated by beat. Bars are separated into $[b_{i,1}, b_{i,2}, \ldots, b_{i,j}, \ldots, b_{i,|b_{r}|}]$ according to the bar lasting time $l_{i,B}$, as shown in Figure 4. The melody is divided into bars strictly according to the beat, which can be obtained from the MIDI file. If bars are re-segmented, the structure of the melody will remain unchanged. The notes included in the $j$th bar of the melody track are defined as $b_{i,j}$. Additionally, $v_{i,j}$ is a two-dimensional (2D) matrix that shows the pitch, start time, length, and intensity of the notes in each bar, as shown in Figure 5.
Second, the divided bars are used to perform the process of encoding the track segments within a bar matrix. When encoding is performed, a 2D matrix is used to represent the pitch of each note within a bar, as well as the start time, length, and intensity, as shown in Figure 5.
The size of the x-axis of the 2D matrix is the bar's lasting time, $l_{i,B}$, measured in Ticks, the unit of lasting time in the MIDI file. Time is quantified by counting the number of Ticks. The y-axis represents the pitch range. The size of the y-axis is determined by the maximum value, $\theta_{MAX}$, and the minimum value, $\theta_{MIN}$, of the allowable pitch range in the MIDI files. The $j$th bar of the melody track to be encoded is $b_{i,j}$. The melody matrix, $v_{i}$, is the matrix storing the results of the encoding of all the bars that are included in the melody track $t_{i}$. The melody matrix, $v_{i}$, consists of the bar matrices $[v_{i,1}, v_{i,2}, \ldots, v_{i,j}, \ldots]$. The bar matrix, $v_{i,j}$, defines the results that are encoded based on each bar $b_{i,j}$ of the melody track $t_{i}$. For $v_{i,j}$, the notes included in each bar $b_{i,j}$ are converted to a bar matrix, as shown in Figure 5. In turn, $b_{i,j,k,F}$ is defined as the pitch of the $k$th note of $b_{i,j}$, and it is a value between $\theta_{MIN}$ and $\theta_{MAX}$ according to the allowed pitch range of the MIDI file. The $b_{i,j,k,F}$ value is used to determine the position of the current note on the y-axis. The onset of the $k$th note of $b_{i,j}$ is defined as $b_{i,j,k,S}$ and is set to the start time of the $k$th note in $b_{i,j}$. The $b_{i,j,k,S}$ value is used to determine where the current note begins on the x-axis. Additionally, $b_{i,j,k,L}$, the length of the $k$th note of $b_{i,j}$, is defined as the note's duration time and is set to the time during which the $k$th note of $b_{i,j}$ is sustained. The $b_{i,j,k,L}$ value is used to determine the length of the current note on the x-axis. The intensity $b_{i,j,k,V}$ of the $k$th note of $b_{i,j}$ is defined as the volume of the sound. After the position of the current note in the bar matrix is determined, the corresponding cells are filled with $b_{i,j,k,V}$. Each bar is encoded as a matrix, as shown in Figure 5, and the empty spaces are filled with zeros.
The encoded bar matrix, v i,j , is integrated with the melody matrix v i . The set of the melody matrices is defined as the melody matrix set V. The encoded melody matrix, v i , is added to the melody matrix set V.
In the MIDI file, the start time and end time are clearly specified for each note. When using these data to encode the bar matrix, the start time and end time of the note will not be changed.
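The bar encoding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `encode_bar`, the tuple layout of a note, the row ordering (lowest pitch in row 0), and the clipping of notes at the bar line are all assumptions made for illustration.

```python
def encode_bar(notes, bar_ticks, pitch_min=0, pitch_max=127):
    """Encode the notes of one bar into a fixed-size matrix.

    notes: iterable of (pitch, onset_tick, length_ticks, intensity), where
           onset_tick is counted from the start of the bar (assumed layout).
    Returns a list of (pitch_max - pitch_min + 1) rows, each bar_ticks long.
    """
    n_rows = pitch_max - pitch_min + 1
    # Every cell starts at zero, so empty spaces remain zero.
    matrix = [[0] * bar_ticks for _ in range(n_rows)]
    for pitch, onset, length, intensity in notes:
        if not pitch_min <= pitch <= pitch_max:
            raise ValueError(f"pitch {pitch} is outside the allowed range")
        row = pitch - pitch_min                # position on the y-axis
        end = min(onset + length, bar_ticks)   # assumption: clip at the bar line
        for tick in range(onset, end):         # extent along the x-axis
            matrix[row][tick] = intensity      # fill the cells with the intensity
    return matrix


# Hypothetical bar of 8 ticks holding two notes.
bar = encode_bar([(60, 0, 4, 90), (64, 4, 4, 70)], bar_ticks=8)
```

Because the matrix dimensions depend only on the pitch range and the bar length in ticks, every bar yields a matrix of the same size regardless of how many notes it contains.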

Experimental Objectives
We conducted two kinds of experiments: melody track extraction accuracy verification and bar-based encoding performance verification. The melody track extraction accuracy verification experiments determine whether melody tracks are accurately extracted from MIDI files. Verified MIDI melody files marked with "Melody" in the TrackNameEvent attribute are used for melody track extraction verification, and these MIDI files are part of the dataset [44]. To verify the performance of the proposed TFIDF-based melody track extraction method, the accuracy was determined as follows. First, the melody tracks were extracted from the verification MIDI file set. Second, the text parameter of the TrackNameEvent attribute, which is included in the metatags of each extracted melody track, was checked to assess whether its value was "Melody," and the accuracy was computed from the outcome. The bar encoding performance verification experiments assessed whether the bar information could be accurately represented by a matrix.

Experimental Data
The verification MIDI file set was used to support experiments intended to verify the melody track extraction accuracy. In the MIDI file quality verification experiment, the melody track set extracted by the melody track extraction method was encoded based on bars and used as the input data.
For the MIDI files, the data needed in the experiments were extracted with the Python MIDI library, as shown in Figure 6. The MIDI files consist of multiple tracks, and there is usually one melody track among them. A track has TrackNameEvent, NoteOnEvent, NoteOffEvent, and EndOfTrackEvent attributes. In Track #1, the NoteOnEvent and NoteOffEvent attributes show the start and end instants of a note. NoteOnEvent includes three types of information: start time, pitch, and intensity. Conversely, NoteOffEvent includes duration time, pitch, and intensity, with the intensity expressed as zero. EndOfTrackEvent indicates the end of the track. In the experiments, it is assumed that the collected MIDI files are good for health and that each MIDI file usually has one melody track.
Of the 2000 MIDI files, 763 had a TrackNameEvent parameter value of "Melody." For the melody track extraction accuracy verification experiments, 150 of these 763 files were used as the verification MIDI file set, from which 150 melody tracks were to be extracted. The melody tracks of the remaining 613 MIDI files were extracted and used as the comparison melody track set.
In the bar encoding performance verification experiments, the melody track set extracted by the TFIDF algorithm-based melody track extraction method was used as input. The melody track set consisted of 2000 melody tracks. As shown in Figure 6, the start time information included in NoteOnEvent is the time relative to the onset of the previous note. Because a single note is expressed as two events, NoteOnEvent and NoteOffEvent, a preprocessing stage is required before this information can be used as input in the quality verification experiments. Figure 7 shows the preprocessing stage. The start times in the event logs of the melody track are converted from relative time (red) to absolute time (black), and the start time (red), length (green), pitch (blue), and intensity (purple) of each note are stored in a single vector.
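The preprocessing stage can be sketched as follows. This is an illustrative outline rather than the authors' code: events are modeled as plain tuples instead of Python MIDI library objects, the name `events_to_notes` is hypothetical, and each event is assumed to store its time as an offset from the preceding event in the track, which is the usual convention in standard MIDI files.

```python
def events_to_notes(events):
    """Convert a track's event log into one vector per note.

    events: list of (kind, delta_ticks, pitch, intensity) tuples with kind
            "on" or "off" (a simplified, assumed representation).
    Returns a list of [start, length, pitch, intensity] vectors.
    """
    notes = []
    pending = {}  # pitch -> (absolute start tick, intensity) of a sounding note
    now = 0       # running absolute time in ticks
    for kind, delta, pitch, intensity in events:
        now += delta  # relative time -> absolute time
        if kind == "on" and intensity > 0:
            pending[pitch] = (now, intensity)
        elif pitch in pending:
            # A note-off, or a note-on with zero intensity, ends the note.
            start, note_intensity = pending.pop(pitch)
            notes.append([start, now - start, pitch, note_intensity])
    return notes


# Hypothetical event log containing two consecutive notes.
log = [("on", 0, 60, 90), ("off", 480, 60, 0),
       ("on", 0, 62, 80), ("off", 240, 62, 0)]
vectors = events_to_notes(log)
```

Pairing each note-on with its matching note-off is what allows the length to be stored next to the start time, so that a single vector fully describes a note.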

Experimental Results
This subsection describes the results of the extraction of the melody tracks from MIDI files using the TFIDF algorithm, as well as the performance of the proposed bar encoding method. Table 2 shows the results derived from the TFIDF algorithm and the filtering stage. Three features were used to filter the tracks separated from each MIDI file, starting with the track that had the lowest TFIDF value based on the comparison with the melody tracks. Threshold values were found for the three features: the average pitch, the average note lasting time, and the average note intensity. The purpose of using average values is to determine the range of the average pitch and average intensity for each melody track; in the related task of extracting melody tracks from MIDI files, the average value is often used as an important feature for analysis. Since the pitch and intensity of the notes in a MIDI file range from 0 to 127, the average pitch and average intensity must also lie between 0 and 127. The average pitch, average note lasting time, and average note intensity were extracted from the comparison melody track set to determine the threshold value of each feature for melody tracks. When features such as the maximum pitch, minimum pitch, maximum lasting time, minimum lasting time, maximum intensity, and minimum intensity are used for filtering, the melody track cannot be distinguished from the other tracks; these features therefore carry little information for the present task. Given that the minimum value of the extracted average pitches was 66, the pitch filtering value was set to 66. Additionally, with ticks converted to seconds for easy comparison, the average lasting times were between 0.3 s and 1.3 s, so the filtering range was set to 0.3 s to 1.3 s. Furthermore, given that the average intensities were above 60, the intensity filtering value was set to 60.
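The three feature-based filters can be expressed compactly. The sketch below is only an illustration under stated assumptions: the names `track_features` and `passes_filter` are hypothetical, a note is modeled as a (pitch, duration in seconds, intensity) tuple, and the threshold comparisons are taken to be inclusive, which the text does not specify.

```python
# Threshold values reported for the comparison melody track set.
MIN_AVG_PITCH = 66
AVG_DURATION_RANGE_S = (0.3, 1.3)
MIN_AVG_INTENSITY = 60


def track_features(notes):
    """Return (average pitch, average duration in s, average intensity).

    notes: non-empty list of (pitch, duration_seconds, intensity) tuples.
    """
    n = len(notes)
    avg_pitch = sum(pitch for pitch, _, _ in notes) / n
    avg_duration = sum(duration for _, duration, _ in notes) / n
    avg_intensity = sum(intensity for _, _, intensity in notes) / n
    return avg_pitch, avg_duration, avg_intensity


def passes_filter(notes):
    """True if the track's averages fall inside all three threshold ranges."""
    if not notes:
        return False  # assumption: a track without notes cannot be a melody
    avg_pitch, avg_duration, avg_intensity = track_features(notes)
    low, high = AVG_DURATION_RANGE_S
    return (avg_pitch >= MIN_AVG_PITCH
            and low <= avg_duration <= high
            and avg_intensity >= MIN_AVG_INTENSITY)


# Hypothetical tracks: a melody-like line and a low, bass-like line.
melody_like = [(72, 0.5, 80), (74, 0.5, 90)]
bass_like = [(40, 0.5, 80), (43, 0.5, 85)]
```

With these values, `passes_filter(melody_like)` holds while `passes_filter(bass_like)` does not, because the average pitch of the second track lies below 66.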
If a track passed the filtering stage without being rejected by any filter, it was considered a melody track and was extracted. If it did not pass the filtering stage, the track was skipped, and filtering was performed on the track with the next lowest TFIDF value. If no tracks passed the filtering stage, the track with the lowest TFIDF value was considered to be the melody track. The tracks marked with Y are the accurately extracted melody tracks, and the tracks marked with N are the inaccurately extracted melody tracks. Table 3 shows the accuracy of the proposed TFIDF-based melody track extraction method. The 150 verification MIDI files were used, and 150 tracks were extracted with the proposed method. Of these melody tracks, 142 tracks, or 94.7%, were extracted accurately.
The bar encoding performance verification experiments confirmed that the bar encoding method can accurately represent the four features of each note: pitch, start time, length, and intensity. Figure 8 shows the results of encoding the output of Figure 7. The encoded results are formulated within a fixed-size matrix, which here represents a bar consisting of four notes. The orange- and yellow-colored boxes show the results associated with playing notes at different intensities for a certain amount of time. In the note-based encoding method, the number of notes differs from bar to bar, so the encoding matrices differ in size. Encoding matrices of different sizes cannot be used as input for neural networks with a fixed input feature dimension. In the MIDI file, the start and end times are clearly specified for each note. When using these data to encode the bar matrix, the start time and end time of the note should remain unchanged.
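The track selection rule, in which candidates are tried in order of increasing TFIDF value and the lowest-valued track serves as the fallback, can be sketched as follows together with the accuracy calculation. The function names, the tuple layout of a track, and the stand-in filter used in the example are assumptions for illustration only.

```python
def select_melody_track(tracks, passes_filter):
    """Pick the melody track of one MIDI file.

    tracks: list of (track_id, tfidf_value, notes) tuples (assumed layout).
    passes_filter: callable that takes notes and returns True or False.
    """
    if not tracks:
        return None
    ranked = sorted(tracks, key=lambda track: track[1])  # lowest TFIDF first
    for track_id, _, notes in ranked:
        if passes_filter(notes):
            return track_id   # first track that clears every filter
    return ranked[0][0]       # fallback: the track with the lowest TFIDF value


def accuracy_percent(n_correct, n_total):
    """Share of correctly extracted melody tracks, in percent."""
    return 100.0 * n_correct / n_total


# Hypothetical file with three tracks; notes are reduced to bare pitches and
# the stand-in filter only checks the first pitch against the threshold of 66.
candidates = [("bass", 0.1, [40]), ("lead", 0.2, [70]), ("pad", 0.3, [72])]
chosen = select_melody_track(candidates, lambda notes: notes[0] >= 66)
reported = round(accuracy_percent(142, 150), 1)
```

In this example the bass track has the lowest TFIDF value but is rejected by the filter, so the next candidate is selected; 142 correct extractions out of 150 correspond to the reported 94.7%.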
A comparative experiment was performed between music generated based on notes and music generated based on bars. Music produced in units of bars was perceived to be of higher quality than music produced note by note. However, this paper focuses on the preparation process, that is, a preprocessing method suitable for most automatic music generation research based on recurrent neural networks.

Conclusions
Comfortable music can help people relieve stress, but collecting such music requires professional musical knowledge and a great deal of time. Deep learning has achieved great success in various fields, and it is also active in the field of automatic music generation. To help automatic generation systems based on deep learning produce music that is beneficial to human health, this paper proposed new preprocessing methods that perform melody track extraction and bar encoding on music stored as MIDI files. We introduced a TFIDF-based melody track extraction method: to extract the melody track from among the other tracks, the TFIDF algorithm, which operates on pitch frequency, was combined with three types of feature-based filtering. We also introduced a new bar-based encoding method: to express all the note information in a bar in the form of a matrix, four note features (i.e., pitch, start time, length, and sound intensity) were used to encode bars as matrices of the same size. With the proposed melody track extraction technique, 142 melody tracks were accurately extracted from 150 MIDI files, an accuracy rate of 94.7%. The bar matrices produced by the bar-based encoding method had the same size, unlike the matrices produced by note-based encoding methods. Correspondingly, the cost and standardization of the bar-based encoding method were verified.
In this paper, the TFIDF algorithm-based melody track extraction method and the bar encoding method were proposed to support research on automatic music generation based on deep learning. In future research, the comparison datasets will be classified by genre or composer. In this way, features for each genre or composer will be obtained, and these features can be used to extract melody tracks more accurately. For bar encoding, in addition to the information of the notes, the existing methods need to be improved to express the rhythm and speed of the music so that more musical features are captured.

Conflicts of Interest:
The authors declare no conflict of interest.

Figure 8. Encoded bar matrix.
