Classical Music Prediction and Composition by means of Variational Autoencoders

This paper proposes a new model for music prediction based on Variational Autoencoders (VAEs). In this work, VAEs are used in a novel way in order to address two different problems: representing music in the latent space, and using this representation to predict the future values of the musical piece. The model was trained with different pieces by a classical composer. As a result, the system can represent the music in the latent space and make accurate predictions. Therefore, the system can be used to compose new music either from an existing piece or from a random starting point. An additional feature of this system is that a small dataset was used for training. Nevertheless, results show that the system is able to return accurate representations and predictions on unseen data.


Introduction
Deep Learning has revolutionised art generation. Since the development of recent Deep Learning techniques, many advances have been published in this area. Most of them focus on image generation [1], but other areas such as text generation have also received considerable attention [2].
In music analysis, works using Deep Learning techniques to generate audio have begun to appear in recent years. However, the most widely used methods, Generative Adversarial Networks (GANs) and Long Short-Term Memory networks (LSTMs), are very complex models that require a very large training dataset and a long training time [3] [4]. Moreover, these models do not give the user any control over the piece of music generated by the system. A system that allows the user to change the piece being generated at any time would be desirable.
Other models used for this task are Variational Autoencoders (VAEs) [5]. This model combines an encoder and a decoder in order to transform a usually high-dimensional space into a lower-dimensional one. This new space is called the latent space, and its main feature is that any point in it can be decoded into a meaningful output. Therefore, any latent vector between two (or more) musical pieces will return a musical piece that mixes the properties of those pieces. This allows the user to change the values of the latent vector in order to get a similar musical piece. This feature can make VAEs a powerful technique for music composition, in which the user has more control and can change the melody being composed by the system at any moment.
VAEs have already been explored as systems for musical analysis [6]; however, their use for composition and prediction still has to be explored. This paper shows how VAEs can be used for music prediction, and therefore for music composition, from a small dataset. In this work, classical music was used for training the system and, as a result, the system is able to compose melodies in the same style as the dataset used for training. Moreover, unlike other systems, VAEs can be trained with a relatively small dataset. We will show that a small number of musical pieces is enough for the system to learn the particular style of a composer. Results show that the system performs on unseen data with the same accuracy as on the data used in the training process. Therefore, no large datasets are needed to develop this system and to learn the style of an author.
arXiv:1906.09972v1 [cs.SD] 21 Jun 2019
VAEs are applied here in an unusual way. Usually, the inputs and targets of a VAE are the same. In this work, the targets differ from the inputs so that the system can be trained to accomplish the objective of this work. Also, even though VAEs have already been used in the field of music analysis, the main approaches do not use pure VAEs with dense layers; instead, they are usually mixed with other techniques such as Convolutional Neural Networks (CNNs) [7]. This work presents an approach in which pure VAEs are used to perform classical music composition.
The rest of the paper is organised as follows: Section 2 contains a description of the most relevant works in this area. Section 3 provides a description of VAEs. Section 4 describes the method used in this work, with the description of the representation of the music in 4.1, the implementation of the VAE in 4.2, and the different metrics used for measuring the results in 4.3. Section 5 describes the experiments carried out in this work. Finally, sections 6 and 7 describe the main conclusions of this work and the future works that can be carried out from it.

State of the Art
Although music generation is an exciting task, to date few works have focused on it. These works mainly use Deep Learning techniques that allow new ways of signal processing. As new Deep Learning models have been developed, they have been applied to music generation.
One of the most widely used techniques in art generation is Generative Adversarial Networks, which consist of two different neural networks: a generator and a discriminator. The generator is used to generate new plausible examples from the problem domain, while the discriminator is used to classify examples as real or fake. This model has been successfully applied to image generation [8]. In the field of music generation, there are some works in which it has been applied, with some modifications to model the temporal structure and the multi-track interdependency of a song [9] [10] [3].
Other works are focused on Long Short-Term Memory (LSTM) networks, which are networks with recurrent connections broadly used for temporal processing [11]. Simple LSTMs were used for musical transduction [4] to implement a pitch detection system. In another work, LSTMs were combined with VAEs for music generation [12]. In this work, 2-layer LSTMs with 1024 neurons per layer were used for the decoder.
Autoencoders (AEs) have also been used as models in the field of music analysis. For instance, they were used to synthesize musical notes, having as input raw audio instead of pitches [13]. Thus, the focus of this work was not to generate music compositions but independent raw musical notes.
VAEs have also been used for music-based tasks. For instance, in [6] and [14], VAEs were used for timbre studies. In these works, the Short-Time Fourier Transform (STFT) in combination with a Discrete Cosine Transform (DCT), and the Non-Stationary Gabor Transform (NSGT), were used to preprocess the audio before the application of the VAE. In another work, Convolutional Neural Networks have been used to extract features. However, its main goal was not to compose music, but to perform audio-to-MIDI alignment, audio-to-audio alignment, and singing voice separation [7].
In another work, VAEs are combined with LSTMs used as encoders, and recurrent neural networks (RNNs) as decoders [12]. In a recent publication, VAEs were used for sound modeling [15]. That work compares different models, including VAEs. However, it works with raw audio instead of MIDI files, since its objective is to model sound and not to generate music. VAEs have also been used for processing speech signals with the objective of modifying some attributes of the speakers [16] [17].

Variational Autoencoders
Variational Autoencoders arose from the evolution of autoencoders [18] [5] [19]. Both techniques aim to encode a set of data into a smaller vector, and then to reconstruct the original data from this vector. In AEs, each data point in the input space is mapped to a smaller vector, which is a point in a new space with fewer dimensions called the latent space. Usually, Multilayer Perceptrons (MLPs) are used for the encoding and decoding tasks. This approach allows for interesting tasks such as information compression. However, in this latent space, points other than those obtained from an input do not usually correspond to a meaningful output in the original space.
That is the reason why VAEs try to build a latent space in which all of the latent vectors corresponding to inputs are close to each other, and the points in the space between them correspond to outputs that are meaningful in the original space. VAEs allow powerful representations while being a simple and fast learning framework.
The objective of Variational Autoencoders is to find the underlying probability distribution of the data p(x), where x is a vector in a high-dimensional space. A set of latent variables z, in a lower-dimensional space, is also considered. The model is defined by the joint probability distribution

$$p(x, z) = p(x|z)\,p(z) \tag{1}$$

The function p(x|z) represents a probabilistic decoder that models how the generation of observed data x is conditioned on the latent data z. The function p(z) represents the probability distribution of the latent space and is usually modeled by a standard Gaussian distribution.
The posterior p(z|x) is intractable and is approximated with a model q(z|x). This model works as an encoder and, for a specific x, emits two vectors, µ and σ, that represent the mean and standard deviation of the Gaussian distribution over the latent space for that data.
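The optimization problem consists of minimizing the Kullback-Leibler (KL) divergence between the approximation and the original density. Using Bayes' rule, the expression to minimize is the following:

\begin{align}
D_{KL}(q(z|x) \,\|\, p(z|x)) &= \mathbb{E}_{q(z|x)}[\log q(z|x) - \log p(z|x)] \tag{2} \\
&= \mathbb{E}_{q(z|x)}[\log q(z|x) - \log p(x|z) - \log p(z) + \log p(x)] \tag{3}
\end{align}

As can be seen, the system is based on two parts: q(z|x) encodes the data into the latent representation z, and p(x|z) is a decoder that generates data x from a latent vector z. Since p(x) does not depend on z, log p(x) can be taken out of the expectation, and this equation can be rewritten as follows:

$$\log p(x) - D_{KL}(q(z|x) \,\|\, p(z|x)) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z)) \tag{4}$$

According to Eq. 4, the objective can be changed into maximizing the marginal log-likelihood log p_θ(x) over a training dataset of vectors x. In practice, the quantity that is maximized is a lower bound of this value, called the evidence lower bound (ELBO), which is related to the marginal log-likelihood by:

$$\log p_\theta(x) = D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) + \mathcal{L}(\phi, \theta, x) \tag{5}$$

In Eq. 5, φ denotes the parameters of the encoder and θ the parameters of the decoder (weights and biases); D_KL is the Kullback-Leibler divergence, which is non-negative, and ℒ(φ, θ, x) is the variational lower bound. This value is calculated by the following equation:

$$\mathcal{L}(\phi, \theta, x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) \tag{6}$$

The first term of Eq. 6 represents the average reconstruction accuracy obtained by the system when using the approximation q instead of p.
The second term represents the error made by using q(z|x) instead of p(z) and regularizes the approximation q so that it stays close to the prior distribution.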
Often, a weight β is introduced in the second term, leading to the β-VAE formulation. This modification can lead to better results [5]; however, some works suggest modifying this parameter during the training process [19]. This term controls the trade-off between output signal quality and compactness/orthogonality of the latent coefficients z.
The first term of the previous equation is approximated with the average value of log p(x|z_l), where, for each data point x in the training set, the z_l are L samples drawn from the distribution q(z|x). Thus, the calculation of this term becomes

$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x|z_l) \tag{7}$$

The values z_l are taken from the distribution N(µ(x), σ(x)), where µ(x) and σ(x) are the outputs of the encoder with x as input.
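As an illustration of the loss described above, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes that the encoder outputs the mean and the log-variance of the latent Gaussian (a common parameterization, whereas the text above uses the standard deviation σ), that the decoder likelihood is Bernoulli over binary targets, and that `decode` is a hypothetical function mapping latent vectors to probabilities. All function names are illustrative.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) using z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x|z) for binary targets x and decoder probabilities x_hat."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=-1)

def negative_elbo(x, decode, mu, log_var, rng, beta=1.0, n_samples=1):
    """Monte Carlo estimate of -(E_q[log p(x|z)] - beta * KL), one value per example."""
    rec = np.mean(
        [bernoulli_log_likelihood(x, decode(sample_latent(mu, log_var, rng)))
         for _ in range(n_samples)],
        axis=0,
    )
    return -(rec - beta * kl_to_standard_normal(mu, log_var))
```

With `beta=1.0` this reduces to the plain variational lower bound with the sign flipped, so that it can be minimized; other values of `beta` give the β-VAE weighting.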

Representation
The model proposed in this paper represents the music as a binary matrix M with dimensions n × t, where n is the number of pitches and t is time. In this work, a time step of 100 milliseconds was used. Therefore, M_ij = 1 for those moments j in which a pitch i is being played, and 0 when that pitch is not being played.
In this codification, the velocity (i.e., the volume) of the note events has not been taken into consideration.
This matrix is built from a MIDI file that contains the notes and the duration of each one. From these files, the notes are read and the matrix M is built. A note with a pitch i that begins in the instant t and has a duration of 100 * d milliseconds will be situated in row i, and in the columns from t to t + d. In the case of having two consecutive or overlapping notes with the same pitch, the matrix M would have, in that row, a series of 1s from the beginning of the first note to the end of the second, with no distinction between one long note and two different notes. To make this distinction, the value of M at the moment previous to the second note is set to 0.
To make the dataset, 14 different compositions by Handel were used. Although this may seem a very low number, one of the objectives of this work is to show that this system can learn the representation in the latent space from a small dataset, and that the resulting model can correctly learn the style of an author. These compositions were encoded into binary matrices as described before.
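The construction of M can be sketched as follows. This is a hedged illustration rather than the code used in this work: it assumes the notes have already been read from the MIDI file into hypothetical `(pitch, start_step, duration_steps)` tuples expressed in 100-millisecond steps, and it fills the half-open column range from `start` to `start + duration`, which is one possible reading of "from t to t + d".

```python
import numpy as np

def build_piano_roll(notes, n_pitches, n_steps):
    """Build the binary matrix M (n_pitches x n_steps) from a list of notes.

    notes: iterable of (pitch, start_step, duration_steps) tuples (assumed format).
    """
    M = np.zeros((n_pitches, n_steps), dtype=np.uint8)
    # Process notes in order of onset so that repeated pitches are seen in sequence.
    for pitch, start, duration in sorted(notes, key=lambda note: note[1]):
        # If the same pitch is already sounding just before this onset,
        # clear the previous step to mark the start of a new note.
        if start > 0 and M[pitch, start - 1] == 1:
            M[pitch, start - 1] = 0
        M[pitch, start:start + duration] = 1
    return M
```

For example, two back-to-back notes on the same pitch lasting 3 and 2 steps produce the row `1 1 0 1 1`, so the second onset remains visible in the matrix.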

Autoencoder implementation
Once the matrices were obtained, the inputs and targets of the VAE were built. In a traditional approach, the inputs and targets of the VAE are the same vectors. However, in this particular case, a different approach was used. The inputs were a time window of T seconds of the matrix M, i.e., a set of consecutive columns reshaped as a vector.
An overlap of T − 1 seconds was used for building different inputs from the same matrix. In this case, the targets for these inputs were not the same vector. Instead, the target for each input was the following input. Since an input and the following one are shifted by 1 second, T − 1 seconds of each target correspond to values in its input, and 1 second of the target contains new musical notes.
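The windowing scheme described above can be sketched as follows. The function name `make_pairs` and its parameters are illustrative assumptions; the 100-millisecond time step (10 columns per second) comes from the representation section.

```python
import numpy as np

STEPS_PER_SECOND = 10  # 100 ms time step

def make_pairs(M, window_seconds, hop_seconds=1.0):
    """Build (input, target) pairs where each target is the window one hop later.

    M: binary matrix of shape (n_pitches, n_steps).
    Returns two arrays of shape (n_examples, n_pitches * window_steps).
    Assumes M is long enough to contain at least one window plus one hop.
    """
    width = int(round(window_seconds * STEPS_PER_SECOND))
    hop = int(round(hop_seconds * STEPS_PER_SECOND))
    starts = range(0, M.shape[1] - width - hop + 1, hop)
    inputs = np.stack([M[:, s:s + width].reshape(-1) for s in starts])
    targets = np.stack([M[:, s + hop:s + hop + width].reshape(-1) for s in starts])
    return inputs, targets
```

With `hop_seconds=1.0`, consecutive windows overlap by T − 1 seconds, and the last 10 columns of each target are the only part not present in its input.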
With this approach, we want the VAE to learn not only the dependencies between notes that allow a representation in the latent space, but also the dependencies with the next second of music. Therefore, the VAE is intended to solve two different problems: music reconstruction and music prediction. Once T seconds of music are encoded into a latent vector, decoding this vector returns the following T seconds, which overlap the input by T − 1 seconds. Therefore, once this VAE is trained, the generation of new musical compositions is very simple. First, T initial seconds of music are taken. These seconds may be taken from any existing composition or obtained by decoding any random vector in the latent space. Then, these seconds are used as inputs to the VAE to generate T seconds, of which the first T − 1 overlap the input. Thus, 1 second of new music is generated. These T seconds can be used again as inputs to generate a new second, and therefore a loop is built that can have as many iterations as seconds are needed.
It is important to bear in mind that this system does not compose songs with a structure. Instead, it is able to complete a composition in those parts in which the music may be missing. If the system is used for this purpose, then the T initial seconds will be an existing part of the composition, and the system will try to continue this composition in the same way as Handel would have done.
Moreover, the part given by the system is not unique. Different parts can be returned if the vector is modified in the latent space. In this sense, this system allows the composition of different melodies.
Since a VAE is used, a loss function has to be given. This loss function was defined in section 3. In this implementation, sigmoid functions are used in the output layer, and therefore the system returns values between 0 and 1. In order to build the music representation as a boolean matrix as described in section 4.1, a threshold must be used. Experimentation with different thresholds has to be done in order to find the one that returns the best results.
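The generation loop described above could look like the following sketch. Here `predict_next` is a hypothetical stand-in for the trained VAE (encode the window, then decode the latent vector), and the default threshold of 0.5 is only a placeholder, since the threshold has to be selected experimentally.

```python
import numpy as np

def generate(seed_window, predict_next, n_seconds, steps_per_second=10, threshold=0.5):
    """Extend a seed window one second at a time.

    seed_window: binary array of shape (n_pitches, T * steps_per_second).
    predict_next: function mapping a flattened window to the flattened
        probabilities of the following window (assumed interface).
    Returns the seed followed by n_seconds of generated music.
    """
    n_pitches, width = seed_window.shape
    window = seed_window.astype(np.uint8)
    roll = [window]
    for _ in range(n_seconds):
        probs = predict_next(window.reshape(-1)).reshape(n_pitches, width)
        # Binarize the output; the whole window is fed back in the next iteration.
        window = (probs >= threshold).astype(np.uint8)
        # Only the last second of the output is new music.
        roll.append(window[:, -steps_per_second:])
    return np.concatenate(roll, axis=1)
```

Perturbing the latent vector inside `predict_next` before decoding would be one way to obtain different continuations from the same seed.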

Performance measures
In order to measure the behaviour of this system, the outputs generated by the system, after the application of the threshold, must be compared with the targets. Since both outputs and targets are boolean values, this comparison can be done by means of Accuracy (ACC), Sensitivity (SEN) and Positive Predictive Value (PPV).
These values can be calculated from a confusion matrix. This matrix is built from 4 values:
• True Positives (TP) is the number of pitches and time steps correctly played.
• False Positives (FP) is the number of pitches played in a time step in which there should be silence.
• True Negatives (TN) is the number of pitches and time steps in which no notes are played and there should be silence.
• False Negatives (FN) is the number of pitches and time steps in which no notes are played but a note should be played.
From these values, ACC, SEN and PPV can be calculated with the following equations [20]:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

$$SEN = \frac{TP}{TP + FN} \tag{9}$$

$$PPV = \frac{TP}{TP + FP} \tag{10}$$

Accuracy (ACC) represents the ability of the model to play values adequately. Sensitivity (SEN) represents the ability to play the true notes, even if some additional notes are played. Positive Predictive Value (PPV) is the probability that the notes played by the system correspond to true notes, even if some true notes are not played.
Since SEN and PPV are both important measures, a good trade-off between them is needed. These values, as well as ACC, highly depend on the threshold chosen. A low threshold leads to a high number of notes being played, even if many of them do not correspond to the original piece. A high threshold leads to a low number of notes, but with a high probability that they belong to the original piece.
Often, these two measures (SEN, PPV) are summarized in a single metric called the F1-score. This metric is the harmonic mean of SEN and PPV, and it is usually more informative than accuracy on imbalanced binary data, as is the case of this dataset [21].
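The four measures can be computed from binarized outputs and targets as in the following sketch. The function names are illustrative, and returning 0 when a denominator is zero is a convention assumed here, not something stated in this work.

```python
import numpy as np

def confusion_counts(pred, target):
    """Return (TP, FP, TN, FN) for two boolean arrays of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))
    fp = int(np.sum(pred & ~target))
    tn = int(np.sum(~pred & ~target))
    fn = int(np.sum(~pred & target))
    return tp, fp, tn, fn

def scores(pred, target):
    """Accuracy, sensitivity, positive predictive value and F1-score."""
    tp, fp, tn, fn = confusion_counts(pred, target)
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    # F1 is the harmonic mean of sensitivity and positive predictive value.
    f1 = 2 * sen * ppv / (sen + ppv) if sen + ppv else 0.0
    return {"ACC": acc, "SEN": sen, "PPV": ppv, "F1": f1}
```

Because most entries of the matrix are silence, TN dominates the counts, which is why ACC can remain high even when few notes are predicted correctly.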

Experiments
In order to develop the system described in section 4.2, different experiments were carried out to set the values of the different parameters. As was already said, the system tries to predict 1 second of music from T − 1 seconds of music. Therefore, the value of T is an important parameter. Low values do not give enough information to predict the music. On the other hand, values that are too large may give too much information, make the training process too slow, and cause overfitting to the training set.
The experimentation was performed with values of T = 2 to T = 10 seconds. Additionally, a value of T = 1 was also chosen, in order to make predictions of 0.5 seconds instead of 1 second.
With respect to the architecture of the network, in all cases the encoder was an MLP with one hidden layer, and the decoder had the same architecture. An important parameter is the number of neurons in the hidden layers. Experiments with 500 and 750 neurons were performed.
Another important parameter was the dimension of the latent space. Values of 100 and 200 were chosen for this parameter. Finally, a value of β = 0.5 was used in these experiments.
The dataset was divided leaving 20% of the data for testing. This 20% was chosen from compositions different from those in the training set.
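To make the layer sizes concrete, the following sketch builds an untrained dense VAE with one hidden layer in the encoder and in the decoder. Several details are assumptions, since they are not stated in this work: the tanh hidden activation, the random initialization, the log-variance output of the encoder, and the class and function names. Only the sigmoid output layer and the hidden and latent sizes come from the text.

```python
import numpy as np

def dense(rng, n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    weights = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
    return weights, np.zeros(n_out)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenseVAE:
    """Untrained MLP encoder and decoder, each with a single hidden layer."""

    def __init__(self, n_inputs, n_hidden=750, n_latent=200, seed=0):
        rng = np.random.default_rng(seed)
        self.enc_hidden = dense(rng, n_inputs, n_hidden)
        self.enc_mu = dense(rng, n_hidden, n_latent)
        self.enc_log_var = dense(rng, n_hidden, n_latent)
        self.dec_hidden = dense(rng, n_latent, n_hidden)
        self.dec_out = dense(rng, n_hidden, n_inputs)

    @staticmethod
    def _apply(layer, x):
        weights, bias = layer
        return x @ weights + bias

    def encode(self, x):
        """Return the mean and log-variance of q(z|x)."""
        h = np.tanh(self._apply(self.enc_hidden, x))
        return self._apply(self.enc_mu, h), self._apply(self.enc_log_var, h)

    def decode(self, z):
        """Return per-entry probabilities in (0, 1) for the next window."""
        h = np.tanh(self._apply(self.dec_hidden, z))
        return sigmoid(self._apply(self.dec_out, h))
```

As a usage example under a further assumption of 88 pitches, a window of T = 9 seconds at 10 steps per second gives 88 × 90 = 7920 inputs, so the model would be created with `DenseVAE(n_inputs=7920, n_hidden=750, n_latent=200)`.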
The system was trained with all of the parameter configurations described. For each configuration, different thresholds were used in order to select the best threshold and configuration. Figure 1 shows the results obtained for each configuration. The plots on the left show the F1-scores and accuracies obtained with the best threshold for each configuration on the training set. The plots on the right show the F1-scores and accuracies obtained on the test set with the thresholds given by the left plots. As can be seen, accuracies and F1-scores are very high on both training and test sets. From this figure, a configuration with a window size of 9 seconds, 750 hidden neurons and a latent dimension of 200 was chosen, since it returned the best results on the training dataset.
Figure 2 shows the results obtained with different thresholds for this configuration. This figure shows sensitivity, PPV and F1-score on the training and test sets. As can be seen, low threshold values lead to high sensitivity and low PPV. On the other hand, a high threshold leads to low sensitivity and high PPV. Therefore, a trade-off between these two values is needed. For this reason, the F1-score was used. As a result, the value with the highest F1-score on the training set, 0.41, was used as the threshold. The right plot shows the sensitivity, PPV and F1-score for the test set.
As these figures show, the test results are very close to the training results, even with a small dataset, and very little overfitting is observed in these graphs. This leads to the conclusion that the system has learned the features from a small set of compositions by this author, and that the representation of these pieces is robust and can be applied to new compositions by the same author.
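The threshold selection described above can be sketched as a simple sweep over candidate values on the training outputs. The grid of candidates (0.01 to 0.99 in steps of 0.01) and the function names are assumptions made for illustration.

```python
import numpy as np

def f1_score(pred, target):
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of SEN and PPV."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.sum(pred & target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def best_threshold(probs, targets, candidates=None):
    """Return the candidate threshold with the highest F1-score and that score."""
    if candidates is None:
        candidates = np.linspace(0.01, 0.99, 99)
    probs = np.asarray(probs)
    f1_values = [f1_score(probs >= t, targets) for t in candidates]
    best = int(np.argmax(f1_values))
    return float(candidates[best]), f1_values[best]
```

The threshold would be selected on the training outputs only and then applied unchanged to the test outputs, which matches the procedure described for the left and right plots.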
However, in this work, VAEs are applied not only to encode and reconstruct a part of a composition, but also to predict the next second of the composition. Therefore, two different problems are studied here: reconstruction and prediction. Figure 3 shows the F1-scores and accuracies obtained for the selected configuration and threshold value, measured separately for the 8-second reconstruction and the 1-second prediction. As expected, the F1-score and accuracy of the prediction are lower than those of the reconstruction. However, they are only slightly lower. This figure shows training results on the left and test results on the right. Therefore, the system behaves well when predicting a musical part from this author.
According to the results shown in Figure 3, the predicted music is very close to the real music. In this sense, it is interesting to see whether this prediction gets better or worse as the predicted moment moves further from the last known moment. Figure 4 shows the F1-scores and accuracies obtained for the different moments in the one-second prediction. As a time step of 100 milliseconds was used, this figure only has 10 values on the x-axis. As can be seen in this figure, F1-score and accuracy do not seem to be influenced by the moment of the prediction until it reaches the one-second limit. At this moment, the quality of the prediction drops.
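The separate evaluation of reconstruction and prediction, and of each time step within the predicted second, can be sketched as follows. It assumes that each output is a flattened window of shape (n_pitches, T × 10) stored pitch by pitch, so that the last 10 columns are the predicted second; the function names are illustrative.

```python
import numpy as np

def f1(pred, target):
    """F1-score of two boolean arrays of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.sum(pred & target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def f1_by_region(pred, target, n_pitches, steps_per_second=10):
    """Return F1 for the overlapping part, the new second, and each new step.

    pred, target: binary arrays of shape (n_examples, n_pitches * window_steps).
    """
    pred = np.asarray(pred).reshape(len(pred), n_pitches, -1)
    target = np.asarray(target).reshape(len(target), n_pitches, -1)
    k = steps_per_second
    reconstruction = f1(pred[:, :, :-k], target[:, :, :-k])
    prediction = f1(pred[:, :, -k:], target[:, :, -k:])
    # One score per 100 ms step of the predicted second, earliest first.
    per_step = [f1(pred[:, :, -k + j], target[:, :, -k + j]) for j in range(k)]
    return reconstruction, prediction, per_step
```

The first two returned values correspond to the kind of comparison shown in Figure 3, and the list of per-step scores to the kind of curve shown in Figure 4.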

Conclusions
This work shows the possibility of using VAEs for music analysis to solve two different problems: music encoding and reconstruction, and music prediction. Two different metrics have been used and, as the results show, both problems have been successfully solved, since the results are very high in both metrics. As could be expected, the prediction problem shows worse results than the reconstruction problem. However, these results are only slightly worse. This means that the learned features can make accurate predictions.
Therefore, it can be concluded that the features learned by the VAE correctly represent the style of the author. Moreover, this system was developed so that decoding the features corresponding to a specific piece leads to the following piece of music. As this next piece of music can in turn be encoded and decoded into the following one, and so on, a composition can be represented in the latent space as a trajectory between different vectors in this space. Further analysis of these trajectories can give new insights about the composition and style of different authors.
The development with a small dataset is one of the most prominent features of this work. Although this could be considered a big drawback for the training of the VAE, test results are comparable to training results. Therefore, the system behaves correctly with unseen pieces of music, returning their features in the latent space and giving a prediction of the following second.

Future Works
From the work presented in this paper, different directions can be taken. First, in the modeling of the audio, the velocity (volume) has not been taken into account. A new model could be developed in which the output of each neuron, a real value between 0 and 1, can be interpreted as the volume of that pitch at that specific moment.
As was already explained in section 6, a musical composition can be represented as a trajectory in the latent space. This system could be trained with compositions from different authors. Once high enough accuracies are obtained in both training and test sets, these trajectories could be analyzed in order to discover the differences between authors. This could be combined with a clustering technique to discover interesting patterns in music composition.
Finally, as shown in section 5, the quality of the prediction seems to drop when it approaches 1 second. Further studies should be carried out in order to find out whether this prediction can be improved with a larger network architecture.

Figure 1: F1 score and accuracy for each configuration

Figure 2: Sensitivity, PPV and F1-score for the different thresholds in the selected configuration

Figure 3: Comparison of the results in the reconstruction and prediction problems

Figure 4: Results in prediction on different moments