Generating Music Transition by Using a Transformer-Based Model

Abstract: With the prevalence of online video-sharing platforms increasing in recent years, many people have started to create their own videos and upload them onto the Internet. In filmmaking, background music is one of the major elements besides the footage. With matching background music, a video can not only convey information, but also immerse the viewers in the setting of a story. A video often uses not one piece of background music but several, which is why audio editing and music production software are required. However, music editing requires professional expertise, and it can be hard for amateur creators to compose ideal pieces for a video. At the same time, there are online audio libraries and music archives for sharing audio/music samples. For beginners, one possible way to compose background music for a video is "arranging and integrating samples", rather than making music from scratch. This approach raises a problem: gaps may remain between samples, and transitions must be generated to fill them. In our research, we build a transformer-based model for generating a music transition to bridge two prepared music clips. We design and perform experiments to demonstrate that our results are promising. The results are also analysed by using a questionnaire, which reveals a positive response from listeners and supports that our generated transitions are suitable as background music.


Introduction
In the past, people used cameras and camcorders to take pictures of things in their lives. Nowadays, along with advances in technology, mobile devices have become more prevalent, and many people use smartphones instead of cameras and camcorders to record the details of their lives. These recorded images can not only be posted on Facebook, Instagram, and other social media platforms, but can also be made into videos and uploaded to YouTube and other video-sharing sites so that people can share their daily lives with others.
In the face of the current trend of photography and video creation, much video editing software is available on the market, such as PowerDirector, iMovie, and Quik, which allows users to produce a video in a short time through simple operations. During production, we need not only the filmed materials but also suitable background music. Many Internet platforms offer free soundtracks, but most of them are monotonous and repetitive. If we want the music to match the changing plot of a film, we need to select multiple pieces of music and edit them ourselves. However, music editing requires domain knowledge, which makes it challenging for amateurs and the general public to produce materials of a professional standard.
With the rise of Artificial Intelligence (AI) in recent years, people have gained easier access to professional knowledge and technology in music creation. Some companies and scholars have developed AI-based algorithms for automatic composition that customize music according to user-specified musical styles, lengths, and beats. Examples can be found on platforms such as Jukedeck and AIVA. Others use a short piece of user-given music to enable AI-based algorithms to generate subsequent parts, such as Music Transformer and MuseNet. Our project (sponsored by the Ministry of Science and Technology, Taiwan, under Contract No. MOST-108-2221-E-030-013-MY2) makes use of transformer-based algorithms to build a score generation system for short videos that allows the public to easily create their own video background music.

Research Background
In our MOST Project (MOST-108-2221-E-030-013-MY2), we design a soundtrack generation system for short videos. The proposed system consists of two main parts: video analysis and soundtrack generation.
In the video analysis, the video provided by the user undergoes several image analyses, and the features are extracted for subsequent soundtrack generation, as shown in Figure 1. At the beginning of the process (① in Figure 1), the system divides the user-given video into multiple shots by shot detection approaches, then finds the key frames (② in Figure 1) of the segmented shots, such as KF1, KF2, and KF3. From these key frames, the system finds the corresponding tags (③ in Figure 1), which consist of descriptive words for "style", "tone", "emotion", "feeling", etc. (e.g., Tag1 and Tag2 in Figure 1). Image analysis technology is also used to detect editing patterns, which produce the event tags (e.g., Event 1 and Event 3 in Figure 1). Users can also add other tags during the process as they wish. Examples are tags for "playing the music of a specific theme at the time point of $T_1$" (e.g., 'Theme' in Figure 1) or tags for "maintaining silence at the time point of $T_5$" (e.g., 'Silent' in Figure 1).
After video analysis, the system retrieves all the tags and organizes them as a sequence of tags associated with time stamps, where each element of the sequence is composed of a time stamp and a tag set ($T_n$, {tag, ...}). The following is an example of a sequence of tags: Video → {($T_1$, {Tag1, Tag2, Theme}), ($T_2$, {Event1}), ..., ($T_8$, {Theme})}
In soundtrack generation, the matching music clips are selected from the pool, and these clips are combined into a complete soundtrack to go with the user's short video. The process is shown in Figure 2.
In this system, we build a music pool, and each music segment in the pool is analysed beforehand and associated with a series of related tags. The corresponding tags and music clips are maintained in the pool for later use in soundtrack generation.
At the beginning of soundtrack generation (④ in Figure 2), we perform the "match process" on the pool based on the tag sequences recorded by the video analysis and find candidate music segments to be arranged on the timeline according to the time stamps in the sequences. Next, music fusion (⑤ in Figure 2) is conducted using these candidate segments. The fusion steps include "resolving conflicts between candidate segments" (e.g., $MS_2$ vs. $MS_4$ and $MS_5$ vs. $MS_3$ in Figure 2) and "creating music transition" (e.g., $MS_1$ vs. $MS_2$ and $MS_3$ vs. $MS_7$ in Figure 2). Lastly, we add chords and orchestration to the fused single-track score, coupled with arrangement and audio mixing (⑥ in Figure 2), to make a multi-track score that completes the whole short-video soundtrack.

Research Objectives
In this paper, we focus on the "create music transition" process in the framework through the use of AI algorithms. Given two music segments in symbolic sequence form, say $MS_1$ and $MS_2$, we design a transformer-based model to generate a music transition sequence that bridges $MS_1$ and $MS_2$.

Related Work
In this section, we introduce the related research topics, including MIDI communications protocol, automatic composition algorithms, and the transformer deep learning model.

MIDI
The Musical Instrument Digital Interface (MIDI) is a technical standard for an electronic communications protocol that allows electronic instruments and computers to coordinate the playing of music through codes. MIDI files are composed of a header chunk followed by one or more track chunks. The header chunk records the basic data of the entire file, while each track chunk records the data of the music played through three types of events: the MIDI event, the sysex event, and the meta event. In the MIDI event, there are seven functions that control the sound playback to correspond to the movement of the instrument, as shown in Table 1.
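The chunk layout described above can be illustrated with a minimal sketch that reads only the header chunk. It assumes the Standard MIDI File layout (a 4-byte `MThd` type, a 32-bit length, then three 16-bit big-endian fields); the function name `parse_midi_header` and the hand-built example bytes are hypothetical and are not part of the system described in this paper.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Parse the header chunk that opens a Standard MIDI File."""
    # A chunk starts with a 4-byte ASCII type and a big-endian 32-bit length.
    chunk_type, length = struct.unpack(">4sI", data[:8])
    if chunk_type != b"MThd" or length < 6:
        raise ValueError("data does not start with a MIDI header chunk")
    # The header body holds three big-endian 16-bit fields.
    file_format, num_tracks, division = struct.unpack(">HHH", data[8:14])
    return {"format": file_format, "num_tracks": num_tracks, "division": division}


# Hand-built header for illustration: format 1, two tracks, 480 ticks per quarter note.
example_header = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
header_info = parse_midi_header(example_header)
print(header_info)
```

The track chunks that follow the header carry the MIDI, sysex, and meta events; parsing them requires variable-length delta times and is left out of this sketch.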

Automatic Composition
In Fernández's paper [1], algorithms for automatic music composition are categorized into four main groups, Symbolic AI (grammars and rule-based systems), Machine Learning (Markov Chains and Artificial Neural Networks), Optimization Techniques (Evolutionary Algorithms), and Complex Systems (self-similarity and cellular automaton), as shown in Figure 3.
Symbolic AI applies music syntax to generate music. In earlier days, music syntax was obtained by humans from music theory or existing scores; after the 1980s, computational approaches were proposed to extract music syntax. After obtaining the rules of syntax, it is possible to automatically generate music through algorithms such as the L-system [2-5] or Evolutionary Algorithms [6][7][8][9], and also to build a rule-based system [10] for generating music. Machine learning, which has been popular in recent years, uses extensive existing music data to learn music with algorithms such as Markov Chains [11][12][13] or Artificial Neural Networks [14][15][16]. Evolutionary Algorithms are often used in Optimization Techniques, where the fitness function selects the best candidate among many after several generations. The fitness function in Evolutionary Algorithms can be evaluated automatically using rule-based methods or artificial neural networks [17,18], or interactively based on feedback from test subjects [19,20]. In Complex Systems, it was found that sound signals composed of pink noise (1/f noise) sounded more like music. Researchers have used self-similarity to generate music [15], and others have created music through Cellular Automata (CA) [21,22].

Transformer
Transformer is a deep learning algorithm proposed in Vaswani's paper [23] for text translation, and has been widely used in natural language processing (NLP) in recent years.
The Transformer is built on an encoder-decoder architecture and uses the self-attention mechanism. Figure 4 illustrates the Transformer model. When the input sequence first enters the Transformer, it goes through an embedding layer, which transforms each one-hot token in the input sequence into a vector of another dimension. The converted input sequence is then processed with positional encoding before entering the Encoder or Decoder. In this way, the positional difference among tokens can be identified when self-attention is carried out. After this, the input sequence enters the Encoder or Decoder for computation.
The Encoder (as shown in Figure 5) consists of two sub-layers, the multi-head attention layer and the position-wise feed-forward layer, where the multi-head attention is the main part in which the Transformer carries out self-attention computing. Upon entering the multi-head attention layer, the input sequence $X$ is linearly transformed into multiple sets of queries ($Q_h$ for Query, $Q_h = W^Q_h X$), keys ($K_h$ for Key, $K_h = W^K_h X$), and values ($V_h$ for Value, $V_h = W^V_h X$) based on the number of heads. Each group of $Q_h$, $K_h$, and $V_h$ computes the attention between tokens via scaled dot-product attention. Following [23], the scaled dot-product attention is computed as follows:
$$A_h = \mathrm{softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{d}}\right) V_h \tag{1}$$
$A_h$ is the result of the scaled dot-product attention computing, and $d$ is the dimension size of $Q_h$, $K_h$, and $V_h$ (to avoid the large variance generated by the dot product of $Q_h$ and $K_h$, $d$ is used for scaling the result of the dot product). After computing, the multiple sets of results, $A_1, A_2, \ldots, A_h$, are concatenated, and the final output from the multi-head attention layer is calculated via a linear transformation.
The Decoder (as shown in Figure 5) consists of three sub-layers: the masked multi-head attention layer, the multi-head attention layer, and the position-wise feed-forward layer. The masked multi-head attention layer, which does not exist in the Encoder, features masking that prevents the Decoder from viewing future tokens in advance during training. Therefore, the tokens after the predicted target are masked, as shown in Mask (Opt.) in Figure 6. In the Encoder and Decoder, the position-wise feed-forward layer is composed of two feed-forward layers, which are connected after the multi-head attention layer. The sub-layers of the Encoder and the Decoder are connected using layer normalization and residual connections. The experimental analysis of Vaswani [23] found that the Transformer trains faster than recurrent and convolutional layers, and that the hidden layers in the Transformer can retain more information with the help of self-attention computing, which makes the training result from the Transformer much better than that from the RNN.
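To make the attention computation concrete, the following is a minimal single-head sketch in plain Python. It assumes that each row of `Q`, `K`, and `V` is one token vector; the `causal` flag imitates the masking used in the Decoder. The function names and the toy inputs are hypothetical and serve only as an illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention for one head; Q, K, V are lists of token vectors (one row per token)."""
    d = len(Q[0])
    output = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                # Masked position: a later token must not influence token i.
                scores.append(float("-inf"))
            else:
                # Dot product of query and key, scaled by sqrt(d).
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return output


tokens = [[1.0, 0.0], [0.0, 1.0]]
full_out = scaled_dot_product_attention(tokens, tokens, tokens)
causal_out = scaled_dot_product_attention(tokens, tokens, tokens, causal=True)
```

With the causal mask, the first token can attend only to itself, so its output equals its own value vector. A multi-head layer would run several such heads on differently projected inputs and concatenate the results.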
Since the introduction of the Transformer [23], the model has been applied in various fields. In music, for instance, Huang [25] proposed the Music Transformer, which uses a specified music sequence to generate subsequent music sequences. In text generation, Dai [26] proposed Transformer-XL, which can break the length limit of the Transformer [23] to generate longer texts. In terms of image generation, Child [27] proposed Sparse Transformers, which allow Transformers to process 2D data in addition to 1D data and adjust the attention mechanism to reduce the volume of model parameters for computation.
In reference to Figure 3, there are several possible approaches to generating music transitions, such as rule-based systems, evolutionary algorithms, and machine-learning-based algorithms. When applying a rule-based system or evolutionary algorithms, we need domain expertise in music theory to construct rules or fitness functions. Machine-learning-based algorithms, however, are data-driven and involve limited domain expertise. Meanwhile, the transformer is a promising method for sequence generation. We believe that machine-learning-based methods, including our approach, could be friendlier to beginners and easily applied for rapid development in the early stage. Certainly, in the following stages, domain expertise should always be appreciated and involved to polish and refine our method to achieve more pleasant music transition sequences.

Method
In this section, we introduce the data processing and model construction used in the study. Figure 7 illustrates the framework of our proposed methodology. The upper part of the framework is designed for the training and validation process to build the Transformer-based model. After model construction, in the test phase, the input data are two music segments, namely the preceding data and the following data; the output is a music transition sequence (MTS) that bridges the two user-given segments.

Data Pre-Processing
This section explains the music data representation method which converts music MIDI files into the specific data formats, and then properly organizes them as a training dataset for model construction.

REMI
In this paper, the REMI representation proposed by Huang [28] is used to convert music MIDI files into the REMI-defined data format. Before the REMI representation was made public, most recent studies used the MIDI-like event representation proposed by Oore [29], which converts music MIDI files into a data format compatible with the MIDI-like event representation to train models. The MIDI-like event representation converts the MIDI events in the MIDI file into four corresponding token events: Note-On (trigger note play), Note-Off (end note play), Note Velocity (note play force), and Time-Shift (time difference between token events).
Huang's study [28] identified several features of MIDI-like event representations. For example, MIDI-like event representations convert music MIDI files in a way that faithfully represents keyboard-style music (e.g., piano music). As a result, the converted data lack the high-level musical information, such as Downbeat, Chord, and Tempo, presented in the score. Huang [28] also stated that "We note that when humans compose music, we tend to organize regularly recurring patterns and accents over a metrical structure defined in terms of sub-beats, beats, and bars" [28] (p. 2); therefore, they proposed REMI to resolve the issues of MIDI-like event representations.
In reference to Table 2, the REMI representation retains the original MIDI-like event representation's Note-On and Note Velocity, and adds token events that include Note Duration, Position & Bar, Tempo, and Chord. The newly added Note Duration replaces Note-Off in the MIDI-like event representation. In the MIDI-like event representation, the playing of a note is based on its Note-On, its Note-Off, and the Time-Shift accumulated in between, and there are often many other token events between Note-On and Note-Off (in Huang's practice [28], there is an average of 21.7 ± 15.3 token events in between). The REMI representation only needs Note-On and the adjacent Note Duration to determine a note's beginning and ending, so that a model in training can easily learn this feature. The REMI representation also replaces Time-Shift in the MIDI-like event representation with the newly added Position & Bar. Time-Shift in the MIDI-like event representation marks the time difference between token events. Huang's study found that models trained with Time-Shift could not generate music with a steady beat, and Huang attributed this to the lack of metrical structure in the MIDI-like event representation. Therefore, in the REMI representation, Position & Bar is used to represent token events' absolute positions on the score. In subsequent experiments, they also found that the models could easily learn the dependency of note events on the same Position through Position & Bar. Lastly, REMI's newly added Tempo and Chord supplement the higher-level musical data that the MIDI-like event representation lacks. Figure 8 illustrates an example of the REMI representation.

Data Processing
The goal of this study is to generate a music transition sequence (MTS), in which the generated MTS must refer to its preceding music sequence and the following one. The preceding music sequence and the following music sequence are the input for the training models, and the MTS is the output. In data processing, the music MIDI files are converted into REMI format and segmented into units of n tokens each. In this paper, the value of n is set to 256 (based on the NVIDIA GeForce RTX 2080, the equipment used in this experiment). In the experiment, every three consecutive units form one sample of training input and the corresponding output data. In reference to Figure 9, the sliding window is designed to cover three units, and the window moves one unit forward at a time to generate the next training sample.
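The sliding-window procedure can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state explicitly: the middle unit of each window is the target MTS, and a trailing unit shorter than n tokens is discarded. The function name `make_training_samples` is hypothetical.

```python
def make_training_samples(tokens, unit_len=256):
    """Split a token sequence into units and slide a three-unit window over them."""
    # Assumption: a trailing unit shorter than unit_len is dropped.
    usable = len(tokens) - len(tokens) % unit_len
    units = [tokens[i:i + unit_len] for i in range(0, usable, unit_len)]
    samples = []
    # The window covers three consecutive units and advances one unit at a time.
    for i in range(len(units) - 2):
        preceding, transition, following = units[i], units[i + 1], units[i + 2]
        # Input: (preceding, following); output: the middle unit.
        samples.append(((preceding, following), transition))
    return samples


# Toy run with integer tokens and a short unit length for readability.
toy_samples = make_training_samples(list(range(10)), unit_len=2)
```

A sequence of u complete units therefore yields u - 2 samples, and neighbouring samples share two of their three units.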

Deep Learning Framework
In this paper, we use a transformer-based model to generate music transition sequences (MTSs). A preceding music sequence and a following music sequence are used as the input to generate an MTS that bridges the two music sequences. The transformer deep learning model proposed by Vaswani [23] has not only achieved good results in NLP, but is also widely applied in other fields. In the music field, Huang [25] successfully used the Music Transformer to generate subsequent music sequences from a short segment. In this study, we likewise apply a transformer-based model to the MTS generation problem.
The structure of the model follows the Encoder-Decoder structure used by Vaswani [23] in the transformer learning model. Since the goal of this paper is to generate an MTS between two music sequences, our framework, consisting of two encoders and one decoder, is applied to build the transformer-based model, as shown in Figure 10. By using two encoders to capture key information from the preceding music sequence and the following one, the decoder can generate a suitable MTS with reference to the features of the two music sequences.
In our model, we also make use of the positional encoding used in Transformer-XL [26], which is different from the one used in the Transformer [23]. In the Transformer, positional encoding is applied to add positional information to the tokens, so that the positional difference between tokens can be recognized when performing self-attention computation. Sinusoid values of different frequencies are added to the input sequence, which can be seen as implanting absolute positional information into the input sequence. After Shaw's paper [30] was published, subsequent studies of transformers [25,26] started to employ relative positional encoding in the models. In these studies [25,26], the training error with relative positional encoding is smaller than that with absolute positional encoding. As a result, in this paper, we also employ relative positional encoding in the transformer-based model used in the experiment.
In the transformer-based model in this paper, the encoder is the same as that used in the Transformer [23], which consists of a multi-head attention layer and a position-wise feed-forward layer, and these layers are connected through layer normalization and residual connections. The decoder, on the other hand, is adapted from the one in the Transformer. As shown in the Decoder part in Figure 11, the sub-layer at the bottom adopts the masked multi-head attention layer from the Transformer [23], and after it, two consecutive multi-head attention layers are used. One of them is responsible for receiving the encoder information from the preceding music sequence of the MTS, and the other for the encoder information from the following music sequence. Lastly, the results from the two multi-head attention layers are concatenated and passed through a feed-forward layer. In the Decoder part, the (masked) multi-head attention layers are connected to each other using layer normalization and residual connections. Only the concatenated feed-forward layer at the end of the process is connected to the masked multi-head attention layer at the beginning using a residual connection. Figure 12 shows our framework of the transformer-based model.

Sampling
In the framework, the trained transformer-based model only predicts the probability of the next token. To generate a sequence of tokens, we apply the temperature-controlled stochastic sampling method with top-k [31] to determine consecutive tokens. As shown in Figure 13, the token probabilities predicted by the model undergo temperature sampling before top-k is used for the selection of the next token.
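A minimal sketch of temperature-controlled top-k sampling is given below. It is an illustration only: it assumes the model exposes unnormalized scores (logits) for the next token, and the default values of `temperature` and `k` are hypothetical, since the values used in the experiments are not reported here.

```python
import math
import random


def sample_next_token(logits, temperature=1.2, k=5, rng=random):
    """Pick the index of the next token from unnormalized scores."""
    # Temperature scaling: values above 1 flatten the distribution, below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    # Keep only the k highest-scoring token indices.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax numerators over the surviving tokens (shifted for numerical stability).
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
greedy_choice = sample_next_token([0.1, 2.0, -1.0], k=1, rng=rng)
```

Setting k to 1 reduces the procedure to greedy decoding, while a larger k and a higher temperature increase the variety of the generated sequence. Repeating the call and appending each sampled token to the decoder input produces a full sequence.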

Loss Function
In our framework, the KL divergence defined in Equation (2) is used as the loss function. When processing data, the prediction target is converted into one-hot encoding, and when training the model, the output is the probability of each class. Thus, this paper uses KL divergence to measure the degree of difference between the two probability distributions with the following equation:
$$D_{KL}(T \,\|\, P) = \sum_{x} T(x) \log \frac{T(x)}{P(x)} \tag{2}$$
$T(x)$ is the probability distribution of the predicted target, and $P(x)$ is that of the model output. The KL divergence gives the expected information lost when the model output is used in place of the probability distribution of the predicted target, which can be used to adjust the model parameters and improve the accuracy of the trained model.
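As a small numerical illustration of this loss, the sketch below computes the KL divergence between a target distribution and a predicted distribution given as plain lists. The function name and the `eps` guard against a zero prediction are assumptions made for the example.

```python
import math


def kl_divergence(target, predicted, eps=1e-12):
    """KL divergence between two discrete distributions given as lists of probabilities."""
    # Terms with zero target probability contribute nothing to the sum.
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


one_hot_target = [0.0, 1.0, 0.0]
model_output = [0.25, 0.5, 0.25]
loss_value = kl_divergence(one_hot_target, model_output)
```

For a one-hot target, the sum reduces to the negative log-probability that the model assigns to the correct class, so the loss is zero only when the model puts all probability on that class.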

Optimizer
This section describes the optimizer used in the model training process.

Adam Optimizer
The training process uses Adaptive Moment Estimation (Adam) as the optimizer in the experiment. Adam is one of the commonly used optimizers in deep-learning models, combining the advantages of momentum and RMSprop, as shown in the following equations:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \tag{3a}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \tag{3b}$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{3c}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{3d}$$
$$\theta_t = \theta_{t-1} - \gamma \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} \tag{3e}$$
In the equations, $\theta_t$ denotes the model parameters at step $t$. $m_t$ serves a similar purpose to momentum and is used to adjust the amount of correction made to the model. $v_t$, on the other hand, works like RMSprop and is used to adjust the learning rate dynamically according to the gradient of the loss. $\beta_1$ and $\beta_2$ are the decay rates of $m_t$ and $v_t$, and $g_t$ is the gradient of the loss in the training model. Equations (3c) and (3d) keep $m_t$ and $v_t$ from leaning toward 0 in the early stages of model training, which would lead to excessive model correction and scattered training results. $\varepsilon$ in Equation (3e) is a parameter that keeps the denominator from being 0, while $\gamma$ is the learning rate set for Adam.
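The Adam update can be illustrated with a scalar sketch. The default hyperparameters below (0.9, 0.999, and 1e-8) are the commonly used defaults and are an assumption, since the values used in the experiments are not stated here; the toy objective is likewise hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # momentum-like first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # RMSprop-like second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem: minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
history = [theta]
for t in range(1, 51):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
    history.append(theta)
```

Because of the bias correction, the very first update has a magnitude close to the learning rate regardless of the gradient scale, which is why the first step here moves the parameter by about 0.01.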

Warmup
Vaswani [23] found in the course of the experiment that the model gradient changes greatly in the early stages of training the transformer, which may cause Adam to change the model parameters too much in the early stages of training, possibly resulting in scattered training results. For this reason, Vaswani used warmup to assist Adam when training the transformer. Warmup is controlled by a number of warmup steps: while the number of training steps is smaller than the number of warmup steps, the learning rate is increased gradually according to the equation below; once the number of training steps exceeds the number of warmup steps, the learning rate is gradually decreased, as shown in Figure 14. The equation for warmup is as follows:
$$lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right) \tag{4}$$
$lrate$ is the learning rate used by the optimizer, and $d_{model}$ is the dimension size set by the transformer in the embedding layer. $step\_num$ stands for the number of training steps performed so far, and $warmup\_steps$ is the value set for the warmup steps. Vaswani [23] used warmup to avoid scattered training results in the early stages of training a transformer, as well as to minimize the gradient of loss.
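The warmup schedule can be sketched in a few lines. The value 512 for `d_model` is the base setting of [23] and is an assumption here; 4000 warmup steps matches one of the curves in Figure 14. The function name `warmup_lr` is hypothetical.

```python
def warmup_lr(step_num, d_model=512, warmup_steps=4000):
    """Learning rate with linear warmup followed by inverse-square-root decay."""
    if step_num <= 0:
        # Before the first step the learning rate starts at 0.
        return 0.0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)


# The peak is reached exactly when step_num equals warmup_steps.
peak = warmup_lr(4000)
schedule = [warmup_lr(s) for s in (0, 1000, 4000, 8000, 16000)]
```

During warmup the second argument of `min` is the smaller one, so the rate grows linearly with the step number; afterwards the first argument takes over and the rate decays with the inverse square root of the step number.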

Regularization
This section explains the regularization used to improve the result accuracy from the training model.

Label Smoothing
Label smoothing is used during model training to replace the hard one-hot prediction target with a soft one-hot encoding, to which the output of the model is then fitted. Label smoothing prevents the model from generating over-confident results, and the equation for label smoothing is as follows:
$$y_c^{LS} = y_c (1 - \alpha) + \frac{\alpha}{K} \tag{5}$$
$y_c$ is the value of the predicted target for the c-th class, $K$ is the number of classes, and $\alpha$ is the smoothing parameter. When $\alpha$ equals 0, we have the value of the original prediction target, while when $\alpha$ equals 1, the distribution of the prediction target is uniform. Label smoothing first appeared in Szegedy's paper [32], and although label smoothing raises the uncertainty of the model and reduces the accuracy during training, it improves the accuracy during validation. Vaswani's paper [23] used label smoothing to raise their BLEU scores. Note that BLEU (bilingual evaluation understudy) is a metric for evaluating the quality of text that has been machine-translated from one natural language to another. BLEU indicates how similar the candidate text is to the reference text, with values closer to one representing more similar texts [33]. This paper follows the label smoothing used in Vaswani's paper [23] and sets the value of parameter $\alpha$ to 0.1 to improve the accuracy during validation. This also allows the model to obtain better results when using the temperature-controlled stochastic sampling method with top-k [31].
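The effect of label smoothing on a single one-hot target can be shown with a short sketch. The function name `smooth_labels` and the four-class toy target are hypothetical; the value 0.1 is the setting reported above.

```python
def smooth_labels(one_hot, alpha=0.1):
    """Mix a one-hot target with the uniform distribution over its classes."""
    num_classes = len(one_hot)
    return [y * (1 - alpha) + alpha / num_classes for y in one_hot]


soft_target = smooth_labels([0, 0, 1, 0], alpha=0.1)
unchanged = smooth_labels([0, 0, 1, 0], alpha=0.0)
uniform = smooth_labels([0, 0, 1, 0], alpha=1.0)
```

With four classes and a smoothing parameter of 0.1, the correct class keeps most of the probability mass and every other class receives a small share, so the smoothed target still sums to one.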

Dropout
One of the challenges in the machine learning training process is over-fitting. In past research, some used model combination as a solution, but this method requires several different models for training, and the large amount of training data as well as the computational costs make it difficult to adopt. In 2014, Hinton [34] proposed Dropout, a training method similar to model combination. Dropout removes some units from the neural network and trains the resulting thinned neural network instead. The loss resulting from the thinned neural network during training is propagated back to update the thinned network. After the update, the model restores the units that were removed earlier to recover the original neural network. This process is repeated until the training is finished. The following are the equations of the neural network before Dropout is adopted:
$$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\left(z_i^{(l+1)}\right) \tag{6}$$
After Dropout is adopted:
$$r_j^{(l)} \sim \mathrm{Bernoulli}(p), \qquad \tilde{y}^{(l)} = r^{(l)} * y^{(l)}, \qquad z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\left(z_i^{(l+1)}\right) \tag{7}$$
$l$ indexes the hidden layers of the neural network; $z_i^{(l)}$ is the pre-activation output of the i-th unit of Layer $l$; $f$ is any activation function; $y_i^{(l)}$ is the output of the i-th unit of Layer $l$ after going through the activation function $f$. $b_i^{(l)}$ and $w_i^{(l)}$ are the bias and weight of the i-th unit of Layer $l$, $*$ denotes the element-wise product, and $r^{(l)}$ is a vector of Bernoulli random variables $r_j^{(l)}$, each of which takes the value 1 with probability $p$.
Using Dropout in a neural network is like sampling multiple thinned neural networks from the original network. When there are $n$ units in the neural network, we can sample $2^n$ kinds of thinned neural networks. That is, training a neural network using Dropout can be seen as training a collection of $2^n$ thinned neural networks, just like the model combination explained above. In this paper, Dropout is used in the model, and the probability of Dropout is set to 0.1.
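The masking step of Dropout can be sketched for one layer's activations as follows. This sketch uses the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at inference time; that rescaling is an assumption of the example and is not described in the text. Here `drop_prob` is the probability of removing a unit, so the Bernoulli mask takes the value 1 with probability `1 - drop_prob`.

```python
import random


def dropout(activations, drop_prob=0.1, training=True, rng=random):
    """Randomly zero activations during training; pass them through otherwise."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Bernoulli mask: 1 keeps the unit, 0 removes it from the thinned network.
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    # Assumption: inverted dropout, so kept activations are scaled by 1 / keep_prob.
    return [a * r / keep_prob for a, r in zip(activations, mask)]


rng = random.Random(0)
train_out = dropout([1.0] * 20, drop_prob=0.1, rng=rng)
eval_out = dropout([1.0] * 20, drop_prob=0.1, training=False)
```

Each call draws a fresh mask, which corresponds to sampling a different thinned network for every training step.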

Experiment
In this section, we introduce the data source and evaluation in our experiments.

Data Source
We have two datasets, namely pop music and classical music. In our experiment, we train two different models using the two datasets separately. The dataset used for pop music is the training dataset provided by Huang [28], which contains 775 pieces, including MIDI files of Western pop, Korean pop music, and music from Japanese anime. The dataset used for classical music consists of the recorded MIDI files of 290 pieces played by the contestants in the Piano-e-Competition in 2018. Before the deep learning model was trained, the dataset of each music style was divided into 80% training data, 10% validation data, and 10% test data.

Evaluation
In the experiments, we trained two sets of model parameters (a stack of six layers and a stack of three layers) separately on the pop and classical music datasets. At the end of the training, we compared the results by examining the KL divergence loss on the datasets, and the results are shown in the tables below. For both styles, the model with a stack of six layers has the lowest divergence loss. Table 3 shows the experimental results with the pop dataset, and Table 4 those with the classical dataset.

Listening Test
In addition to the data analysis shown in Section 4.2, we also designed a listening test to evaluate the results generated by the model. The listening test took the form of an online questionnaire with a total of 20 questions, ten each for the pop and classical genres. Each question consisted of a piece of music paired with a rating question. Before the test, the subjects were asked whether or not they were music professionals who understand basic music theory and have played a musical instrument for more than six years. For each of the 20 questions, the subjects had to listen to the music first before answering.
The folder (at https://reurl.cc/73XArb) contains all twenty music clips in the listening test. In the folder, there are two subdirectories: "Pop Testing Music" and "Piano_e Testing Music". Each subdirectory contains ten midi files, for instance, "Question 1(original).midi" and "Question 2(model).midi". The file names are explained with the following examples: "Question 1(original).midi" indicates that the music is for question 1 and that all of the midi is cut from the original music file. "Question 2(model).midi" indicates that the music is for question 2 and that part of the midi (i.e., the music transition) is generated through our approach. In the experiment itself, the subjects (listeners) were not told which midi files were from original music and which were generated using our approach.
The music pieces in the test were sequences of 768 REMI token events. Ten sequential segments of 768 REMI token events were randomly selected from each of the pop and classical datasets. In five of these segments, the 256 REMI token events in the middle were replaced by 256 model-generated REMI token events, which were sampled using the temperature-controlled stochastic sampling method with the top-k algorithm [31]. The other five segments of REMI token events were left unchanged. In the test, the subjects were asked to rate the fluency of the music clips on a scale from 1 to 5, with 5 being the most fluent and 1 the least, as shown in Figure 15.
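The construction of a model-generated test clip can be sketched as a simple splice of token lists. The function name `splice_transition` is hypothetical, and the short toy sequences stand in for the 768-event clips and 256-event transitions used in the test.

```python
def splice_transition(clip, generated, unit_len=256):
    """Replace the middle unit of a three-unit clip with a generated transition."""
    if len(clip) != 3 * unit_len or len(generated) != unit_len:
        raise ValueError("clip must hold three units and generated exactly one unit")
    # Keep the preceding and following units; swap only the middle one.
    return clip[:unit_len] + generated + clip[2 * unit_len:]


# Toy run with a unit length of 2 instead of 256.
original_clip = [1, 2, 3, 4, 5, 6]
test_clip = splice_transition(original_clip, [0, 0], unit_len=2)
```

The first and last units of the clip are untouched, so listeners hear original material on both sides of the generated transition.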
In this test, we recruited 26 subjects, four of whom are music professionals who know basic music theory and have played musical instruments for more than six years; the rest are non-professionals. In pop music, the model-generated clips scored 3.45 on average and clips from original pieces scored 4.19; in classical music, the model-generated clips scored 3.02 on average and clips from original pieces scored 3.55. We use a boxplot to illustrate the distribution of the scores of each test. In reference to Figure 16, the test scores of original music clips are more concentrated than those of the model-generated ones. In addition, only one or two model-generated music clips are considered to be more fluent than most of the original music clips. Compared with original music clips, more than half of the model-generated music sounds choppy and rough.

Conclusions
This paper explores music transition, and the goal is to generate a music transition sequence (MTS) that fills the gap between a preceding music sequence and a following one so that they can be connected. In the experiment, both pop and classical music datasets were used to train the Transformer-based model, under the assumption that the two music sequences (the preceding one and the following one) and the generated MTS all consist of 256 REMI token events. Under this condition, the Transformer-based model was trained for a total of 1000 epochs, and a comparison of the two sets of model parameters (a stack of six layers and a stack of three layers) shows that the model with six layers has a lower validation loss. In addition, this paper conducted a listening test with twenty questions. The music sequences in ten of the questions were existing music clips (768 REMI token events), and the remaining ten questions used clips in which the middle 256 REMI token events of the original 768 were replaced with 256 events generated by the model. The test results showed that although the existing music sequences scored higher on average than the model-generated ones, the average score of the model-generated ones was above 3, the midpoint of the five-point scale. We believe that, with further improvement, the model can obtain results comparable to the original music in future studies.

Figure 3. Categories of algorithms for automatic composition.

Figure 4. An illustration of the Transformer model.

Figure 5. Encoder and Decoder in the Transformer.

Figure 8. An example of REMI representation.

Figure 9. Data processing of training samples for model construction.

Figure 13. Temperature-controlled stochastic sampling method with top-k.

Figure 14. Warmup (learning rate initially set to 0, with warmup step at 4000 for the blue line and 8000 for the orange).

Figure 15. The screenshot of the listening test. (The text in the interface is in Chinese. We provide the English translation indicated with the box.)

Figure 16. The boxplot of test scores.

Table 2. Comparison between MIDI-like and REMI representations.

Table 3. Results with popular music dataset.

Table 4. Results with classical music dataset.