Automatic Melody Harmonization via Reinforcement Learning by Exploring Structured Representations for Melody Sequences

We present a novel reinforcement learning architecture that learns a structured representation for use in symbolic melody harmonization. Probabilistic models are predominant in melody harmonization tasks, most of which only treat melody notes as independent observations and do not take note of substructures in the melodic sequence. To fill this gap, we add substructure discovery as a crucial step in automatic chord generation. The proposed method consists of a structured representation module that generates hierarchical structures for the symbolic melodies, a policy module that learns to break a melody into segments (whose boundaries concur with chord changes) and phrases (the subunits in segments) and a harmonization module that generates chord sequences for each segment. We formulate the structure discovery process as a sequential decision problem with a policy gradient RL method selecting the boundary of each segment or phrase to obtain an optimized structure. We conduct experiments on our preprocessed HookTheory Lead Sheet Dataset, which has 17,979 melody/chord pairs. The results demonstrate that our proposed method can learn task-specific representations and, thus, yield competitive results compared with state-of-the-art baselines.


Introduction
Automatic melody harmonization has received a great deal of attention since it was first introduced [1]. Automatic harmonization can help both untrained people as well as practicing musicians in their music creation tasks. Most existing methods, however, are limited to generating chord sequences from fixed predefined structures, such as a bar or half a bar. More meaningful and flexible structures (such as "phrases") are rarely considered in melody harmonization. To create chord accompaniments that are more accurate (w.r.t. musical rules) and perhaps also richer in colors, we present a reinforcement learning-based method to automatically discover phrase-level structure and predict the onset of chords for a given piece of melodic music.
Automatic melody harmonization has a long history starting in the 1970s. In the early stages, harmonization systems relied heavily on hard rules. Like language, music has hierarchical structures that can be described by something akin to a grammar. Hence, some works have created sets of grammar-like rules that can generate chords from a melody. For instance, a rule-based system was proposed by Ebcioglu et al. [2]. With over 270 rules, that system was designed for the harmonization of four-part chorales in the style of Bach. This was followed by two subsequent works [3,4]. In some of the latest learning-based harmonization and generation works, these earlier rules were used to enhance the performance of statistical methods [5][6][7][8]. Learning-based methods, in general, can avoid the intensive labour needed to enumerate all the necessary rules.
Given the high availability and low costs of computation power today, many attempts have been made to employ probabilistic models and deep learning techniques in harmonization problems. Hidden Markov Models (HMMs) have been shown to be effective in melody harmonization [9][10][11][12]. Like spoken languages, music also exhibits long term dependencies in melody and chord sequences [12].
However, rather than modelling the dependencies between observations, HMM treats each observation independently. To address the gap, a line of research has attempted to employ deep learning techniques to model long-term dependencies over a sequence of melodic bars. The studies by [13,14] use Bidirectional Long Short-Term Memory (BiLSTM) to model dependencies between adjacent music events and generate chord sequences correspondingly. These methods outperformed probabilistic models and successfully demonstrated the importance of long-term dependency in music.
Although deep learning techniques have improved the performance of harmonization tasks, they are usually limited to a fixed number of generated chords per melody bar or measure. In [7,9,13,15], the proposed methods are limited to generating one chord per measure, and, in [14,15], to one chord per half-bar. Researchers have very rarely considered substructures (e.g., phrases or segments in a melody) for melody harmonization tasks. Tsushima et al. proposed using tree structures to model chords and harmonic functions [11]. However, they did not explore the substructures in melody sequences. Tree structures can also be very complex and usually require explicit phrase boundary annotations.
We believe that exploring substructures in melodies can help to learn a better representation for melody harmonization tasks. In this work, we define two terms utilized to learn a structured representation for melodies: (a) Segment, the boundary of which concurs with the chord change; that is, one segment can be used to generate one chord. (b) Phrase, which further divides a segment into subunits. The two terms are intuitively illustrated in Figure 1.
The substructures in a melody connect with each other to form a meaningful musical thought [16], which implies that analysing the structural components can help to better understand a whole melody and, thus, generate chords for it. Our method first identifies the boundaries of phrases and segments in a melody. With the obtained boundaries, we discover the note-level connections in a phrase, the phrase-level connections in a segment, and segment-level connections in a melody sequence with a hierarchical LSTM. Finally, each learned segment representation is utilized to generate chords.
Figure 1. Phrases and segments in "Auld Lang Syne" (legend: phrase, segment, GPR boundary). The segment's start and end align with the change of chords. The phrases are subunits of the segments and are acquired by splitting the segments with the GPR boundary rules (Grouping Preference Rules [17] in the Generative Theory of Tonal Music (GTTM) [18]), which are introduced in Section 2.5.
The definition of "phrase" is ambiguous in the field of music. In western classical music, a "phrase" usually refers to a musical sequence that consists of consecutive notes expressing a complete musical thought [19], and such a phrase is roughly 4-8 measures long [20]. In today's popular music, the end of a "phrase" normally coincides with the taking of breath [21]; such "phrases" are usually longer than what we need in structure discovery for chord generation. For example, in the dataset used in [22], most of the human-annotated phrases have a length of around nine notes.
Grouping rules have also been developed to split a melody into phrases for music analysis. GPR in GTTM has been applied to obtain music phrases and furthermore to help with melody phrase embedding in [23]. However, GPR contains several subrules, each of which yields different possible boundaries [24], resulting in a challenge to consider all the subrules in phrase boundary identification. Moreover, the phrase-level structure that best suits the harmonization tasks is not necessarily a part of GPR or similar rules. Consequently, the prediction of phrase boundaries in this work is performed in an unsupervised manner.
The systems proposed in [25,26] showed the possibility of leveraging reinforcement learning in duet generation and structured representation learning. We propose, herein, a novel melody harmonization model based on reinforcement learning that can achieve state-of-the-art results when applied to the Hooktheory Leadsheet Dataset (HLSD). (The Hooktheory Leadsheet Dataset was compiled by [14] from the Hooktheory website https://www.hooktheory.com/site, and the dataset is available on https://github.com/wayne391/lead-sheet-dataset; accessed on 4 October 2021). We adopted the Synchronous Advantage Actor Critic (A2C) algorithm [27] and treated structure segmentation as a sequential decision problem that identifies the boundaries of segments and phrases, which facilitates chord generation by learning structured representations for melody sequences. Such a method can be easily applied in various music information retrieval (MIR) tasks where hierarchical structures exist but task-specific structure annotations are not available.
This paper attempts to make the following contributions:
1. to employ reinforcement learning, for the first time, to discover substructures in melody for symbolic melody harmonization;
2. based on the discovery of substructures, to deal with the tasks of segmentation and harmonization by learning structured representations of the given symbolic melody;
3. through experiments using our processed dataset, to show that our proposed method outperforms other baseline methods.

Model
The goal of this paper is to improve melody harmonization using optimized representations that consider phrase-level structures. As shown in Figure 2, the overall architecture consists of three components: Structured Representation Module (REP), Segmentation Module (SEG) and Segment Harmonization Module (HAR). With a three-level LSTM, REP learns the note-level, phrase-level and segment-level representation for each note. The three-level representations are then concatenated to form the state of each note, based on which SEG samples an action deciding whether the current note is the boundary of a phrase or segment.
After SEG decides the actions for all the notes in a melody, REP translates the action sequence into a structured representation (note-level representations are grouped into phrase-level representations, which are further grouped into segment-level representations). Finally, with the learned representations, HAR generates the chords for each of the obtained segments. Given a melody $M = (m_1, m_2, \ldots, m_T)$ that consists of $T$ notes, REP provides the state of each note to SEG. SEG then outputs a sequence of decisions $A = (a_1, a_2, \ldots, a_T)$, $a_t \in \{0, 1, 2\}$, where $a_t = 0$ indicates that note $m_t$ is inside a phrase, $a_t = 1$ indicates that note $m_t$ is at the end of a phrase, and $a_t = 2$ indicates that note $m_t$ is at the end of a segment. Consequently, the melody is formed by a sequence of segments, denoted as

$$M = (g_1, g_2, \ldots, g_L), \tag{1}$$

where $L$ is the number of segments in melody $M$. For each segment $g_i$, we have

$$g_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,N_i}), \tag{2}$$

where $N_i$ is the number of phrases in segment $g_i$. For each phrase, we have

$$p_{i,j} = (m_{i,j,1}, m_{i,j,2}, \ldots, m_{i,j,K_j}), \tag{3}$$

where $K_j$ is the number of notes in phrase $p_{i,j}$ (the $j$-th phrase of the $i$-th segment), $i \in \{1, \ldots, L\}$ and $j \in \{1, \ldots, N_i\}$.
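To make the action encoding concrete, the following minimal sketch groups a note sequence into segments of phrases from a per-note action sequence. The function name `group_notes` and the handling of trailing notes that lack a closing action are assumptions for illustration, not part of the proposed model.

```python
def group_notes(notes, actions):
    """Group notes into segments (lists of phrases) from actions in {0, 1, 2}."""
    if len(notes) != len(actions):
        raise ValueError("notes and actions must have the same length")
    segments, phrases, phrase = [], [], []
    for note, action in zip(notes, actions):
        phrase.append(note)
        if action in (1, 2):   # the current phrase ends at this note
            phrases.append(phrase)
            phrase = []
        if action == 2:        # the current segment also ends at this note
            segments.append(phrases)
            phrases = []
    # Assumption: notes left over without a closing action form a final segment.
    if phrase:
        phrases.append(phrase)
    if phrases:
        segments.append(phrases)
    return segments


melody = ["m1", "m2", "m3", "m4", "m5", "m6"]
actions = [0, 1, 0, 2, 0, 2]
structure = group_notes(melody, actions)
print(structure)
```

Here the first segment contains two phrases and the second segment contains one, so HAR would be asked for two chords.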
As a feedback signal, the action $a_t$ sampled by the RL agent for each note is sent back to REP and translated into a structured representation using a hierarchical LSTM network. The learned structured representations are then fed into HAR to predict a chord sequence $C_{1:L}$ for the melody segments $G_{1:L}$. The ultimate goal of HAR is to approximate the function $h: G_{1:L} \rightarrow C_{1:L}$.

As the three modules interact with each other, we train them jointly. In the first stage, to pre-train REP and HAR, we use the chord boundaries provided by HLSD (as the segment boundaries) and the boundaries acquired by GPR from GTTM (as the phrase boundaries). In the second stage, we fix the parameters of REP and HAR and train SEG in a semi-supervised manner. The chord boundaries in HLSD are used as the ground-truth labels to train SEG for segment boundary prediction. The prediction of phrase boundaries is trained in an unsupervised way, as there is no ground truth for phrase-level structures in the dataset. In the last stage, the three modules are trained together to achieve the best harmonization results.
To better understand the process, we first introduce the symbolic melody encoding and how REP learns structured representations. After that, we explain how we adopt reinforcement learning to explore phrase-level and segment-level structures and split melody sequences into segments and phrases. Finally, we describe how HAR predicts the chords for the melody segments with the learned structures.

Symbolic Melody Encoding
In our melody harmonization task, each music piece consists of two components: (1) the human-composed melody $M = (m_1, m_2, \ldots, m_T)$ and (2) the machine-generated chord accompaniment $C = (c_1, c_2, \ldots, c_L)$. As a note-based representation is closer to human perception in music composition and harmonization, we adopt a note-based representation. Similar to [28], we extract pitch and duration information from a melody token and encode them into a 13-dimensional vector $x$. Specifically, the 13-dimensional vector contains two kinds of information:
(1) Chroma information: each index in $x$ represents a pitch class from {Rest, C, C#, D, D#, E, F, F#, G, G#, A, A#, B}. We consider C# and Db, D# and Eb, F# and Gb, G# and Ab, A# and Bb to be enharmonically equivalent; therefore, all pitch classes are spelled with sharp signs.
(2) Duration information: the vocabulary representing the duration context is denoted as $D = \{1, 2, 3, \ldots, 24\}$, where 1 is the duration value of a sixteenth note (the resolution of each melody measure in our processed HLSD; for details, see Section 3.1).
For each melody note $m_t$ with pitch class $r$ and duration value $d$, the $r$-th element of its 13-dimensional feature vector $x_t$ is set to $d$.
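The encoding can be sketched in a few lines. The index order below follows the pitch-class set as listed (with Rest at index 0), and all other elements are set to zero; both choices, as well as the name `encode_note`, are assumptions made for illustration.

```python
PITCH_CLASSES = ["Rest", "C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]
# Flat spellings are mapped to their sharp enharmonic equivalents.
FLAT_TO_SHARP = {"Db": "C#", "Eb": "D#", "Gb": "F#", "Ab": "G#", "Bb": "A#"}


def encode_note(pitch_class, duration):
    """Return a 13-dimensional vector whose pitch-class slot holds the duration."""
    if not 1 <= duration <= 24:
        raise ValueError("duration must be in 1..24 (sixteenth-note units)")
    name = FLAT_TO_SHARP.get(pitch_class, pitch_class)
    vector = [0] * len(PITCH_CLASSES)
    vector[PITCH_CLASSES.index(name)] = duration
    return vector


quarter_e_flat = encode_note("Eb", 4)   # a quarter note lasts four sixteenths
half_rest = encode_note("Rest", 8)
print(quarter_e_flat)
```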

Structured Representation Module
Given a melody $M = (m_1, m_2, \ldots, m_T)$ containing $T$ notes, each note is represented by a 13-dimensional feature vector $x_t$ containing chroma and duration information. To learn the structured representations for a given melody, we employ a hierarchical LSTM (HLSTM) consisting of three levels: a note-level LSTM that sequentially connects the notes in a phrase, a phrase-level LSTM that captures phrase-level context dependencies and a segment-level LSTM that yields a comprehensive representation of the whole melody sequence (see Figure 4). The update of the cell states and hidden states in our HLSTM depends on the actions sampled by SEG.

At the note level, an individual LSTM model $\mathrm{LSTM}_n$ is used to connect a sequence of melody notes to construct a phrase. The propagation of the note-level LSTM depends on the action at position $t-1$. If the previous note is at the end of a phrase or a segment, i.e., $a_{t-1} = 1$ or $a_{t-1} = 2$, $\mathrm{LSTM}_n$ starts to connect notes for a new phrase from a zero-initialized state. More precisely,

$$(h^n_t, c^n_t) = \begin{cases} \mathrm{LSTM}_n(x_t, \mathbf{0}, \mathbf{0}), & a_{t-1} \in \{1, 2\}, \\ \mathrm{LSTM}_n(x_t, h^n_{t-1}, c^n_{t-1}), & a_{t-1} = 0, \end{cases} \tag{4}$$

where $x_t$ is the vector representation of melody note $m_t$; $h^n_t \in \mathbb{R}^d$ and $c^n_t \in \mathbb{R}^d$ are the current hidden state and current memory cell at position $t$ of $\mathrm{LSTM}_n$, respectively.

The propagation of the phrase-level LSTM also relies on the action $a_t$ that the agent takes at note $t$ in a melody sequence. When $a_t = 0$, note $m_t$ is still inside a phrase and the phrase has not been completely constructed; thus, the hidden state and memory cell state are directly copied from the preceding position $t-1$. When $a_t = 1$, note $m_t$ is at the end of a phrase and the phrase is now completely formed. When $a_t = 2$, note $m_t$ is at the end of a segment and is also the boundary of the last phrase in this segment. In the latter two cases, the hidden state $h^p_t$ and memory cell $c^p_t$ are updated by $\mathrm{LSTM}_p$:

$$(h^p_t, c^p_t) = \begin{cases} (h^p_{t-1}, c^p_{t-1}), & a_t = 0, \\ \mathrm{LSTM}_p(h^n_t, h^p_{t-1}, c^p_{t-1}), & a_t \in \{1, 2\}, \end{cases} \tag{5}$$

where $h^p_t \in \mathbb{R}^d$ and $c^p_t \in \mathbb{R}^d$ are the current hidden state and the current memory cell at position $t$ of $\mathrm{LSTM}_p$.
Similarly, actions also decide how the hidden state and cell state are updated in the segment-level LSTM. When $a_t = 0$ or $a_t = 1$, note $m_t$ is still inside a segment and the segment has not been completely constructed; thus, the hidden state $h^s_t$ and memory cell state $c^s_t$ are not updated. When $a_t = 2$, note $m_t$ is at the end of a segment, and the segment is now completely formed. The hidden state $h^s_t$ and memory cell $c^s_t$ are then updated by $\mathrm{LSTM}_s$:

$$(h^s_t, c^s_t) = \begin{cases} (h^s_{t-1}, c^s_{t-1}), & a_t \in \{0, 1\}, \\ \mathrm{LSTM}_s(h^p_t, h^s_{t-1}, c^s_{t-1}), & a_t = 2, \end{cases} \tag{6}$$

where $h^s_t \in \mathbb{R}^d$ and $c^s_t \in \mathbb{R}^d$ are the current hidden state and the current memory cell at position $t$ of $\mathrm{LSTM}_s$.
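The action-dependent gating can be traced without any neural network. The sketch below only records, for each note, which level would be reset or updated; it does not implement the LSTM cells themselves. The function name and the convention that the position before the first note counts as a boundary (so the note-level LSTM starts from a zero state) are assumptions for illustration.

```python
def gated_updates(actions):
    """For each note, report which HLSTM levels are reset or updated."""
    events = []
    previous = 2  # assumption: the melody start behaves like a segment boundary
    for action in actions:
        events.append({
            # The note-level LSTM restarts from a zero state after a boundary.
            "note_reset": previous in (1, 2),
            # The phrase-level LSTM is updated only when a phrase is complete.
            "phrase_update": action in (1, 2),
            # The segment-level LSTM is updated only when a segment is complete.
            "segment_update": action == 2,
        })
        previous = action
    return events


trace = gated_updates([0, 1, 0, 2])
for step, event in enumerate(trace, start=1):
    print(step, event)
```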

Segmentation Module
The problem of structure segmentation is formulated as a reinforcement learning problem in SEG, where an agent interacts with an episodic environment. At each time step $t$, the agent receives a state $s_t$ and takes a corresponding action $a_t \in \{0, 1, 2\}$ that decides whether the note $m_t$ is the boundary of a phrase or a segment. As the segmentation is performed in a semi-supervised way, we design two rewards to evaluate the decision sequence $A = (a_1, a_2, \ldots, a_T)$. One is an intermediate reward that compares the predicted segment boundaries with the chord boundaries in HLSD. The other is a long-term delayed reward that measures the performance of phrase boundary prediction based on the harmonization results.

State
During structure segmentation, we adopt a stochastic policy $\pi$ to generate a conditional distribution $\pi_\theta(a \mid s)$ over actions conditioned on the current state, where $\theta$ denotes the parameters in SEG. The state $s_t$ at time step $t$ encodes the current input and previous contexts for deciding whether the note at position $t$ is a boundary of a phrase or a segment. To provide adequate information, state $s_t$ is composed of the current note-level hidden state and memory state and the previous phrase-level and segment-level hidden and memory states:

$$s_t = h^n_t \oplus c^n_t \oplus h^p_{t-1} \oplus c^p_{t-1} \oplus h^s_{t-1} \oplus c^s_{t-1}, \tag{7}$$

where $\oplus$ denotes the vector concatenation operation.

Action
The policy $\pi$ samples an action $a_t \in \{0, 1, 2\}$ from the conditional probability $\pi_\theta(a_t \mid s_t)$ to represent whether a phrase or a segment is formed: $a_t = 0$ means the current note is inside a phrase or a segment, $a_t = 1$ indicates that the current phrase is now completely constructed, and $a_t = 2$ indicates that the current segment is now entirely formed. Formally,

$$\pi_\theta(a_t \mid s_t) = \mathrm{softmax}(W s_t + b), \tag{8}$$

where $\theta = \{W, b\}$ denotes the parameters used in SEG.
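A minimal sketch of sampling from such a three-way policy is given below, assuming a single linear layer followed by a softmax over the state vector. The toy state, the zero-valued parameters and the helper names are hypothetical and serve only to show the mechanics.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probabilities(state, weights, bias):
    """Compute softmax(W s + b); weights has one row per action."""
    logits = [sum(w * s for w, s in zip(row, state)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def sample_action(probabilities, rng):
    """Draw an action in {0, 1, 2} according to the policy distribution."""
    return rng.choices([0, 1, 2], weights=probabilities, k=1)[0]


state = [0.2, -0.1, 0.4]             # toy state vector (hypothetical values)
weights = [[0.0, 0.0, 0.0]] * 3      # untrained parameters: uniform policy
bias = [0.0, 0.0, 0.0]
rng = random.Random(0)
probs = action_probabilities(state, weights, bias)
action = sample_action(probs, rng)
print(probs, action)
```

Sampling, rather than taking the most probable action, is what lets the agent explore alternative phrase boundaries during training.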

Reward
For the intermediate reward, we use the chord boundary labels from HLSD to evaluate the predicted segments. If a segment boundary is correctly predicted, a positive reward is received; otherwise, a negative reward is given. The reward is computed with respect to $B_{target}$, the set of notes that lie at the chord boundaries provided by the dataset. The intermediate reward ranges from $-1$ to $1$.
As there is no ground truth for phrase boundaries, and considering that the ultimate goal of our model is to generate an appropriate sequence of chords to accompany the given melody, we design the delayed reward function based on the harmonization results. The predicted segment boundaries do not necessarily match the ground truth, so the number of chords generated for each melodic measure may vary and may even differ from the ground truth. Hence, we propose a weighted accuracy WA as the delayed reward. Formally,

$$\mathrm{WA} = \frac{\sum_{t=1}^{T} d_t \, \mathbb{1}\{\hat{y}_t = y_t\}}{\sum_{t=1}^{T} d_t}, \tag{10}$$

where $\mathbb{1}\{\delta\}$ is the indicator function, which outputs 1 when condition $\delta$ is true and 0 otherwise. $\hat{Y}_T = (\hat{y}_1, \ldots, \hat{y}_T)$ and $Y_T = (y_1, \ldots, y_T)$ are the predicted chord symbols and ground-truth chord symbols distributed to each note $m_t$ in a melody $M_{1:T}$, and $d_t$ denotes the duration of note $m_t$.

To give an intuition, assume we have a melodic bar with four notes whose duration sequence $D_4$ is $(4, 4, 4, 4)$. If the segmentation result from SEG is $(0, 2, 0, 2)$, HAR generates one chord for the first two notes and another chord for the last two notes. Suppose the first generated chord is C, the second generated chord is G, and the ground truth has only one chord, C, for this bar. Then we have $\hat{Y}_4 = (C, C, G, G)$ and $Y_4 = (C, C, C, C)$, and the weighted accuracy is 0.5. We use the weighted harmonization accuracy as the criterion to encourage better harmonization for each of the notes in a melody. In this way, a structured representation that is beneficial to the harmonization task can be learned.
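The worked example above can be checked numerically. The sketch below assumes that the weighted accuracy is the duration-weighted fraction of notes whose assigned chord matches the ground truth, which is consistent with the example; the helper names are hypothetical.

```python
def expand_chords(actions, chords):
    """Distribute one chord per segment to every note of that segment."""
    per_note, segment_index = [], 0
    for action in actions:
        per_note.append(chords[segment_index])
        if action == 2:  # a segment boundary moves on to the next chord
            segment_index += 1
    return per_note


def weighted_accuracy(predicted, truth, durations):
    """Duration-weighted share of notes whose predicted chord is correct."""
    total = sum(durations)
    correct = sum(d for p, y, d in zip(predicted, truth, durations) if p == y)
    return correct / total


durations = [4, 4, 4, 4]
predicted = expand_chords([0, 2, 0, 2], ["C", "G"])
truth = ["C", "C", "C", "C"]
wa = weighted_accuracy(predicted, truth, durations)
print(predicted, wa)
```

With unequal durations the weighting matters: a wrong chord on a long note costs more than a wrong chord on a short one.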

Policy
The vanilla policy gradient algorithm, REINFORCE, proposed in [29] is a common method to train an RL agent but is also known for its high variance in the gradient estimate [30], which tends to induce poor convergence. To mitigate this problem, we adopt the Synchronous Advantage Actor Critic (A2C) [27] method in our SEG. To verify whether A2C continually yields a better performance than REINFORCE in the melody harmonization task, we selected the REINFORCE algorithm as a baseline method in Section 3.4.
In A2C, a critic network is used to learn the state-value function $V(s)$, which estimates the average expected return. The "advantage" $A(s, a)$ is introduced to measure the benefit of performing action $a_t$ in state $s_t$. This offers an efficient way to approximate $V(s)$ only, rather than both $Q(s, a)$ and $V(s)$.
We maximize $A(s, a)$ to update the actor network $\pi_\theta(a_t \mid s_t)$ in SEG with the policy gradient of Equation (12). For the critic network, we use the squared loss to update its parameters $w$; the corresponding gradient is Equation (14). The details of our A2C training process are shown in Algorithm 1 and Section 3.2.

Algorithm 1
      … (7);
      Sample action $a_t \sim \pi_\theta(a_t \mid s_t)$ by Equation (8);
      Update $h^n_t$, $c^n_t$, $h^p_t$, $c^p_t$, $h^s_t$ and $c^s_t$ by $a_t$;
    end for
    Compute delayed reward by Equation (10);
  end for
  Update $\theta \leftarrow \theta + \alpha \nabla J(\theta)$ using Equation (12) for actor network;
  Update $w \leftarrow w + \beta \nabla J(w)$ using Equation (14) for critic network;
end for
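As an illustration of how an advantage can be obtained from a critic that only estimates $V(s)$, the sketch below uses the common one-step temporal-difference form $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and a mean squared critic loss. This particular form, the discount value (borrowed from the REINFORCE baseline setting) and the function names are assumptions; the exact expressions used in our training are those referenced in Algorithm 1.

```python
def td_advantages(rewards, values, gamma=0.95):
    """One-step advantages; the value after the final step is taken as 0."""
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(reward + gamma * next_value - values[t])
    return advantages


def critic_squared_loss(advantages):
    """Mean squared TD error, minimized when the critic predicts returns well."""
    return sum(a * a for a in advantages) / len(advantages)


rewards = [1.0, -1.0]    # toy intermediate rewards for two notes
values = [0.5, 0.0]      # toy critic estimates V(s_1), V(s_2)
advantages = td_advantages(rewards, values)
loss = critic_squared_loss(advantages)
print(advantages, loss)
```

A positive advantage raises the probability of the sampled action in the actor update, and a negative one lowers it.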

Melody Harmonization Module
The melody harmonization module produces a probability distribution over chord classes based on the segment representation from the structured representation module. Formally, we have

$$P(c \mid g) = \mathrm{softmax}(W_s x_s + b_s), \tag{15}$$

where $x_s$ is the segment-level representation vector from $\mathrm{LSTM}_s$. To train HAR, we use cross entropy as the loss function:

$$\mathcal{L} = -\sum_{g} \sum_{c=1}^{84} \hat{p}(c, g) \log P(c \mid g), \tag{16}$$

where $\hat{p}(c, g)$ is the one-hot distribution of segment sample $g$, and 84 is the number of possible chords in our HAR (the 84 possible chords are explained in Section 3.1).

Training Details
As the operations of REP, SEG and HAR are interleaved, the three modules should be trained jointly. The entire training process consists of three steps: (1) as a warm start to pre-train REP and HAR, we split melodies at every chord onset position in the dataset to obtain the segment boundaries and use GPR from GTTM to acquire the phrase boundaries; (2) we fix the parameters of REP and HAR and train SEG, as Algorithm 1 shows; and (3) we train REP, SEG and HAR jointly: REP provides state representations to SEG, SEG splits the melody sequence into segments and phrases, REP updates the HLSTM conditioned on the actions sampled by SEG, and HAR provides chords for the obtained segments.
To pre-train REP and HAR more efficiently, we utilize the chord boundaries from HLSD to be the segment boundaries and apply GPR from GTTM to acquire the noisy phrase boundaries. GPR has been verified as being more efficient in melodic segmentation via psychological experiments [17]. It contains several sub-rules, each of which is developed on different criteria and, therefore, yields different boundaries. In this work, we adopt one of the most commonly used sub-rules, GPR 2b, to obtain the noisy phrase boundaries [23,24,31]. In GPR 2b, edges of melody phrases appear when the difference of inter-onset-intervals (∆IOI) is negative.
Given a sequence of melody notes $(m_1, m_2, \ldots, m_T)$, each of which has an onset time $o_t$, the inter-onset interval between notes $m_t$ and $m_{t+1}$ can be calculated as $\mathrm{IOI}_t = o_{t+1} - o_t$. The difference between $\mathrm{IOI}_t$ and $\mathrm{IOI}_{t+1}$ is defined as $\Delta\mathrm{IOI}_t = \mathrm{IOI}_{t+1} - \mathrm{IOI}_t$. When $\Delta\mathrm{IOI}_t$ is negative, the note $m_t$ can be chosen as a phrase boundary. Figure 5 shows an example of how GPR can be used to identify phrase boundaries.
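The boundary rule just described can be sketched directly from the two definitions. The function name, the use of zero-based note indices and the toy onset times (in sixteenth-note units) are assumptions for illustration.

```python
def gpr2b_boundaries(onsets):
    """Return zero-based indices t of notes where delta-IOI_t is negative."""
    # IOI_t = o_{t+1} - o_t
    iois = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    # delta-IOI_t = IOI_{t+1} - IOI_t
    deltas = [later - earlier for earlier, later in zip(iois, iois[1:])]
    return [t for t, delta in enumerate(deltas) if delta < 0]


# Two short notes, one long note, then short notes again.
onsets = [0, 2, 4, 8, 10, 12]
boundaries = gpr2b_boundaries(onsets)
print(boundaries)
```

In this toy melody the third note (index 2) is followed by a longer gap than the notes after it, so it is marked as a candidate phrase boundary.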

Dataset
In this work, we use the Hooktheory Lead Sheet Dataset (HLSD) [14] to evaluate our proposed method. The dataset is collected from a user-contributed platform Hooktheory. It consists of high-quality, human-composed melodies along with corresponding chord accompaniments. Compared with other similar datasets, such as CSV Leadsheet Database [13] (collected from Wikifonia.org before the website terminated), HLSD provides more rhythmic chord sequences. Moreover, CSV Leadsheet Database only provides one chord for a melody bar, whereas in HLSD, there can be more than one chord in a melody bar. We can utilize chord onset positions from HLSD as the boundaries of our defined segments, and thus pre-train our REP and HAR.
We preprocess HLSD and split the melody/chord pairs every four bars without overlap, since the shortest remaining music piece contains only four bars after removing pieces without chord lines. Each melody sample in our preprocessed dataset therefore contains four bars. In addition, we filter out melody samples that have fewer than 12 or more than 32 notes. For bars that have fewer than three notes, phrase exploration is redundant. Filtering out samples longer than 32 notes saves training time. As Figure 6 shows, samples with a length of 32 (in red) mark a watershed in terms of frequency.

To normalize the different characteristics of melodies and chords in different keys and maintain data consistency, we utilized the C-key version of the selected music samples, which is provided by HLSD (C major or c minor based on the original key signature). We divided the preprocessed dataset into a training set of 14,313 sequences (from 8000 songs) and a test set of 3666 sequences (from 2000 songs) (available at https://github.com/TeresaTsang/preprocessed-HLSD; accessed on 4 October 2021). In our experiments, the proposed method was able to generate 84 possible chords. A chord can be of any type from {major, minor, diminished, seventh, major-seventh, minor-seventh and fully-diminished-seventh} with 12 possible root notes C, C#, D, D#, E, F, F#, G, G#, A, A# and B (also spelled with sharp signs as in Section 2.1). Some chords in HLSD are beyond our chord vocabulary; we transformed them into one of the 84 chords based on their chord constructions. For chords consisting of more than four notes, we kept their root notes and dropped the excess notes, i.e., 9ths, 11ths or additional extensions. Figure 7 shows the chord distribution of our preprocessed HLSD.

Our model performs both chord rhythm prediction and melody harmonization. As most of the existing methods are limited to generating a fixed number of chords for one bar or a half-bar, we first extract the chord rhythms from HLSD as the segmentation result and compare our method with the baselines to analyse the effect of engaging phrase-level structures. Later, we evaluate the performance of our reinforcement learning-based segmentation method.
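The size of the chord vocabulary and the length filter described in this section can be reproduced with a short sketch. The `root:type` label format and the function name are hypothetical; the note-count limits are treated as inclusive, which is an assumption about the boundary cases.

```python
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
CHORD_TYPES = ["major", "minor", "diminished", "seventh",
               "major-seventh", "minor-seventh", "fully-diminished-seventh"]

# 12 roots x 7 chord types give the 84 chord classes.
CHORD_VOCAB = [f"{root}:{kind}" for root in ROOTS for kind in CHORD_TYPES]


def keep_sample(notes, min_notes=12, max_notes=32):
    """Keep four-bar samples whose note count lies within the limits."""
    return min_notes <= len(notes) <= max_notes


samples = [list(range(n)) for n in (8, 12, 20, 32, 40)]
kept_lengths = [len(s) for s in samples if keep_sample(s)]
print(len(CHORD_VOCAB), kept_lengths)
```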

Experiment Settings
The note-level, phrase-level and segment-level LSTMs all have a hidden dimension of 256. The Adam algorithm [32] is employed as the optimizer with learning rate $\alpha = 1 \times 10^{-5}$. The batch size is 64 in the pre-training of the three modules and is reduced to 5 when they are trained jointly. The detailed parameter settings of each layer are shown in Figure 8.

Baselines
In this section, we first solely evaluate our REP and HAR using the chord boundaries from HLSD and phrase boundaries obtained by GPR. We compare our proposed method with SVM, CNN, LSTM and BiLSTM+BGS, each of which will provide a chord per obtained segment from HLSD. To fairly evaluate the performance of different models, we used the same symbolic music encoding representation for all the models as introduced in Section 2.1.
• SVM: Support Vector Machines (SVMs) [33] are a traditional machine learning method for classification problems. We applied the C-Support Vector Classification (SVC) algorithm to our harmonization task with the default configuration provided by the scikit-learn library [34]. As we have 84 possible classes when generating the target chords, the decision function type is set to "ovo" (one-versus-one), which SVC always uses as its multi-class strategy.
• CNN: Convolutional Neural Networks have been widely used in music generation tasks. We built the CNN architecture with two 2D convolution layers. The first one has a kernel size of 3 × 3 and is followed by a pooling layer of size 2 × 2. The second one has a kernel size of 4 × 4 and is also followed by a pooling layer of size 2 × 2.
• LSTM: LSTM is specialized for processing sequential data. For the experimental setting, a time-distributed input layer is built before the LSTM layer. The input layer has 13 units, representing the sequence of note feature vectors. The one-layer baseline LSTM network has 128 units (to be comparable with BiLSTM+BGS), and its output is fed into a fully-connected output layer with 84 units, representing the generated chord sequence.
• BiLSTM+BGS: A BiLSTM-based model with blocked Gibbs sampling was proposed in [15] for melody harmonization. In that work, the melody and partially masked chord sequences were fed into the model, and the model was expected to learn the masked chord ground truth. We adopted the blocked Gibbs sampling strategy used in [15] to mask the chord sequences. The sampling process uses an annealed masking probability as the proportion of chords that are randomly selected to be masked. Here, $\alpha_i$ is the proportion of variables that remain unchanged at iteration $i$, $N$ is the total number of iterations, set to 128 (the average length of the chord sequences in each batch), and $\alpha_{min} = 0.05$, $\alpha_{max} = 1$. We employed the architecture proposed in their work with our melody feature representation (the 13-dimensional vector). The masked chord ground truth and the melody context are fed into their model and concatenated. The concatenated context is then sent to a two-layer BiLSTM with a hidden size of 64 (the same as [15]), followed by a dropout layer and a fully-connected output layer with 84 units.
Other parameters involved in all the baselines, such as optimizer, dropout layer, and batch size, are the same as the settings in our proposed method.

Metrics
For the harmonization task, as we have ground-truth chord symbols from HLSD, we can evaluate the harmonization performance by comparing the predictions with the ground-truth labels. We first evaluate our method in terms of accuracy. Although the evaluation of music can be highly subjective, accuracy gives the most direct and reliable measure. It is computed by dividing the number of correctly predicted chords by the total number of predictions. That is,

$$\mathrm{Acc.} = \frac{\text{num. of correct chords}}{\text{num. of predictions}}. \tag{18}$$
In this section, we solely evaluate the performance of REP and HAR; hence, we use all the chord boundaries and directly predict one chord per segment. In other words, we do not need to consider chord rhythm issues. Each predicted chord is placed exactly at the onset time annotated in the dataset, so the number of predictions equals the number of chords in the ground truth. A predicted chord is counted as correct when it is identical to the ground truth.
In addition to the accuracy, we also employ two metrics to evaluate the quality of harmony and the tonal distances between predicted chords and the ground truth: (1) Melody-Chord Harmonicity (MCH) [35], which measures the harmony between melodies and chords, and (2) Tonal Pitch Step Distance (TPSD) [36] to estimate the tonal distances between generated and ground-truth chords.

Results
The evaluation of the harmonization task is shown in Table 1. In terms of accuracy, our proposed method outperformed all the baselines with Acc. reaching 37.42%. This indicates that substructure discovery in melody can help improve harmonization performance. As a reference, the BiLSTM and MTHarmonizer proposed in [14] in 2020 achieved accuracies of 35% and 38% on their compiled HLSD, whereas the accuracy of random guess was 2% (from Figure 7, Zero-R classifier's accuracy was 43.49%).
This implies that our 37.42% is indeed a respectable score comparable to the state of the art. For a fairer comparison, the work in [14] preprocesses HLSD by keeping two chords in a bar and dropping the superfluous chords, which means each chord serves for a half-bar. Their models output one chord per half-bar, which is then compared with the preprocessed ground truth. However, in the actual HLSD, there are numerous pieces where three or more chords accompany a single bar, and, in some pieces, one chord accompanying one note is not uncommon.
By contrast, our method and baselines generate chords for melody sequences of different lengths, conforming to the chord boundaries in the raw HLSD. This means that our model needs to deal with short sequences, while the work in [14] did not. Our task is therefore more challenging and more error-prone, and we consider 37.42% accuracy a competitive performance. Regarding the harmonicity metrics, we also computed the metric values on the original dataset as a reference. Before analysing the results, we briefly describe the value ranges of the MCH and TPSD scores. MCH scores range from 0 to infinity, and lower MCH scores indicate higher harmonicity between melody and chord. TPSD scores range from 0 to 1, where 1 indicates perfect tonal similarity between two chords. Since the MCH and TPSD scores of the different methods are very close to each other, we apply a paired-sample t-test to measure the differences between our results and those of the baselines.
As Figures 9 and 10 reveal, the p-values of all the paired t-tests are lower than 0.05, which indicates significant differences between our MCH/TPSD scores and those of the other methods. Interestingly, we can observe that the human-composed original pieces are worse than most of the other methods with respect to MCH score. Our proposed method has the second worst MCH score. This is because a relatively high MCH score reflects a tolerance of dissonances in some chord sequences, which is a common practice in music creation. From this perspective, the ability of our proposed method to generate dissonances for different colours is closest to the original dataset.
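The paired-sample t-test used above can be illustrated with a short sketch. The code below computes only the t statistic and the degrees of freedom from per-piece score pairs; the p-value would then be read from the Student t distribution (for example with `scipy.stats.ttest_rel`, which performs the whole test). The function name and the score values are hypothetical and do not come from our experiments.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples scores_a and scores_b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-piece scores for two methods evaluated on the same pieces.
ours = [1.0, 2.0, 3.0, 4.0]
baseline = [0.0, 0.0, 2.0, 2.0]
t, dof = paired_t_statistic(ours, baseline)
print(f"t = {t:.4f}, dof = {dof}")
```

Pairing by piece matters here: because the methods are scored on the same melodies, the test works on per-piece differences rather than on two independent samples.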
On the other hand, the MCH score of our method is in an acceptable range: it is only 0.1797 higher than the best MCH score, produced by LSTM. As for the TPSD score, our method yielded the best result, indicating that, compared with the other baselines, our model managed to generate chords that were closer to the human-composed chords.

Baselines
As few works concern flexible structures in melody, we selected the Melisma Music Analyzer developed by Sleator and Temperley [37] as the baseline in structure analysis. Influenced by GTTM, Melisma formulates a set of rules for each function it provides. We mainly employed the grouping analysis and harmony analysis from Melisma. The grouper program in Melisma groups melodic notes into phrases, while the harmony program predicts the chord sequence for a given melody. They are very similar to our SEG and HAR in functionality. To compare the performance of different RL algorithms, we selected REINFORCE as the second baseline in the structure analysis, with REP and HAR still engaged. The experimental setting of REINFORCE is the same as that of A2C except for the update rule:

$$\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t; \theta), \tag{19}$$

where $G_t$ is the cumulative return with discount rate $\gamma = 0.95$.
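To make the REINFORCE update in Equation (19) concrete, the following is a minimal sketch in plain Python. It assumes the standard definition of the discounted return, $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$, and uses a toy stateless softmax policy over three actions whose parameters are the logits themselves; the actual SEG policy is conditioned on the state $s_t$, and the reward values and the step size `alpha` below are hypothetical.

```python
import math

GAMMA = 0.95  # discount rate stated in the text


def discounted_returns(rewards, gamma=GAMMA):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, action, g_t, alpha=0.1):
    """One update theta <- theta + alpha * G_t * grad log pi(action).

    For a softmax over logits, the gradient of log pi(action) with respect to
    the logits is one_hot(action) - pi.
    """
    probs = softmax(theta)
    grad_log_pi = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    return [th + alpha * g_t * g for th, g in zip(theta, grad_log_pi)]


# Hypothetical episode: rewards for three decisions, all choosing action 2.
returns = discounted_returns([1.0, 0.0, 1.0])
theta = [0.0, 0.0, 0.0]
for g_t in returns:
    theta = reinforce_step(theta, action=2, g_t=g_t)
print(returns, theta)
```

Unlike A2C, this update weights the log-probability gradient by the raw return rather than by an advantage estimate from a learned critic, which typically leads to higher-variance updates.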

Metrics
For the structure segmentation task, we first evaluated the performance of segment identification by comparing the predicted boundaries with the chord boundaries from HLSD. We employed the widely used metrics Precision, Recall and F1-score to measure the segmentation result. Assume that the predicted segment boundaries are organized in a set $\hat{A} = \{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_T\}$ and the set of ground-truth segment boundaries is denoted as $A = \{a_1, a_2, \ldots, a_T\}$; then the Precision can be expressed as:

$$\text{Precision} = \frac{TP}{TP + FP},$$

where TP stands for True Positive, specifically $\hat{a}_t = 2$ and $a_t = 2$; FP denotes the False Positive case, specifically $\hat{a}_t = 2$ and $a_t = 0$. Similarly, Recall is formally written as:

$$\text{Recall} = \frac{TP}{TP + FN},$$

where FN denotes the False Negative case, specifically $\hat{a}_t \neq 2$ and $a_t = 2$.
Thus, the F1-score can be computed from Precision and Recall:

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

For the prediction of phrase boundaries, there is no ground truth that can be used to statistically indicate how well the phrases are identified in a melody sequence. We, therefore, evaluate the final harmonization result to show the importance of phrase-level structures. When considering the structures, one issue that should be addressed is that SEG might predict segment boundaries that differ from the ground truth in HLSD. Hence, we employ the weighted accuracy (WA) introduced in Equation (10). Since the harmony analyser in Melisma can only predict the chord root for a melody segment, $\hat{O}_T$ and $O_T$ here are the sets of predicted and ground-truth chord root notes mapped to each note $m_t$ in the melody $M_{1:T}$ (unlike the WA in Equation (10)).
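The segmentation metrics can be sketched as follows, using the standard definitions of Precision, Recall and F1-score. The snippet assumes per-note action labels in which the value 2 marks a segment boundary and treats every other label as "no segment boundary"; the function name and the example label sequences are hypothetical.

```python
SEGMENT_BOUNDARY = 2  # label that marks a segment (chord) boundary


def segmentation_scores(predicted, ground_truth, positive=SEGMENT_BOUNDARY):
    """Return (precision, recall, f1) for segment-boundary prediction."""
    if len(predicted) != len(ground_truth):
        raise ValueError("one label is expected per melody note")
    pairs = list(zip(predicted, ground_truth))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    # Guard against empty denominators (no predicted or no true boundaries).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical six-note melody: two boundaries found, one spurious, one missed.
predicted = [2, 0, 2, 0, 0, 2]
ground_truth = [2, 0, 0, 0, 2, 2]
precision, recall, f1 = segmentation_scores(predicted, ground_truth)
print(f"P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")
```

Note that this comparison is position-exact: a boundary predicted one note early or late counts as both a false positive and a false negative.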

Results
From the results shown in Table 2, we can see that the weighted accuracy is much lower than the accuracy we obtained in Table 1. This is understandable because the segment boundaries need to be learned in this section, whereas the results in Table 1 are obtained with the correct segment boundaries. Errors in the segmentation task are inevitable and lead to a mismatch between the ground-truth chords and the generated chords. Nevertheless, our method outperformed Melisma in both the segmentation and harmonization tasks. In SEG, the choice of RL algorithm also produced different results. A2C outperformed REINFORCE in all metrics except for the Recall of the segmentation, which implies that the superiority of A2C also holds in our melody harmonization task.
As the prediction of phrase boundaries is trained in an unsupervised manner and there is no ground truth for it, we illustrate the segmentation performance with two samples that have been split into phrases and segments and assigned chords, shown in Figures 11 and 12. In each figure, the ground truth from HLSD is written in black, the results from our model are in red and the results from Melisma are in purple. Purple horizontal lines illustrate the chord boundaries identified by Melisma, the blue lines show the phrase boundaries predicted by A2C and the red lines show the segments recognized by A2C. Similarly, the yellow vertical lines illustrate the phrase boundaries predicted by REINFORCE, and the orange vertical lines show the segments identified by REINFORCE.
In Figure 11, the sample is from a popular song, "Auld Lang Syne", where we can clearly see that our model is able to break the melody into variable-length segments and phrases. Although Melisma can also perform segmentation for chord boundaries, our method yields chord boundaries closer to those in the dataset. The phrase boundaries (in blue) obtained by our method with A2C mostly coincide with the grouping results from GPR, while the REINFORCE method identifies only one phrase.
For the generated chords, A2C helps to predict more chords correctly than REINFORCE and Melisma. The chords wrongly predicted by A2C and REINFORCE in the second bar are not too far away from the melody, as a correct G is predicted right after C. The REINFORCE method also predicts wrongly in the third bar; a possible explanation for this may be the lack of substructures identified in this measure. This observation further shows that substructures in melody can help with harmonization. As for Melisma, it predicts chords mainly based on the notes at the chord onset positions.
Figure 11. A sample from "Auld Lang Syne" processed by our model and Melisma in C major key.
Figure 12. A sample from "Eight Days A Week" by the Beatles processed by our model and Melisma in C major key.
In Figure 12, we can see that our method, with both A2C and REINFORCE, correctly predicts all the segment boundaries (chord boundaries), whereas Melisma predicts only a few of them. The harmonization results obtained by our method are consistent with the ground truth except in the second bar. The wrongly predicted C chord sounds inharmonious with the notes in the bar. A possible explanation is that a C major chord is learned from pieces that contain more complicated C-rooted chords. The C major chord along with the notes A and D (in the second bar) can constitute a C69 chord, which would then sound fine. In our data processing, however, we converted chords that were out of our vocabulary (a 69 chord in this case) into simpler ones, hence the C major chord. This reveals a limitation of our data-driven method in melody harmonization, where the conversion of chords can introduce deviations. This can be improved in the future by allowing more chord types and collecting more samples for each type. In contrast, the Melisma-generated chords in the first, third and last bars sound wrong even when only the root note is considered. Although our method has some limitations, it can still outperform the traditional rule-based method.
Interestingly, the results of phrase segmentation in Figure 12 only partially coincide with the grouping rules in GTTM. In the first bar, the first note is identified as a single phrase, which is reasonable as the first note in a music sample is important in terms of setting the tone for the whole piece. In the third bar, the first three notes are regarded as individual phrases; here, we can clearly see that the phrases are divided by note duration. All the results from our experiments show that substructures can be used to better comprehend melody sequences and further help improve harmonization results when compared with the baselines.

Conclusions
We present a novel reinforcement learning method that explores segments and phrases to form a hierarchical melody representation, which can help improve melody segmentation and harmonization. We used the onsets of chords in the dataset as the segment boundaries and the phrases obtained by GPR as the phrase boundaries to train the REP and the HAR module as a warm-start. After that, we trained SEG to break the melody into phrases and segments with fixed parameters of REP. Finally, the three modules were trained jointly to learn a refined structured representation for the harmonization task. Experiments showed that structure discovery can help improve the performance of harmonization. This methodology can also be generalized to tackle other MIR tasks where hierarchical structures may lead to better results.
As the segmentation module was trained in a semi-supervised manner (the prediction of phrases is trained in an unsupervised way), we believe the performance of segmentation will be enhanced if it can be supervised by a dataset with well-labelled phrase boundaries for use in harmonization tasks. The similarity between model-segmented and human-annotated phrases could be developed into a criterion for evaluating phrase discovery performance. We adopted grouping rules for more efficient learning, but we did not dig into the structures of chords. We believe that we could achieve higher performance by incorporating more rules about chord patterns and harmony analysis in our future work.
Author Contributions: Both authors contributed to this work, including the problem formalization, the ideas development, and the manuscript writing. T.Z. implemented the approaches, preprocessed the data, and conducted the experiments. F.C.M.L. polished the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.