Diffusion-Inspired Masked Language Modeling for Symbolic Harmony Generation on a Fixed Time Grid
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript presents a novel encoder-only Transformer framework for symbolic music harmonization, inspired by discrete diffusion and masked language modeling. Instead of autoregressive decoding, the proposed method progressively unmasks harmony tokens according to structured schedules. The system leverages a fixed time-grid representation, stage-aware embeddings, and pitch-class conditioning to improve harmony generation. Comprehensive experiments on the HookTheory dataset (in-domain) and jazz standards (out-of-domain) are conducted, with results compared against autoregressive baselines (GPT-2, BART).
The following issues have to be addressed before further consideration:
- Some sentences are long and complex, making them harder to follow for interdisciplinary readers (musicologists, engineers). Minor grammatical polishing would improve readability.
- How was the value of n chosen for the Random strategies (5% and 10%)? Was any sensitivity analysis performed?
- For the Midpoint Doubling strategy, how many unmasking steps (T_s) are required to fully reveal a 256-token sequence? This helps the reader understand the exact inference cost.
- The FMD results using POP909 as a reference are interesting, as the baselines perform better. The authors suggest this is because the baselines stay closer to a "pop" style. Could the POP909 dataset itself be more stylistically aligned with the autoregressive baselines' output than the jazz-influenced output of your MD models? This interpretation could be strengthened.
- Only HookTheory and jazz standards are used. The generalizability to other genres (classical, folk, non-Western music) is not explored. Is it possible to include more genres?
- The autoregressive baselines (GPT-2, BART) are standard, but recent specialized music transformers (e.g., Music Transformer, MelodyT5) could provide a stronger point of comparison.
- The description of the inference process for the Random n% strategy (Eq. 8, 9) is slightly vague. It states tokens can be assigned "using sampling strategies from the predicted token distributions." The top-k or top-p sampling used in the experiments should be explicitly stated in the main text or a footnote for reproducibility.
- The authors acknowledge that the unmasking schedules are hand-designed. This is a valid limitation. A more powerful approach would be to learn the optimal unmasking policy, perhaps conditioned on the musical context (e.g., meter, phrase structure). This could be a key point for future work.
Author Response
[Initial comment]: The manuscript presents a novel encoder-only Transformer framework for symbolic music harmonization, inspired by discrete diffusion and masked language modeling. Instead of autoregressive decoding, the proposed method progressively unmasks harmony tokens according to structured schedules. The system leverages a fixed time-grid representation, stage-aware embeddings, and pitch-class conditioning to improve harmony generation. Comprehensive experiments on the HookTheory dataset (in-domain) and jazz standards (out-of-domain) are conducted, with results compared against autoregressive baselines (GPT-2, BART).
The following issues have to be addressed before further consideration:
[Response to initial comment]: We would like to thank you for taking the time to review our manuscript and for making suggestions that improve its clarity and overall quality. We have revised the manuscript according to your suggestions, and the parts of the text that include new information are indicated in red font.
[Comment 1]: Some sentences are long and complex, making them harder to follow for interdisciplinary readers (musicologists, engineers). Minor grammatical polishing would improve readability.
[Response 1]: Thank you for your suggestion to improve readability. We have polished the text, corrected the grammatical issues we found, and split some long sentences. These corrections are not highlighted within the text, but we believe that we made good progress in terms of grammatical structure and readability.
[Comment 2]: How was the value of n chosen for the Random strategies (5% and 10%)? Was any sensitivity analysis performed?
[Comment 3]: For the Midpoint Doubling strategy, how many unmasking steps (T_s) are required to fully reveal a 256-token sequence? This helps the reader understand the exact inference cost.
[Response to 2 and 3]: Thank you for these comments. We had indeed neglected to analyse these design choices, and we have therefore added the following text in Section 2.3 after Equation 7: “For midpoint doubling, the number of steps required to unmask a sequence of $s$ time steps is $\log_2(s)+1$, since every step doubles the total number of unmasked tokens. In the current application, every harmonic sequence has length $s=256$, and therefore $T_s=9$. For comparable unmasking steps between the two examined unmasking strategies, we initially set $n=10\%$ in the Random $n\%$ unmasking, so that $T_s=10$. Since 10 unmasking steps could be considered a relatively small number compared to the total of 256 tokens that need to be revealed (on average 25.6 tokens are revealed per step), we also test $n=5\%$ (leading to $T_s=20$) to examine whether introducing more steps yields any improvement. A more elaborate sensitivity analysis could be performed to find the optimal value of $n$ for the Random $n\%$ strategy, but given the high cost in resources for training, in this paper we only heuristically examine $n=10\%$ and $n=5\%$.”
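The step counts quoted above can be sanity-checked with a short sketch (illustrative only; the function names are ours and are not part of the manuscript's code):

```python
import math

def midpoint_doubling_steps(s):
    """Steps needed when each step doubles the total unmasked tokens: 1, 2, 4, ..., s."""
    return int(math.log2(s)) + 1

def random_pct_steps(n_pct):
    """Steps needed when a fixed n% of the tokens is revealed per step."""
    return math.ceil(100 / n_pct)

print(midpoint_doubling_steps(256))  # 9
print(random_pct_steps(10))          # 10
print(random_pct_steps(5))           # 20
```

This confirms that Random 10% ($T_s=10$) is the closest match to midpoint doubling ($T_s=9$) in terms of inference cost, and that halving $n$ to 5% doubles the step count to 20.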
[Comment 4]: The FMD results using POP909 as a reference are interesting, as the baselines perform better. The authors suggest this is because the baselines stay closer to a "pop" style. Could the POP909 dataset itself be more stylistically aligned with the autoregressive baselines' output than the jazz-influenced output of your MD models? This interpretation could be strengthened.
[Response 4]: Thank you for this comment; we intended to highlight the stylistic similarities between the POP909 dataset and the output of the autoregressive baseline models, even when they are prompted with jazz-style melodies. To make this statement clearer, we have modified the text of this paragraph as follows: “When evaluated with POP909-based embeddings, all models showed closer performance, with MD models leading in-domain, followed by the baseline models. On the contrary, the baseline models perform better in the out-of-domain harmonization task (FMD = 638.06), i.e., the pop-style POP909 is better aligned with the output of the baselines, even though the task is to produce jazz output. The low POP909 FMD scores of the baseline models (e.g., GPT-2 and BART) in both settings might suggest that they ``rigidly'' produce pop-style output even when they are prompted with jazz-style melodies, therefore remaining closer to pop-style harmonic distributions. This fact, in combination with the better out-of-domain FMD scores of our method, indicates that our methodology is more flexible in capturing out-of-domain nuances, whereas the baseline models follow the training style more rigidly.”
[Comment 5]: Only HookTheory and jazz standards are used. The generalizability to other genres (classical, folk, non-Western music) is not explored. Is that possible to include more genres?
[Response 5]: Thank you for pinpointing this issue, which is indeed not discussed thoroughly in our paper and which constitutes a problem for all research on computational melodic harmonization: the availability of high-quality, large-scale datasets. Since data availability is limited, we follow a similar approach to other work in the melodic harmonization literature. Such research mainly uses HookTheory as the basic dataset, which offers a satisfactory quality and quantity of data, plus some other dataset that is considered out of domain, mainly jazz, e.g., the Chord Melody Dataset (CMD) in [6]. Instead of the CMD, we use our own curated jazz dataset, which is somewhat larger than the CMD (650 vs. 473 pieces) and does not include the CMD's inherent restrictions on the number of chords per bar, time signatures, and time resolution. We have included the following text in Section 3.2 that includes this discussion: “Given the scarcity of high-quality, large-scale data for melodic harmonization, we follow an in-domain and out-of-domain evaluation approach as in [6]. Thereby, training and in-domain evaluation are performed on the HookTheory dataset and a jazz dataset is employed for out-of-domain evaluation. In contrast to [6], we do not use the Chord Melody Dataset (CMD) [30], since it includes pieces with restrictions regarding the number of chords in each bar and the time resolution of notes. We use our own curated dataset of jazz standard melody harmonizations, which is also larger than the CMD (650 vs. 473 pieces).”
[Comment 6]: The autoregressive baselines (GPT-2, BART) are standard, but recent specialized music transformers (e.g., Music Transformer, MelodyT5) could provide a stronger point of comparison.
[Response 6]: Thank you for bringing up this issue; we noticed that we did not give any details about the inherent restrictions on choosing baseline models for comparison. Furthermore, we did not clarify that we examine the strict form of melodic harmonization, which disregards rhythm patterns in the chords and aims only at finding the proper harmonic context for melodic segments. We clarified the strict approach to melodic harmonization in the Introduction by adding two sentences: a) “Additionally, such methods consider chord rhythm patterns that include chord repetitions as part of the melodic harmonization process. This differs from a stricter definition of melodic harmonization as assigning harmony/chords to melody segments, disregarding chord repetitions”; and b) “The output of the model is a sequence of chord symbols that cover segments of the melody, reflecting pure harmonic rhythm (points where chord symbols change, disregarding chord repetitions).”
In Section 3.2, where the dataset is presented, we have added the following text: “The pieces in the dataset have been modified to reflect harmonic rhythm, i.e., locations where chords change within each bar. Chord repetitions that reflect rhythm beyond the harmonic rhythm have been removed. The only chord repetition that is allowed is at the beginning of each bar, if the starting chord of a bar is the ending chord from the previous bar”.
We have also added a paragraph that clarifies the two fundamental reasons why our approach to melodic harmonization is not compatible with the output of state-of-the-art transformer models at the end of Section 3.1, as follows: “For the comparison with baseline models to be meaningful and fair, it was necessary to develop the above-mentioned baseline models, since specific fundamental limitations make existing state-of-the-art (SoA) models incompatible for comparison. The first limitation is that SoA models that perform melodic harmonization (e.g., [5,6,8]) do not consider a strict definition of harmonic rhythm, as we do in our model. That is, SoA models consider chord rhythmic patterns via chord repetitions within bars, which makes their output incompatible for comparison with the ground truth we use and with the output of our model. A second limitation is that SoA models either output accompaniment instead of harmonization [5], i.e., discrete notes that potentially form elaborate patterns and require an additional inference step to be transformed into chord symbols, or use chord symbols from a restricted dictionary [6,8]. For comparison, [6] considers 6 chord qualities for all 12 roots, leading to 72 chord symbols in total, while our approach considers 29 qualities for 12 roots, leading to 348 chord symbols in total. Therefore, a direct comparison with SoA models would require significant post-processing and additional inference steps that would possibly distort their output and lead to unclear results.”
[Comment 7]: The description of the inference process for the Random n% strategy (Eq. 8,9) is slightly vague. It states tokens can be assigned "using sampling strategies from the predicted token distributions." For the top-k or top-p sampling used in the experiments should be explicitly stated in the main text or a footnote for reproducibility.
[Response 7]: Thank you for this comment, which made us realize that we had not clarified the generation process adequately, both for the proposed model and for the baselines. To clarify this issue, we have added the following paragraph as the second paragraph of Section 4: “During inference, the autoregressive models generate using beam search with 5 beams, because the produced results had better long-term structure than with any other sampling method. Additionally, with beam search, the generated results are consistent across runs and better reflect the ``best take'' of the models in each melodic harmonization task. For the proposed method, simple multinomial sampling from Equation [9] is performed with temperature 1.0. The temperature value 1.0 was chosen because the results did not differ much across temperature values between 0.5 and 1.5, and we therefore considered it better to use the temperature value that keeps the logits distribution unaltered.”
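For clarity, temperature-scaled multinomial sampling of the kind described above can be sketched as follows (a minimal illustration, assuming a plain softmax over per-token logits; the function name `sample_token` is ours, not the manuscript's):

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Draw one token index by multinomial sampling from softmax(logits / temperature).

    With temperature 1.0 the softmax of the raw logits is used, i.e., the
    predicted distribution is left unaltered, as described in the response.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]
```

Lower temperatures sharpen the distribution toward the argmax token, while higher temperatures flatten it; at 1.0 the model's own probabilities are sampled directly.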
[Comment 8]: The authors acknowledge that the unmasking schedules are hand-designed. This is a valid limitation. A more powerful approach would be to learn the optimal unmasking policy, perhaps conditioned on the musical context (e.g., meter, phrase structure). This could be a key point for future work.
[Response 8]: Thank you for this comment; we should indeed mention the possibility of improving on the hand-designed heuristic unmasking methods. Following your suggestion, we have included the following text at the end of Section 5: “The midpoint doubling unmasking strategy is well motivated by the hierarchical nature of musical structure. However, indicating bar locations with special tokens could enable more elaborate strategies that adapt to bar-level organization, for example, starting harmonization from the last bar containing melody notes.”
Reviewer 2 Report
Comments and Suggestions for Authors
This paper introduces a novel encoder-only Transformer architecture for melodic harmonization, trained via a diffusion-inspired masked language modeling framework. By progressively unmasking chord tokens at fixed 16th-note intervals, the authors avoid autoregressive decoding while achieving greater controllability and improved stylistic diversity. The technical formulation is rigorous, the evaluation thorough, and the contribution timely, given the rapid interest in discrete diffusion methods for symbolic music.
Strengths:
The integration of diffusion-inspired unmasking with an encoder-only Transformer for symbolic harmonization is original and timely. The work bridges discrete diffusion techniques with music generation, extending methods like MaskGIT into the domain of symbolic harmony. The limitations of autoregressive approaches (lack of global control, sequential inefficiency) are well articulated, justifying the proposed non-autoregressive approach. Results span symbolic music metrics, harmonic rhythm measures, and FMD scores, offering both interpretability and embedding-level comparison.
Weaknesses / Suggestions:
Dataset Narrowness: Training and evaluation are centered around HookTheory (pop) and jazz standards. Broader validation across other genres (classical, folk, non-Western traditions) would test the robustness of the approach.
Hand-Designed Unmasking Schedules: Both random unmasking and midpoint doubling are heuristic choices. The paper notes this as a limitation, but more emphasis on learnable or adaptive strategies would improve the long-term impact.
Baselines: While GPT-2 and BART baselines are included, more recent symbolic music diffusion or transformer-based harmonization methods (e.g., MelodyT5, controllable Music Transformers) could be used for stronger benchmarking.
The paper introduces an original and well-executed contribution to symbolic music generation with strong technical and empirical support.
Author Response
[Initial comment]: This paper introduces a novel encoder-only Transformer architecture for melodic harmonization, trained via a diffusion-inspired masked language modeling framework. By progressively unmasking chord tokens at fixed 16th-note intervals, the authors avoid autoregressive decoding while achieving greater controllability and improved stylistic diversity. The technical formulation is rigorous, the evaluation thorough, and the contribution timely, given the rapid interest in discrete diffusion methods for symbolic music.
Strengths:
The integration of diffusion-inspired unmasking with an encoder-only Transformer for symbolic harmonization is original and timely. The work bridges discrete diffusion techniques with music generation, extending methods like MaskGIT into the domain of symbolic harmony. The limitations of autoregressive approaches (lack of global control, sequential inefficiency) are well articulated, justifying the proposed non-autoregressive approach. Results span symbolic music metrics, harmonic rhythm measures, and FMD scores, offering both interpretability and embedding-level comparison.
Weaknesses / Suggestions:
[Response to initial comment]: We would like to thank you for taking the time to review our manuscript and for making suggestions that improve its clarity and overall quality. We have revised the manuscript according to your suggestions, and the parts of the text that include new information are indicated in red font.
[Comment 1]: Dataset Narrowness: Training and evaluation are centered around HookTheory (pop) and jazz standards. Broader validation across other genres (classical, folk, non-Western traditions) would test the robustness of the approach.
[Response 1]: Thank you for pinpointing this issue, which is indeed not discussed thoroughly in our paper and which constitutes a problem for all research on computational melodic harmonization: the availability of high-quality, large-scale datasets. Since data availability is limited, we follow a similar approach to other work in the melodic harmonization literature. Such research mainly uses HookTheory as the basic dataset, which offers a satisfactory quality and quantity of data, plus some other dataset that is considered out of domain, mainly jazz, e.g., the Chord Melody Dataset (CMD) in [6]. Instead of the CMD, we use our own curated jazz dataset, which is somewhat larger than the CMD (650 vs. 473 pieces) and does not include the CMD's inherent restrictions on the number of chords per bar, time signatures, and time resolution. We have included the following text in Section 3.2 that includes this discussion: “Given the scarcity of high-quality, large-scale data for melodic harmonization, we follow an in-domain and out-of-domain evaluation approach as in [6]. Thereby, training and in-domain evaluation are performed on the HookTheory dataset and a jazz dataset is employed for out-of-domain evaluation. In contrast to [6], we do not use the Chord Melody Dataset (CMD) [30], since it includes pieces with restrictions regarding the number of chords in each bar and the time resolution of notes. We use our own curated dataset of jazz standard melody harmonizations, which is also larger than the CMD (650 vs. 473 pieces).”
[Comment 2]: Hand-Designed Unmasking Schedules: Both random unmasking and midpoint doubling are heuristic choices. The paper notes this as a limitation, but more emphasis on learnable or adaptive strategies would improve the long-term impact.
[Response 2]: Thank you for this comment; we should indeed mention the possibility of improving on the hand-designed heuristic unmasking methods. Following your suggestion, we have included the following text at the end of Section 5: “The midpoint doubling unmasking strategy is well motivated by the hierarchical nature of musical structure. However, indicating bar locations with special tokens could enable more elaborate strategies that adapt to bar-level organization, for example, starting harmonization from the last bar containing melody notes.”
[Comment 3]: While GPT-2 and BART baselines are included, more recent symbolic music diffusion or transformer-based harmonization methods (e.g., MelodyT5, controllable Music Transformers) could be used for stronger benchmarking.
[Response 3]: Thank you for bringing up this issue; we noticed that we did not give any details about the inherent restrictions on choosing baseline models for comparison. Furthermore, we did not clarify that we examine the strict form of melodic harmonization, which disregards rhythm patterns in the chords and aims only at finding the proper harmonic context for melodic segments. We clarified the strict approach to melodic harmonization in the Introduction by adding two sentences: a) “Additionally, such methods consider chord rhythm patterns that include chord repetitions as part of the melodic harmonization process. This differs from a stricter definition of melodic harmonization as assigning harmony/chords to melody segments, disregarding chord repetitions”; and b) “The output of the model is a sequence of chord symbols that cover segments of the melody, reflecting pure harmonic rhythm (points where chord symbols change, disregarding chord repetitions).”
In Section 3.2, where the dataset is presented, we have added the following text: “The pieces in the dataset have been modified to reflect harmonic rhythm, i.e., locations where chords change within each bar. Chord repetitions that reflect rhythm beyond the harmonic rhythm have been removed. The only chord repetition that is allowed is at the beginning of each bar, if the starting chord of a bar is the ending chord from the previous bar”.
We have also added a paragraph that clarifies the two fundamental reasons why our approach to melodic harmonization is not compatible with the output of state-of-the-art transformer models at the end of Section 3.1, as follows: “For the comparison with baseline models to be meaningful and fair, it was necessary to develop the above-mentioned baseline models, since specific fundamental limitations make existing state-of-the-art (SoA) models incompatible for comparison. The first limitation is that SoA models that perform melodic harmonization (e.g., [5,6,8]) do not consider a strict definition of harmonic rhythm, as we do in our model. That is, SoA models consider chord rhythmic patterns via chord repetitions within bars, which makes their output incompatible for comparison with the ground truth we use and with the output of our model. A second limitation is that SoA models either output accompaniment instead of harmonization [5], i.e., discrete notes that potentially form elaborate patterns and require an additional inference step to be transformed into chord symbols, or use chord symbols from a restricted dictionary [6,8]. For comparison, [6] considers 6 chord qualities for all 12 roots, leading to 72 chord symbols in total, while our approach considers 29 qualities for 12 roots, leading to 348 chord symbols in total. Therefore, a direct comparison with SoA models would require significant post-processing and additional inference steps that would possibly distort their output and lead to unclear results.”
[Concluding comment]: The paper introduces an original and well-executed contribution to symbolic music generation with strong technical and empirical support.
[Response to concluding comment]: We would like to thank you again for your valuable suggestions that contributed to the improvement of the manuscript.