HarmonyTok: Comparing Methods for Harmony Tokenization for Machine Learning
Abstract
1. Introduction
2. Harmony Tokenization Methods
2.1. Melody and Harmony Tokenization Foundations
- <bar> to indicate a new bar;
- <rest> to indicate a rest;
- position_BxSD for note onset time;
- P:X for MIDI pitch.
- `<s>’, `<bar>’, `ts_4x4’, `position_0x00’, `<rest>’,
- `position_0x50’, `P:59’, `position_1x00’, `P:59’,
- `position_1x50’, `P:59’, `position_2x00’, `P:59’
2.2. Harmony Tokenization
- ChordSymbol: Each token directly encodes the chord symbol as it appears in the lead sheet, e.g., C:maj7. To ensure homogeneity between different representations of the same chord symbols (e.g., Cmaj and C△), all chords are transformed into their equivalent MIR_eval [30] symbol according to their pitch class correspondence.
- RootType: Separate tokens are used for the root and the quality of the chord, e.g., C and maj7.
- PitchClass: Chord symbols are not used; instead, the chord is represented as a set of pitch classes. For example, a Cmaj7 chord is tokenized as chord_pc_0, chord_pc_4, chord_pc_7, and chord_pc_11.
- RootPC: Similar to the previous method but includes a dedicated token for the root pitch class, e.g., root_pc_0, chord_pc_4, chord_pc_7, and chord_pc_11. This helps disambiguate chords with similar pitch class sets (e.g., Cmaj6 vs. Am7).
3. Experimental Setup, Models, Data, and Evaluation Approach
- Masked Language Modeling (MLM): A self-supervised objective for training encoder-based models such as BERT [31] and RoBERTa [32]. MLM randomly masks a subset of input tokens (typically 15%) and trains the model to recover the original tokens using contextual cues. We follow the RoBERTa strategy, removing the Next Sentence Prediction (NSP) component and focusing solely on predicting <mask> tokens.
- Encoder–Decoder Melody Harmonization Using BART [33]: This task encodes a melody and generates a corresponding harmony sequence autoregressively until an end-of-sequence token or a token limit is reached.
- Decoder-Only Melody Harmonization Modeled Using GPT-2 [34]: Harmony tokens are generated from left to right, conditioned on a melody prompt. While prior work typically uses encoder–decoder models for harmonization, we explore the applicability of decoder-only models in this context.
- Token-Based Metrics: Measure internal consistency and structural validity of generated token sequences.
- Symbolic Music Metrics: Capture harmonic, melodic, and rhythmic attributes of the generated sequences, independent of exact token alignment with ground truth.
- How closely do generated token sequences align with the ground truth?
- Are the generated chord progressions musically coherent, harmonically accurate, and rhythmically appropriate?
- How effectively do BART and GPT-2 leverage different tokenizations to generate high-quality harmonizations?
3.1. Token Metrics
3.1.1. Duplicate Ratio (DR)
3.1.2. Token Consistency Ratio (TCR)
3.2. Symbolic Music Metrics
3.2.1. Chord Progression Coherence and Diversity
Chord Histogram Entropy (CHE)
Chord Coverage (CC)
Chord Tonal Distance (CTD)
3.2.2. Chord/Melody Harmonicity
Chord Tone to Non-Chord Tone Ratio (CTnCTR)
Pitch Consonance Score (PCS)
Melody–Chord Tonal Distance (MCTD)
3.2.3. Harmonic Rhythm Coherence and Diversity
Harmonic Rhythm Histogram Entropy (HRHE)
Harmonic Rhythm Coverage (HRC)
Chord Beat Strength (CBS)
- Score 0 if a chord occurs exactly at the start of the measure (onset 0.0).
- Score 1 if it coincides with any other strong beat (e.g., onsets 1.0, 2.0, and 3.0 in a 4/4 measure).
- Score 2 if placed on a half-beat (e.g., 0.5, 1.5, etc.).
- Score 3 if on an eighth subdivision (e.g., 0.25 and 0.75).
- Score 4 otherwise.
3.2.4. Fréchet Audio Distance Evaluation
4. Results
4.1. Training Analysis
4.1.1. Information-Theoretic Considerations
4.1.2. Training Convergence Analysis
- correct bar: The bar tokens are predicted correctly.
- correct new chord: A position_X token is correctly placed regardless of the specific position value. This indicates that the model correctly identified the onset of a new chord (rather than a new bar or additional pitch) irrespective of the precise timing.
- correct position: The model correctly identifies both the presence and the timing of a new chord.
- correct chord: The complete chord structure is accurately predicted. For the PitchClass tokenizer, which lacks root information, this refers to correctly predicting the full set of pitch classes that constitute the chord.
- correct root: The root note of the chord is correctly predicted. This criterion is not applicable to the PitchClass tokenizer.
4.2. Melodic Harmonization Generation Results
4.2.1. Token and Symbolic Music Metrics
4.2.2. FAD Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
- Kudo, T.; Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv 2018, arXiv:1808.06226. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
- Agosti, G. Transformer Networks for the Modelling of Jazz Harmony. Master’s Thesis, Politecnico di Milano, Milano, Italy, 2021. [Google Scholar]
- Hahn, S.; Yin, J.; Zhu, R.; Xu, W.; Jiang, Y.; Mak, S.; Rudin, C. SentHYMNent: An Interpretable and Sentiment-Driven Model for Algorithmic Melody Harmonization. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 5050–5060. [Google Scholar]
- Cambouropoulos, E.; Kaliakatsos-Papakostas, M.A.; Tsougras, C. An idiom-independent representation of chords for computational music analysis and generation. In Proceedings of the Joint 40th International Computer Music Conference (ICMC) and 11th Sound and Music Computing (SMC) Conference (ICMC-SMC2014), Athens, Greece, 14–20 September 2014. [Google Scholar]
- Kaliakatsos-Papakostas, M.; Katsiavalos, A.; Tsougras, C.; Cambouropoulos, E. Harmony in the polyphonic songs of epirus: Representation, statistical analysis and generation. In Proceedings of the 4th International Workshop on Folk Music Analysis, Istanbul, Turkey, 12–13 June 2014; pp. 21–28. [Google Scholar]
- Kaliakatsos-Papakostas, M.A.; Zacharakis, A.I.; Tsougras, C.; Cambouropoulos, E. Evaluating the General Chord Type Representation in Tonal Music and Organising GCT Chord Labels in Functional Chord Categories. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, Málaga, Spain, 26–30 October 2015; pp. 427–433. [Google Scholar]
- Costa, L.F.; Barchi, T.M.; de Morais, E.F.; Coca, A.E.; Schemberger, E.E.; Martins, M.S.; Siqueira, H.V. Neural networks and ensemble based architectures to automatic musical harmonization: A performance comparison. Appl. Artif. Intell. 2023, 37, 2185849. [Google Scholar] [CrossRef]
- Yeh, Y.C.; Hsiao, W.Y.; Fukayama, S.; Kitahara, T.; Genchel, B.; Liu, H.M.; Dong, H.W.; Chen, Y.; Leong, T.; Yang, Y.H. Automatic melody harmonization with triad chords: A comparative study. J. New Music Res. 2021, 50, 37–51. [Google Scholar] [CrossRef]
- Chen, Y.W.; Lee, H.S.; Chen, Y.H.; Wang, H.M. SurpriseNet: Melody harmonization conditioning on user-controlled surprise contours. arXiv 2021, arXiv:2108.00378. [Google Scholar]
- Sun, C.E.; Chen, Y.W.; Lee, H.S.; Chen, Y.H.; Wang, H.M. Melody harmonization using orderless NADE, chord balancing, and blocked Gibbs sampling. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4145–4149. [Google Scholar]
- Zeng, T.; Lau, F.C. Automatic melody harmonization via reinforcement learning by exploring structured representations for melody sequences. Electronics 2021, 10, 2469. [Google Scholar] [CrossRef]
- Wu, S.; Yang, Y.; Wang, Z.; Li, X.; Sun, M. Generating chord progression from melody with flexible harmonic rhythm and controllable harmonic density. EURASIP J. Audio Speech Music Process. 2024, 2024, 4. [Google Scholar] [CrossRef]
- Wu, S.; Li, X.; Sun, M. Chord-conditioned melody harmonization with controllable harmonicity. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Yi, L.; Hu, H.; Zhao, J.; Xia, G. Accomontage2: A complete harmonization and accompaniment arrangement system. arXiv 2022, arXiv:2209.00353. [Google Scholar] [CrossRef]
- Rhyu, S.; Choi, H.; Kim, S.; Lee, K. Translating melody to chord: Structured and flexible harmonization of melody with transformer. IEEE Access 2022, 10, 28261–28273. [Google Scholar] [CrossRef]
- Zhou, J.; Zhu, H.; Wang, X. Choir Transformer: Generating Polyphonic Music with Relative Attention on Transformer. arXiv 2023, arXiv:2308.02531. [Google Scholar] [CrossRef]
- Huang, J.; Yang, Y.H. Emotion-driven melody harmonization via melodic variation and functional representation. arXiv 2024, arXiv:2407.20176. [Google Scholar] [CrossRef]
- Wu, S.; Wang, Y.; Li, X.; Yu, F.; Sun, M. Melodyt5: A unified score-to-score transformer for symbolic music processing. arXiv 2024, arXiv:2407.02277. [Google Scholar]
- Cholakov, V. AI Enhancer—Harmonizing Melodies of Popular Songs with Sequence-to-Sequence. Master’s Thesis, The University of Edinburgh, Edinburgh, UK, 2018. [Google Scholar]
- Huang, C.Z.A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck, D. Music Transformer. arXiv 2018, arXiv:1809.04281. [Google Scholar] [CrossRef]
- Wang, Z.; Wang, D.; Zhang, Y.; Xia, G. Learning interpretable representation for controllable polyphonic music generation. arXiv 2020, arXiv:2008.07122. [Google Scholar] [CrossRef]
- Min, L.; Jiang, J.; Xia, G.; Zhao, J. Polyffusion: A diffusion model for polyphonic score generation with internal and external controls. arXiv 2023, arXiv:2307.10304. [Google Scholar] [CrossRef]
- Ji, S.; Yang, X.; Luo, J.; Li, J. Rl-chord: Clstm-based melody harmonization using deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 11128–11141. [Google Scholar] [CrossRef]
- Lim, H.; Rhyu, S.; Lee, K. Chord generation from symbolic melody using BLSTM networks. arXiv 2017, arXiv:1712.01011. [Google Scholar] [CrossRef]
- Ji, S.; Yang, X. Emotion-conditioned melody harmonization with hierarchical variational autoencoder. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 228–233. [Google Scholar]
- Raffel, C.; McFee, B.; Humphrey, E.J.; Salamon, J.; Nieto, O.; Liang, D.; Ellis, D.P.; Raffel, C.C. MIR_EVAL: A Transparent Implementation of Common MIR Metrics. In Proceedings of the ISMIR, Taipei, Taiwan, 27–31 October 2014; Volume 10, p. 2014. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Long and Short Papers). Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Krumhansl, C.L. Cognitive Foundations of Musical Pitch; Oxford University Press: Oxford, UK, 2001. [Google Scholar]
- Fradet, N.; Gutowski, N.; Chhel, F.; Briot, J.P. Byte pair encoding for symbolic music. arXiv 2023, arXiv:2301.11975. [Google Scholar] [CrossRef]
- Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA, USA, 8–12 August 2006; pp. 223–231. [Google Scholar]
- Harte, C.; Sandler, M.; Gasser, M. Detecting harmonic change in musical audio. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, Santa Barbara, CA, USA, 27 October 2006; pp. 21–26. [Google Scholar]
- Kilgour, K.; Zuluaga, M.B.; Roblek, D.; Sharifi, M. Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms. Proc. Interspeech 2019, 2019, 2350–2354. [Google Scholar] [CrossRef]
- Wang, Z.; Chen, K.; Jiang, J.; Zhang, Y.; Xu, M.; Dai, S.; Bin, G.; Xia, G. POP909: A Pop-song Dataset for Music Arrangement Generation. In Proceedings of the of 21st International Conference on Music Information Retrieval, ISMIR, Virtual, 11–16 October 2020. [Google Scholar]
- Li, Y.; Yuan, R.; Zhang, G.; Ma, Y.; Chen, X.; Yin, H.; Lin, C.; Ragni, A.; Benetos, E.; Gyenge, N.; et al. MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training. arXiv 2023, arXiv:2306.00107. [Google Scholar] [CrossRef]
- Gui, A.; Liu, S.; Yang, Y.; Yang, L.; Li, Y. Adapting Fréchet Audio Distance for Generative Music Evaluation. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1331–1335. [Google Scholar] [CrossRef]
- Deutsch, P. Rfc1951: Deflate Compressed Data Format Specification Version 1.3. 1996. Available online: https://www.ietf.org/rfc/rfc1951.txt (accessed on 28 August 2025).
- Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 1977, 23, 337–343. [Google Scholar] [CrossRef]
- Huffman, D.A. A method for the construction of minimum-redundancy codes. Proc. IRE 1952, 40, 1098–1101. [Google Scholar] [CrossRef]
Tokenizer | Example |
---|---|
ChordSymbol | <h> <bar> position_0x00 C:maj <bar> position_0x00 E:min position_2x00 G:maj |
RootType | <h> <bar> position_0x00 C maj <bar> position_0x00 E min position_2x00 G maj |
PitchClass | <h> <bar> position_0x00 chord_pc_0 chord_pc_4 chord_pc_7 <bar> position_0x00 chord_pc_4 chord_pc_7 chord_pc_11 position_2x00 chord_pc_7 chord_pc_11 chord_pc_2 |
RootPC | <h> <bar> position_0x00 chord_root_0 chord_pc_4 chord_pc_7 <bar> position_0x00 chord_root_4 chord_pc_7 chord_pc_11 position_2x00 chord_root_7 chord_pc_11 chord_pc_2 |
Tokenizer | Vocab. Size | Mean Len. | Max Len. | Total Compr. | Hyp. Compr. Len. |
---|---|---|---|---|---|
ChordSymbol | 436 | 48.14 | 484 | 0.05926 | 2.8525 |
RootType | 129 | 66.07 | 659 | 0.04832 | 3.1922 |
RootPC | 112 | 87.11 | 866 | 0.04140 | 3.6061 |
PitchClass | 100 | 87.11 | 866 | 0.04138 | 3.6045 |
MelodyPitch | 195 | 132.46 | 1350 | 0.07501 | 9.9359 |
Metric | ChordSymbol | RootType | PitchClass | RootPC |
---|---|---|---|---|
BART | ||||
correct bar | 0.9363 | 0.9183 | 0.8929 | 0.8585 |
correct new chord | 0.9092 | 0.9106 | 0.9000 | 0.9111 |
correct position | 0.8511 | 0.8457 | 0.8364 | 0.8454 |
correct chord | 0.5848 | 0.5312 | 0.5759 | 0.5795 |
correct root | 0.6237 | 0.5833 | — | 0.6318 |
GPT-2 | ||||
correct bar | 0.9349 | 0.9152 | 0.9159 | 0.9064 |
correct new chord | 0.9111 | 0.9072 | 0.9030 | 0.8988 |
correct position | 0.8542 | 0.8525 | 0.8527 | 0.8502 |
correct chord | 0.5703 | 0.5527 | 0.5814 | 0.5885 |
correct root | 0.6168 | 0.5959 | — | 0.6434 |
Model | Tokenizer | Duplicate Ratio | Token Consistency Ratio |
---|---|---|---|
BART | ChordSymbol | 0.02 | 96.23 |
BART | RootType | 0.03 | 98.89 |
BART | PitchClass | 0.00 | 99.98 |
BART | RootPC | 0.01 | 99.20 |
GPT-2 | ChordSymbol | 0.01 | 98.05 |
GPT-2 | RootType | 0.01 | 99.09 |
GPT-2 | PitchClass | 0.00 | 99.92 |
GPT-2 | RootPC | 0.00 | 99.50 |
Model | Tokenizer | CHE | CC | CTD | CTnCTR | PCS | MCTD | HRHE | HRC | CBS |
---|---|---|---|---|---|---|---|---|---|---|
Ground Truth | 1.359 | 4.663 | 0.897 | 0.837 | 0.480 | 1.345 | 0.685 | 2.789 | 0.441 | |
BART | ChordSymbol | 0.943 | 2.922 | 0.886 | 0.803 | 0.436 | 1.384 | 0.352 | 1.778 | 0.234 |
BART | RootType | 1.002 | 3.055 | 0.938 | 0.789 | 0.435 | 1.411 | 0.332 | 1.663 | 0.226 |
BART | PitchClass | 0.965 | 2.996 | 0.913 | 0.789 | 0.425 | 1.413 | 0.391 | 1.841 | 0.281 |
BART | RootPC | 0.948 | 2.923 | 0.917 | 0.779 | 0.419 | 1.421 | 0.384 | 1.781 | 0.259 |
GPT-2 | ChordSymbol | 0.839 | 2.650 | 0.881 | 0.766 | 0.376 | 1.444 | 0.355 | 1.810 | 0.236 |
GPT-2 | RootType | 0.803 | 2.623 | 0.822 | 0.779 | 0.407 | 1.423 | 0.300 | 1.610 | 0.192 |
GPT-2 | PitchClass | 0.898 | 2.692 | 0.851 | 0.807 | 0.447 | 1.392 | 0.345 | 1.734 | 0.235 |
GPT-2 | RootPC | 0.803 | 2.623 | 0.822 | 0.779 | 0.407 | 1.423 | 0.299 | 1.610 | 0.192 |
Model + Tokenizer | FAD (909, MERT) | FAD (Internal, MERT) |
---|---|---|
BART_PitchClassTokenizer | 11.6472 | 2.8262 |
BART_RootPCTokenizer | 11.6954 | 2.8432 |
BART_RootTypeTokenizer | 11.8291 | 2.3438 |
BART_ChordSymbolTokenizer | 12.0435 | 2.2630 |
GPT_PitchClassTokenizer | 11.7803 | 2.8694 |
GPT_RootPCTokenizer | 11.6531 | 2.7705 |
GPT_RootTypeTokenizer | 11.9934 | 2.3962 |
GPT_ChordSymbolTokenizer | 11.8622 | 2.3069 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kaliakatsos-Papakostas, M.; Makris, D.; Soiledis, K.; Tsamis, K.-T.; Katsouros, V.; Cambouropoulos, E. HarmonyTok: Comparing Methods for Harmony Tokenization for Machine Learning. Information 2025, 16, 759. https://doi.org/10.3390/info16090759
Kaliakatsos-Papakostas M, Makris D, Soiledis K, Tsamis K-T, Katsouros V, Cambouropoulos E. HarmonyTok: Comparing Methods for Harmony Tokenization for Machine Learning. Information. 2025; 16(9):759. https://doi.org/10.3390/info16090759
Chicago/Turabian StyleKaliakatsos-Papakostas, Maximos, Dimos Makris, Konstantinos Soiledis, Konstantinos-Theodoros Tsamis, Vassilis Katsouros, and Emilios Cambouropoulos. 2025. "HarmonyTok: Comparing Methods for Harmony Tokenization for Machine Learning" Information 16, no. 9: 759. https://doi.org/10.3390/info16090759
APA StyleKaliakatsos-Papakostas, M., Makris, D., Soiledis, K., Tsamis, K.-T., Katsouros, V., & Cambouropoulos, E. (2025). HarmonyTok: Comparing Methods for Harmony Tokenization for Machine Learning. Information, 16(9), 759. https://doi.org/10.3390/info16090759