Diffusion-Inspired Masked Language Modeling for Symbolic Harmony Generation on a Fixed Time Grid

Kaliakatsos-Papakostas, Maximos; Makris, Dimos; Soiledis, Konstantinos; Tsamis, Konstantinos-Theodoros; Katsouros, Vassilis; Cambouropoulos, Emilios

doi:10.3390/app15179513


Open Access Article


by Maximos Kaliakatsos-Papakostas ¹,²,³,*, Dimos Makris ¹,³, Konstantinos Soiledis ¹,³, Konstantinos-Theodoros Tsamis ¹,³, Vassilis Katsouros ¹,² and Emilios Cambouropoulos ⁴

¹ Department of Music Technology and Acoustics, Hellenic Mediterranean University, 74100 Rethymno, Greece

² Institute of Language and Speech Processing, Athena RC, 15125 Marousi, Greece

³ Archimedes, Athena RC, 15125 Marousi, Greece

⁴ School of Music Studies, Aristotle University of Thessaloniki, 57001 Thessaloniki, Greece

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(17), 9513; https://doi.org/10.3390/app15179513 (registering DOI)

Submission received: 1 August 2025 / Revised: 22 August 2025 / Accepted: 28 August 2025 / Published: 29 August 2025

(This article belongs to the Special Issue The Age of Transformers: Emerging Trends and Applications)


Abstract

We present a novel encoder-only Transformer model for symbolic music harmony generation, based on a fixed time-grid representation of melody and harmony. Inspired by denoising diffusion processes, our model progressively unmasks harmony tokens over a sequence of discrete stages, learning to reconstruct the full harmonic structure from partial context. Unlike autoregressive models, this formulation enables flexible, non-sequential generation and supports explicit control over harmony placement. The model is stage-aware, receiving timestep embeddings analogous to diffusion timesteps, and is conditioned on both a binary piano roll and a pitch class roll to capture melodic context. We explore two unmasking schedules—random token revealing and midpoint doubling—both requiring a fixed and significantly reduced number of model calls at inference time. While our approach achieves competitive performance with strong autoregressive baselines (GPT-2 and BART) across several harmonic metrics, its key advantages lie in controllability, structured decoding with fixed inference steps, and alignment with musical structure. Ablation studies further highlight the role of stage awareness and pitch class conditioning. Our results position this method as a viable and interpretable alternative for symbolic harmony generation and a foundation for future work on structured, controllable musical modeling.

Keywords: generative MLM; grid-based time steps; melodic harmonization; transformers
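The staged unmasking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the two schedules in detail, so the code assumes that "midpoint doubling" reveals the first grid position and then the midpoints of progressively halved segments (doubling the number of revealed tokens per stage), and that "random token revealing" splits a shuffled list of positions into a fixed number of stages. The names `MASK`, `midpoint_doubling_schedule`, `random_schedule`, `generate`, and `dummy_predict` are hypothetical, and the stand-in predictor ignores the melodic conditioning that the real model receives.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token for a hidden harmony slot


def midpoint_doubling_schedule(n):
    """Reveal position 0, then the midpoints of ever-shorter segments.

    Assumption for this sketch: n is a power of two, so every grid
    position is revealed exactly once after log2(n) + 1 stages.
    """
    assert n > 0 and n & (n - 1) == 0, "this sketch assumes a power-of-two grid"
    stages = [[0]]
    step = n
    while step > 1:
        half = step // 2
        # Midpoints of all segments of length `step`.
        stages.append(list(range(half, n, step)))
        step = half
    return stages


def random_schedule(n, num_stages, seed=0):
    """Shuffle all positions and partition them into num_stages groups."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::num_stages] for i in range(num_stages)]


def generate(n, schedule, predict):
    """Start fully masked; make one predictor call per stage."""
    harmony = [MASK] * n
    for stage, positions in enumerate(schedule):
        # The predictor sees the partially unmasked grid and the stage index.
        proposals = predict(harmony, stage)
        for p in positions:
            harmony[p] = proposals[p]
    return harmony


def dummy_predict(harmony, stage):
    """Stand-in for a trained model: proposes one label per grid slot."""
    return [f"chord@stage{stage}"] * len(harmony)


grid = 16
for name, schedule in [
    ("midpoint doubling", midpoint_doubling_schedule(grid)),
    ("random", random_schedule(grid, num_stages=5)),
]:
    result = generate(grid, schedule, dummy_predict)
    print(name, "stages:", len(schedule), "still masked:", result.count(MASK))
```

Under these assumptions, a 16-slot grid is filled in 5 predictor calls with either schedule, whereas token-by-token decoding of the same grid would need one call per slot. This is the sense in which the number of inference steps is fixed in advance.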

