MDPI - Publisher of Open Access Journals

25 pages, 3702 KB

Open Access | Article

MELT: Optimization-Driven Music Emotion Learning with Temporal Token-Level Fusion

by Yihe Yin, Zhen Tian and Junming Chen

Mathematics 2026, 14(10), 1690; https://doi.org/10.3390/math14101690 - 15 May 2026

Viewed by 283

Abstract

Music emotion recognition (MER) can be formulated as a multimodal optimization problem that predicts an emotion label from coupled audio and lyric sequences. Existing methods typically perform unimodal learning or coarse global fusion, which overlooks fine-grained temporal-token correspondences between musical dynamics and lyric semantics. We propose MELT (Music Emotion Learning with Temporal token-level fusion), an optimization-driven framework with four modules: a BERT-based lyrics semantic encoder (LSE), a segment temporal encoder (STE) that models audio-segment dependencies via a Transformer, a token-level temporal fusion (TTF) module with gated cross-attention, and an emotion mood head (EMH) for four-class prediction. Training is conducted end-to-end by jointly minimizing a supervised classification term and an auxiliary cross-modal contrastive alignment term, yielding a unified objective that improves both class separability and representation consistency. On the MoodyLyrics benchmark, MELT achieves 87.6% weighted F1 for four-class emotion recognition (angry, happy, relaxed, sad), outperforming unimodal baselines and representative early/late fusion strategies. Ablation results further verify that temporal encoding, gated token-level fusion, and joint optimization each contribute to the final performance.

(This article belongs to the Special Issue Intelligent Mathematics and Applications)
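The joint objective described in the MELT abstract above (a supervised classification term plus an auxiliary cross-modal contrastive alignment term) might look roughly like the following sketch. The abstract does not give the exact form of either term, so everything here is an assumption: cross-entropy for classification, a symmetric InfoNCE loss for alignment, and the hypothetical names `joint_loss`, `weight`, and `temperature`. It is written in plain Python for readability and is not the authors' implementation.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class index."""
    losses = [-log_softmax(row)[y] for row, y in zip(logits, labels)]
    return sum(losses) / len(losses)

def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def contrastive_alignment(audio, lyrics, temperature=0.1):
    """Assumed symmetric InfoNCE: the i-th audio and i-th lyric
    embeddings in a batch are treated as the only positive pair."""
    a = [normalize(v) for v in audio]
    t = [normalize(v) for v in lyrics]
    # Cosine similarity matrix, scaled by the temperature.
    sim = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
           for ai in a]
    sim_transposed = [list(col) for col in zip(*sim)]
    targets = list(range(len(a)))
    return 0.5 * (cross_entropy(sim, targets)
                  + cross_entropy(sim_transposed, targets))

def joint_loss(logits, labels, audio, lyrics, weight=0.5):
    """Classification term plus a weighted alignment term.
    `weight` is a hypothetical hyperparameter, not a value from the paper."""
    return (cross_entropy(logits, labels)
            + weight * contrastive_alignment(audio, lyrics))
```

With `weight=0.0` the objective reduces to plain cross-entropy over the four mood classes, which is one way to reproduce a "no joint optimization" ablation under these assumptions.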


26 pages, 4013 KB

Open Access | Editor’s Choice | Article

Music Genre Classification Using Prosodic, Stylistic, Syntactic and Sentiment-Based Features

by Erik-Robert Kovacs and Stefan Baghiu

Big Data Cogn. Comput. 2025, 9(11), 296; https://doi.org/10.3390/bdcc9110296 - 19 Nov 2025

Viewed by 3907

Abstract

Romanian popular music has had a storied history across the last century and a half. Incorporating different influences at different times, today it boasts a wide range of both autochthonous and imported genres, such as traditional folk music, rock, rap, pop, and manele, to name a few. We aim to trace the linguistic differences between the lyrics of these genres using natural language processing and a computational linguistics approach by studying the prosodic, stylistic, syntactic, and sentiment-based features of each genre. For this purpose, we have crawled a dataset of ~14,000 Romanian songs from publicly available websites along with the user-provided genre labels, and characterized each song and each genre with regard to these features, discussing similarities and differences. We improve on existing tools for Romanian natural language processing by building a lexical analysis library, well suited to song lyrics or poetry, which encodes a set of 17 linguistic features. In addition, we build lexical analysis tools for profanity-based features and improve the SentiLex sentiment analysis library by manually rebalancing its lexemes to overcome the limitations introduced by its having been machine-translated into Romanian. We estimate the accuracy gain using a benchmark Romanian sentiment analysis dataset and register a 25% increase in accuracy over the SentiLex baseline. The contribution is meant to describe the characteristics of the Romanian expression of autochthonous as well as international genres and provide technical support to researchers in natural language processing, musicology, or the digital humanities in studying the lyrical content of Romanian music. We have released our data and code for research use.

(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))


23 pages, 952 KB

Open Access | Article

Multi-Modal Song Mood Detection with Deep Learning

by Konstantinos Pyrovolakis, Paraskevi Tzouveli and Giorgos Stamou

Sensors 2022, 22(3), 1065; https://doi.org/10.3390/s22031065 - 29 Jan 2022

Cited by 42 | Viewed by 9425

Abstract

The production and consumption of music in the contemporary era generate big data and create new needs for automated and more effective management of these data. Automated music mood detection is an active task in the field of MIR (Music Information Retrieval). The first approach to correlating music and mood was made in 1990 by Gordon Bruner, who researched the way that musical emotion affects marketing. In 2016, Lidy and Schindler trained a CNN for the task of genre and mood classification based on audio. In 2018, Delbouys et al. developed a multi-modal Deep Learning system combining CNN and LSTM architectures and concluded that multi-modal approaches outperform single-channel models. This work examines and compares single-channel and multi-modal approaches to music mood detection using Deep Learning architectures. Our first approach uses the audio signal and the lyrics of a musical track separately, while the second approach applies a uniform multi-modal analysis to classify the given data into mood classes. The data we use to train and evaluate our models come from the MoodyLyrics dataset, which includes 2000 song titles with labels from four mood classes, {happy, angry, sad, relaxed}. The result of this work is a uniform prediction of the mood that represents a music track, which is useful in many applications.

(This article belongs to the Collection Convolutional Neural Networks Applications in Sensing and Imaging: Architectures, Insight, Visualization, Transparency)


38 pages, 1405 KB

Open Access | Article

Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition

by Ben Wilkes, Igor Vatolkin and Heinrich Müller

Entropy 2021, 23(11), 1502; https://doi.org/10.3390/e23111502 - 12 Nov 2021

Cited by 6 | Viewed by 6229

Abstract

We present a multi-modal genre recognition framework that considers the audio, text, and image modalities through features extracted from audio signals, album cover images, and lyrics of music tracks. In contrast to pure learning of features by a neural network, as done in related work, handcrafted features designed for each modality are also integrated, allowing for higher interpretability of the created models and further theoretical analysis of the impact of individual features on genre prediction. Genre recognition is performed by binary classification of a music track with respect to each genre based on combinations of elementary features. For feature combination, a two-level technique is used, which combines aggregation into fixed-length feature vectors with confidence-based fusion of classification results. Extensive experiments have been conducted for three classifier models (Naïve Bayes, Support Vector Machine, and Random Forest) and numerous feature combinations. The results are presented visually, with data reduction for improved perceptibility achieved by multi-objective analysis and restriction to non-dominated data. Feature- and classifier-related hypotheses are formulated based on the data, and their statistical significance is formally analyzed. The statistical analysis shows that combining two modalities almost always leads to a significant increase in performance, and combining three modalities does so in several cases.

(This article belongs to the Special Issue Artificial Intelligence and Complexity in Art, Music, Games and Design II)
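The two-level combination mentioned in the abstract above (aggregation into fixed-length feature vectors, then confidence-based fusion of classification results) could be illustrated as follows. The abstract does not say which aggregation statistics or which confidence measure the authors use, so this sketch assumes mean and standard deviation per feature dimension, and a confidence equal to the distance of a binary classifier's probability from 0.5. The function names `aggregate` and `fuse_by_confidence` are hypothetical.

```python
import math

def aggregate(frames):
    """Level 1 (assumed): collapse a variable-length list of frame
    vectors into one fixed-length vector [means..., stds...]."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)]
    return means + stds

def fuse_by_confidence(probabilities):
    """Level 2 (assumed): combine per-modality estimates of
    P(track belongs to genre), weighting each by how far it is from 0.5."""
    weights = [abs(p - 0.5) * 2 for p in probabilities]
    total = sum(weights)
    if total == 0:
        return 0.5  # every classifier is maximally unsure
    return sum(w * p for w, p in zip(weights, probabilities)) / total

# Hypothetical usage: one binary decision per genre, three modalities.
audio_vector = aggregate([[1.0, 2.0], [3.0, 4.0]])
fused = fuse_by_confidence([0.9, 0.5, 0.6])
is_genre = fused >= 0.5
```

Under this weighting an undecided classifier (probability 0.5) contributes nothing, so a confident modality dominates the fused decision; other confidence measures would behave differently.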


Search Results (4)
