Article

DeepHits: A Multimodal CNN Approach to Hit Song Prediction

Department of Information Systems and Information Management, Faculty of Economics and Business, Goethe University Frankfurt, Theodor-W.-Adorno-Platz 4, 60323 Frankfurt am Main, Germany
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(3), 58; https://doi.org/10.3390/make8030058
Submission received: 1 January 2026 / Revised: 13 February 2026 / Accepted: 24 February 2026 / Published: 2 March 2026

Abstract

Hit Song Science aims to forecast a song’s success before release and benefits from integrating signals beyond audio content alone. We present DeepHits, an end-to-end multimodal network that combines (i) log-Mel spectrogram embeddings from a compact residual 2D-CNN, (ii) frozen multilingual BERT lyric embeddings, and (iii) structured numeric features including high-level Spotify audio descriptors and contextual metadata (artist popularity, release year). Evaluated on 92,517 tracks from the SpotGenTrack dataset, DeepHits achieves a macro-F1 of 52.20% (accuracy 82.63%) in the established three-class setting and a macro-F1 of 23.15% (accuracy 37.00%) in a ten-class decile benchmark. To contextualize fine-grained performance, we report capacity-controlled shallow baselines, including metadata-only and early/late fusion variants, and show that the deep multimodal model provides a clear gain over these references (e.g., metadata-only: macro-F1 20.92%; accuracy 34.22%). Ablation results indicate that removing metadata yields the largest degradation in class-balanced performance, highlighting the strong predictive value of artist popularity and release year. Overall, DeepHits provides a reproducible benchmark and modality analysis for fine-grained popularity prediction under class imbalance.

1. Introduction

Over the past decade, the music industry has undergone transformative changes in the ways music is produced, distributed, and consumed. Online streaming platforms like Spotify, Apple Music, and SoundCloud have generated vast amounts of music data, fueling the growth of Music Information Retrieval (MIR). MIR is a rapidly growing research area that uses data-driven methods to analyze and understand music. Within MIR, Hit Song Science (HSS) has become a specialized area focused on predicting whether songs will become commercially successful before their release [1].
In this work, we operationalize “song success” using Spotify’s popularity score provided in the SpotGenTrack dataset. This label is a continuous value on a 0–100 scale, where higher values indicate stronger listener engagement and mainstream traction on the platform. We study success prediction in two complementary formulations: (i) regression, which predicts the continuous popularity score, and (ii) classification, which maps popularity into interpretable success tiers. Specifically, for the three-level setting, we follow a low/mid/high scheme with thresholds 0–25 (low), 25–65 (mid), and 65–100 (high), and for fine-grained benchmarking, we additionally introduce a ten-level setting by dividing the 0–100 range into deciles (with the top decile covering 90–100). Throughout the paper, the term hit refers to songs in the upper success tiers under these operational definitions.
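Under these operational definitions, the mapping from a popularity score to the two label schemes can be sketched as a small function. Boundary membership (half-open lower bins, closed top bin) is our assumption, since the stated ranges share endpoints:

```python
def popularity_tiers(score):
    """Map a 0-100 Spotify popularity score to the two label schemes.

    Assumes half-open intervals with a closed top bin: low [0, 25),
    mid [25, 65), high [65, 100]; deciles 0-9 with the top decile
    covering [90, 100].
    """
    if not 0 <= score <= 100:
        raise ValueError("popularity must lie in [0, 100]")
    three = "low" if score < 25 else ("mid" if score < 65 else "high")
    ten = min(int(score // 10), 9)  # cap so that 100 falls in decile 9
    return three, ten
```

For example, `popularity_tiers(70)` yields `("high", 7)` under these assumptions.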
According to Pachet and Roy [2], HSS aims to identify features that make a song popular through computational analysis. This involves examining pre-release data, such as lyrics, audio characteristics, and metadata, to uncover patterns that signal commercial success. The practical significance of these predictive models is highlighted by the scale of the recorded music industry, which generated USD 28.6 billion in revenue in 2023 [3]. By forecasting a song’s popularity with accuracy, these models can provide valuable insights to industry stakeholders, helping them make informed decisions about marketing, investment, and promotion.
Initial studies in the field of HSS utilized conventional machine learning algorithms to address the task of hit prediction. For example, Dhanaraj and Logan [4] trained support vector machines and boosting classifiers on features such as Mel-Frequency Cepstral Coefficients (MFCCs) and topic-based lyric features to discriminate between hits and non-hits. While these approaches exhibited moderate predictive power, their reliance on manually engineered feature sets constrained their effectiveness. Pachet and Roy [2] later broadened the feature set by incorporating higher-level annotations and proprietary audio descriptors; nonetheless, their findings remained inconclusive, casting doubt on the viability of HSS.
However, these early methods in the field of HSS had significant limitations. First, most models were unimodal, analyzing only one modality (typically audio), and did not effectively combine audio and lyrics. Many studies focused exclusively on audio features, while lyrical content was either ignored or treated in isolation. Even when lyrics were considered, they were often represented by simple bag-of-words or “raw term” features rather than being truly fused with audio data [5]. Second, the feature sets were largely handcrafted and simplistic. For example, the authors of [4] used low-level spectral descriptors (e.g., MFCCs) and simple topic-based lyric features, while the authors of [2] expanded this with the use of higher-level audio descriptors; however, these manually engineered features proved insufficient for robust prediction. Indeed, Pachet and Roy’s [2] large-scale study found that the commonly used audio features (timbre, tempo, etc.) were not reliable predictors of a song’s success. Additionally, the authors of [5] noted that adding more attributes resulted in only marginal gains.
In addition, classical approaches offered little interpretability—they could classify hits versus non-hits but rarely explained which specific features drove these predictions. This lack of transparency meant that researchers and practitioners gained limited insight into why a song was predicted to succeed or fail. To meet both explanatory needs—and to remain directly comparable with earlier Hit Song Science studies that report results in both formats—we consider two complementary settings: (i) regression, which preserves Spotify’s continuous 0–100 popularity scale, and (ii) classification, whose discrete bins provide interpretable categories and actionable class scores. Details of the discretization into success tiers are provided in Equations (2) and (3). Early studies typically did not include analyses of feature importance or model decision logic, treating the prediction mechanism as a black box. Finally, none of these pre-deep-learning methods were end-to-end learnable. Feature extraction and modeling were disjoint steps, and the models could not adjust or learn feature representations during training.
In recent years, the introduction of deep learning into MIR has unlocked new avenues for enhancing hit song prediction [6,7]. Deep neural networks can automatically learn complex, hierarchical representations from raw musical inputs, potentially circumventing the limitations inherent in manual feature engineering. As a result, contemporary research has increasingly adopted end-to-end architectures that fuse multiple data modalities (audio, lyrics, and metadata) within a single predictive framework [8].
Table 1 reviews prior work in Hit Song Science. Studies marked ✓ in the ‘Benchmark’ column are the most directly comparable recent multimodal hit-song prediction systems and serve as benchmark baselines in Section 3; the remaining studies are included for historical completeness.
Several distinctions set our approach apart from these reference methods. Across the board, earlier studies devote little space to explaining why their multimodal pipelines succeed; for example, headline metrics are reported, but the contribution of each modality is seldom dissected. Zangerle et al. [9] combine low- and high-level audio features with release-year metadata to predict Billboard hits, yet omit lyrical information altogether. Vavaroutsos and Vikatos [10] extend that framework by adding a BERT-based lyric branch while deliberately keeping the audio pathway lightweight. Instead of high-resolution spectrograms, they feed a 2D CNN with 13 × 13 MFCC patches, producing a heavily compressed representation of the signal. The resulting “MFCC + lyrics” system improves accuracy by roughly eight percentage points compared to the previous best, suggesting that the gain is driven mainly by text rather than by a sophisticated audio encoder. Therefore, the specific role of the audio branch remains opaque. Martín-Gutiérrez et al. [11] go a step further in terms of fusion, but publish only aggregate end-to-end scores, leaving modality-level insights unexplored.
Building on prior work, we present DeepHits, an end-to-end architecture that jointly exploits spectrogram-based audio CNNs, BERT-based lyric embeddings, and structured release metadata. To the best of our knowledge, this is the first model to unify all three information sources for fine-grained, ten-class hit song prediction. A comprehensive ablation study disentangles the contribution of each modality and pinpoints where performance gains originate, offering clearer insight than earlier two-stream approaches.
In summary, our contributions are as follows:
  • Multimodal Integration: We integrate heterogeneous data sources—low-level and high-level audio features, song lyrics, and metadata (e.g., artist popularity)—into a single deep learning model for hit prediction.
  • Spectrogram-Based CNN: Although convolutional neural networks (CNNs) have previously been employed in HSS [6,12], most existing work has focused on audio-only pipelines or has lacked the simultaneous integration of multiple modalities. By applying 2D CNNs to the log-Mel spectrograms of audio signals, we aim to capture fine-grained temporal and frequency patterns that are often difficult to detect through manually engineered features. Crucially, we combine these CNN-derived audio embeddings with lyrical and metadata inputs, thereby extending the use of CNNs beyond isolated audio analysis and enabling a richer, multimodal representation of each track. While several recent studies have explored multimodal approaches that combine audio, lyrics, and metadata [9,10,11], they do not incorporate advanced 2D CNN architectures applied to spectrograms. We thus advance prior multimodal efforts by integrating spectrogram-based 2D CNNs within a unified predictive framework, allowing for more sophisticated audio representation learning in tandem with textual and contextual features.
The rest of this paper is structured as follows: Section 2 details the dataset, model implementation, and experimental setup. Section 3 reports our results, encompassing both the classification and regression analyses, as well as the outcomes of the ablation study. Section 4 summarizes the main findings, while Section 5 concludes the paper by discussing the implications of the findings and outlining directions for future work.
Table 1. Prior work in Hit Song Science.
| Study | Method Type | Modalities | Dataset | Results | Benchmark |
|---|---|---|---|---|---|
| Dhanaraj and Logan (2005) [4] | SVM and boosting classifiers | Audio (MFCCs) and lyrics (topics) | Hit vs. non-hit songs (Billboard charts) | Some predictive power, but limited by simple features (accuracy modestly above chance) | – |
| Pachet and Roy (2008) [2] | Feature-based classification | Audio (proprietary descriptors) | 32,000 songs (hits vs. others) | Inconclusive results; no robust hit prediction achieved (HSS “not yet a science”) | – |
| Yang et al. (2017) [6] | Deep CNN on spectrograms | Audio only | Western and Chinese pop hits (streaming play-count data) | Deep model outperformed shallow models in popularity prediction (higher accuracy than MLP/SVM) | – |
| Choi et al. (2017) [7] | Convolutional recurrent neural network (CRNN) | Audio (Mel spectrogram) | Million Song Dataset (MSD), tag-wise AUC evaluation | AUC 0.8950 (best CRNN model) | – |
| Oramas et al. (2017) [8] | Deep multimodal pipeline: text FFN + audio CNN → late-fusion MLP | Audio (CQT) and artist biography text | MSD-A (328,000 tracks/24,000 artists; Million Song Dataset + metadata) | Cold-start song recommendation: MAP@500 = 0.0036 | – |
| Delbouys et al. (2018) [13] | CNN (audio) and LSTM (lyrics); tested early vs. late fusion | Audio and lyrics | 18,644 songs (Deezer Mood Detection Dataset) | Arousal: R² 0.235 (audio deep > classical); valence: mid-fusion R² 0.219 > unimodal (best fusion) | – |
| Zangerle et al. (2019) [9] | Wide-and-deep neural network (dense features and deep audio net) | Audio (MFCCs, high-level features) and metadata (release year) | Million Song Dataset ± Billboard Hot 100 labels (11,000 songs) | Acc 75%; fusion > low- or high-level alone | ✓ |
| Martín-Gutiérrez et al. (2020) [11] | Deep autoencoder and fully connected DNN | Audio, lyrics, and metadata | SpotGenTrack Popularity Dataset, 101,939 tracks scraped from Spotify + Genius | ~83% accuracy in three-class popularity classification (outperforms prior models by integrating all modalities) | ✓ |
| Vavaroutsos and Vikatos (2023) [10] | Deep multimodal net with triplet loss (metric learning); CNN handles low-level audio | Audio, lyrics, and metadata (year) | Hit Song Prediction Dataset + Genius lyrics (11,634 songs, balanced) | ~80% accuracy for hit vs. non-hit prediction, about 8% higher than the previous audio-only baseline (lyric and temporal features boost performance) | ✓ |
The column “Benchmark” indicates whether the study is used as a direct baseline comparison in this paper.

2. Materials and Methods

2.1. Dataset

The dataset for this study is drawn from the SpotGenTrack Popularity Dataset, which is a widely used resource for multimodal deep learning approaches for hit song prediction [14]. To capture global listening trends, the original curators queried every Top 50 playlist across major categories in 26 Spotify territories, gathering every track that appeared. The final collection contains 101,939 songs by 56,129 artists from 75,511 albums, which were released between 1957 and 2020. Most artists and albums contribute only a small number of tracks (long-tail distribution), indicating that the dataset is not dominated by a few prolific artists or albums (Figure 1).
A static copy curated by the original authors is available on Mendeley, providing the full set of audio previews, metadata, and lyric links [14]. It integrates a rich assortment of feature types to support an in-depth analysis of song popularity. The dataset provides high-level Spotify audio descriptors (e.g., danceability, energy, tempo) as well as contextual metadata (artist popularity, release year) and lyric resources/embeddings. In addition, we derive low-level audio representations by converting the 30 s previews into log-Mel spectrograms suitable for CNN input.
Table 2 summarizes the dataset at a glance, including the representation, number of features, dataset size, and the number of classes used throughout this work.
This multimodal design gives each track a holistic fingerprint that covers intrinsic musical traits and external context alike. Such breadth is essential for modern deep learning pipelines, which learn to fuse heterogeneous cues rather than rely on any single feature family [13]. Figure 2 illustrates how the raw audio, lyric, and metadata streams are converted into modality-specific feature representations and subsequently encoded, fused, and mapped to the final popularity prediction output.
  • High-Level Audio Descriptors (Spotify Audio Features)
High-level audio descriptors (Spotify audio features) are supplied by Spotify’s proprietary machine learning models. As listed in Table 3, they cover attributes such as danceability, energy, and acousticness, each capturing a different aspect of a track’s musical or acoustic profile. Spotify also assigns every song a popularity score ranging from 0 to 100. This composite metric is driven by both the cumulative number of streams and by how recent those streams are; it serves as the target variable for our hit prediction experiments.
The popularity scores follow an almost normal distribution, meaning most songs fall in the middle of the scale and only a small fraction achieve extremely high or low values.
Figure 3 visualizes the resulting class distributions for both labelings. While the 3-class setting is comparatively balanced, the 10-class decile formulation is strongly long-tailed, with very few tracks in the highest popularity tiers. Therefore, we consistently report macro-averaged metrics in addition to accuracy, and we include imbalance-aware control baselines (Section 3.3.1).
A closer look at Spotify’s high-level audio descriptors over time reveals clear patterns that mirror shifting listener tastes. Acousticness declined steadily from the 1960s to the 2000s, echoing the rise in electronic production, before edging upward again in the 2010s. In contrast, both danceability and energy have increased almost continuously across the decades, whereas valence, which is a proxy for musical positivity, has demonstrated a downward trend. More recently, instrumentalness and speechiness have risen, suggesting growing interest in instrumental passages and spoken-word elements. Taken together, these temporal trends underscore the fluid nature of popular music and illustrate why models that capture time dynamics are better equipped to predict future hits.
  • Low-Level Audio (Log-Mel Spectrograms)
The dataset provides 30 s audio previews but does not include precomputed spectrogram tensors for CNN input; therefore, we convert each preview into a log-Mel spectrogram. Specifically, we downloaded 30 s MP3 previews via the Spotify API and converted each clip into a Mel spectrogram, an effective input representation for deep learning models.
The conversion pipeline proceeded in two steps. First, the preview signal was projected into the frequency domain with a short-time Fourier Transform (STFT) to create a linear-frequency spectrogram. Second, we passed the spectrogram through a Mel filter bank that compresses higher frequencies and expands lower ones to approximate human auditory perception, and we applied a logarithmic amplitude scale to improve interpretability.
For consistency, we used the following parameters in the conversion pipeline: a sampling rate of 22,050 Hz, an FFT window size of 1024, a hop length of 512, and 128 Mel frequency bands. The resulting log-Mel spectrograms serve as input to a convolutional neural network, allowing automated feature extraction that supports the accurate prediction of song popularity.
  • High-level audio descriptors derived with the Essentia library (Spotify-style)
For tracks that are not available on Spotify, we recreated Spotify-style high-level descriptors using the Essentia library [15], an open-source audio analysis toolkit. Essentia provides transparent, scalable algorithms for measures such as spectral complexity, rhythmic strength, and overall energy, allowing us to generate Spotify-style features that anyone can reproduce. Descriptors not available in Essentia (speechiness and liveness) were excluded from our cross-platform experiments.
  • Missing-descriptor sensitivity (speechiness, liveness)
Our Essentia-based cross-platform feature configuration does not currently reproduce Spotify’s speechiness and liveness descriptors; consequently, these two features are omitted when applying the pipeline to non-Spotify tracks. To quantify the potential impact of this mismatch, we conducted a sensitivity analysis on the Spotify subset by training a capacity-controlled early-fusion baseline with (i) all descriptors, (ii) without speechiness, (iii) without liveness, and (iv) without both. As summarized in Appendix A Table A8, removing liveness has a negligible effect, while removing speechiness leads to only a small reduction in 10-class macro-F1 (~1.2 percentage points) with virtually unchanged accuracy. In the 3-class setting, the impact is negligible (<0.1 percentage points). This suggests that cross-platform use without these two descriptors is unlikely to substantially alter overall performance trends, although improving the replication of such descriptors remains a direction for future work.
  • Lyric Embeddings
We retrieved song lyrics from the SpotGenTrack Popularity Dataset using the Genius API and encoded them with a multilingual BERT model [16], following Vavaroutsos and Vikatos [10]. We use a pre-trained multilingual BERT model (no fine-tuning) and extract a fixed 768-D pooled [CLS] representation (pooler output) for each track. These precomputed 768-D vectors are passed to our lyric branch, a two-layer MLP (768→256→256, ReLU), which produces a compact lyric embedding for early fusion. We chose the multilingual variant to support non-English lyrics in a multi-territory corpus without language-specific filtering, while keeping a single uniform text encoder for all tracks.
Importantly, lyrics availability is imperfect, and a non-negligible fraction of lyric texts exceed the 512-token context window; thus, a fair fine-tuning setup would require careful handling of missing-lyrics cases and long-sequence encoding (e.g., sliding windows and pooling). For transparency, we summarize lyrics coverage and length statistics in Appendix A Table A3 (and Appendix A Figure A2).
We use static lyric embeddings as they provide a strong and efficient baseline in MIR and NLP [17,18], avoiding the substantial computational and tuning overhead of task-specific transformer fine-tuning. While prior work suggests that fine-tuned transformer encoders can improve MIR tasks [19], we treat fine-tuning on the popularity prediction objective (and domain-adaptive pretraining on large lyric corpora) as an important direction for future work, particularly for the fine-grained 10-class setting.
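The lyric branch described above (a 768→256→256 MLP over frozen pooled BERT vectors) can be sketched as follows. The random input vector stands in for a real multilingual-BERT pooler output; the BERT extraction step itself (tokenization and forward pass) is omitted here:

```python
import torch
import torch.nn as nn

# Lyric branch: map a precomputed 768-D pooled [CLS] representation
# from frozen multilingual BERT to a compact 256-D embedding.
lyric_branch = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)

cls_vec = torch.randn(1, 768)  # placeholder for a real BERT pooler output
emb = lyric_branch(cls_vec)
print(emb.shape)  # torch.Size([1, 256])
```

Because the BERT encoder is frozen, these 768-D vectors can be precomputed once per track, making training of the lyric branch inexpensive.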
  • Metadata Features
We supplement the audio and lyric inputs with two pieces of contextual metadata—release year and artist popularity. Release year places each song in its historical moment, allowing the model to learn how listener tastes and production styles change over time. Artist popularity, represented by Spotify’s 0–100 score that reflects the audience reach of an artist’s entire catalog, captures the market advantage of acts with an established fan base. Together, these metadata cues add external context that is known to influence commercial success.
  • Data Preprocessing
To ensure consistent data quality, we applied three filtering steps to focus on songs with reliable 30 s audio inputs and to reduce non-song spoken-word content. First, we removed speech-dominated items (e.g., podcasts and audiobooks) by discarding tracks with a Spotify speechiness score above 0.66, reducing the corpus from 101,939 to 96,577 tracks. This cutoff follows Spotify’s interpretive guideline that values above 0.66 are likely composed predominantly of spoken words, while mid-range values (0.33–0.66) can include speech-heavy music such as rap [20]. Second, we excluded tracks longer than seven minutes, yielding 92,785 tracks. Third, we dropped tracks shorter than 30 s; the CNN requires at least that much audio to extract reliable low-level features, producing a final working set of 92,517 tracks (90.8% of the raw corpus). Table 4 summarizes the impact of each filtering step on the dataset size.
To assess whether the speechiness filter could inadvertently bias against rap/hip-hop, we inspected Spotify genre tags of the removed tracks. The excluded set is dominated by spoken-word genres (e.g., audio drama, reading, poetry, comedy; Table A6), whereas rap/hip-hop accounts for only 52 of 5362 speech-filtered tracks (<1%) and <0.4% of all rap/hip-hop tracks in the raw corpus. Moreover, a sensitivity sweep around the chosen cutoff (0.60/0.66/0.70) yields very similar dataset sizes (±0.3%; Table A7), indicating that our conclusions are not sensitive to minor threshold perturbations.
All numerical high-level descriptors and metadata were standardized with z-scores. The categorical fields “key” and “mode” were one-hot encoded—“mode” became a single binary flag, while “key” expanded into thirteen dimensions (the twelve pitch classes plus an “undefined” category). Log-Mel spectrogram values were also z-score normalized, ensuring that every input feature was on a comparable scale.
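The filtering and encoding steps can be sketched in pandas; the column names (speechiness, duration_ms, key) follow Spotify's API conventions and are assumptions about the working dataframe, and only a few illustrative numeric columns are standardized:

```python
import pandas as pd

def preprocess(df):
    """Apply the three filtering steps, then z-score standardization
    and one-hot encoding of the categorical "key" field."""
    df = df[df["speechiness"] <= 0.66]           # drop speech-dominated items
    df = df[df["duration_ms"] <= 7 * 60 * 1000]  # drop tracks longer than 7 min
    df = df[df["duration_ms"] >= 30 * 1000]      # CNN needs at least 30 s
    df = df.copy()
    for col in ("danceability", "energy", "tempo"):  # illustrative subset
        df[col] = (df[col] - df[col].mean()) / df[col].std()
    # "key": twelve pitch classes plus an undefined category (-1)
    key_onehot = pd.get_dummies(df["key"], prefix="key")
    return pd.concat([df.drop(columns=["key"]), key_onehot], axis=1)

# Toy dataframe standing in for the real SpotGenTrack table.
toy = pd.DataFrame({
    "speechiness":  [0.90, 0.10, 0.20, 0.05],
    "duration_ms":  [200_000, 500_000, 180_000, 240_000],
    "danceability": [0.5, 0.6, 0.7, 0.4],
    "energy":       [0.3, 0.9, 0.5, 0.6],
    "tempo":        [120.0, 90.0, 140.0, 100.0],
    "key":          [0, 5, 7, -1],
})
clean = preprocess(toy)
print(len(clean))  # two rows survive the three filters
```

On the toy data, the first row is removed by the speechiness cutoff and the second by the seven-minute limit, mirroring the corpus reduction described above.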

2.2. Implementation

2.2.1. Model Architecture

The DeepHits base architecture consists of three distinct modules. Each module processes a specific type of input feature—(1) low-level audio, (2) lyric embeddings, and (3) structured numeric features (high-level audio descriptors + metadata)—and outputs a corresponding vector. We then concatenate these three vectors into a single representation and feed it into the unified prediction layer. The complete CNN architecture, including kernel sizes, output shapes, and parameter counts, is listed in Table A1.
We adopt a compact residual 2D-CNN for the log-Mel spectrogram input because 2D convolutions are well-suited to capture local time–frequency patterns, while residual connections support stable optimization in deeper feature extractors. The audio branch is intentionally kept lightweight (three residual blocks followed by adaptive average pooling and two fully connected layers) to limit overfitting risk and to keep the overall multimodal model deployable and comparable across variants. Consistent with the early-fusion design, the audio encoder outputs a fixed 64-dimensional embedding, which keeps the fused representation manageable (384-D after concatenation with the lyric and numeric branches) while still allowing the model to learn joint representations end-to-end. The exact layer configuration (kernels, strides, shapes, and parameter counts) is provided in Table A1.
We did not perform an exhaustive architecture search; instead, we selected this compact ResNet-style configuration as a pragmatic trade-off between representation capacity and overfitting risk, guided by established spectrogram-CNN practice in MIR and by our goal of keeping the fused model lightweight and reproducible.
We fuse the modalities early, through direct concatenation, because extensive MIR research shows that this simple strategy already yields large gains over unimodal baselines. It has proved effective in drum transcription, multimodal genre classification [21], and recent large-scale hit song predictors [11]. Comparative studies report that late fusion and attention mechanisms rarely deliver consistent improvements and add substantial computational overhead [13]. In music mood detection, for instance, adding lyrics via late fusion conferred no benefit over an audio-only model, whereas an integrated (mid/early) fusion produced sizeable gains in valence prediction [13]. Surveys and tutorials further stress that the marginal accuracy improvements offered by attention-based mechanisms often fail to offset their added architectural and computational complexity in practice [21]. Collectively, these findings support our choice to concatenate modality-specific embeddings before the prediction layer, allowing the network to learn joint representations while keeping the model compact, reproducible, and easily deployable.
Although the raw inputs differ substantially in dimensionality (log-Mel spectrogram tensors, 768-D multilingual BERT representations, and 26-D structured numeric features after encoding), each modality branch maps its input into a compact latent vector before fusion (audio: 64-D; structured numeric: 64-D; lyrics: 256-D). These embeddings are concatenated into a 384-D fused representation, keeping the fusion stage lightweight while allowing the network to learn modality weights end-to-end. Recent hit song predictors and other cross-domain systems fuse high-dimensional video/audio features with only a handful of scalar metadata cues, reaching a state-of-the-art performance without the need for explicit dimensional re-balancing [11]. In our implementation, each modality branch already applies a suitable normalization step—batch normalization for the deep audio/lyric embeddings and z-score normalization for the scalar numeric features (high-level audio descriptors and metadata)—so that the concatenated vectors share comparable feature scales; the subsequent fusion MLP can therefore learn modality weights automatically. We thus keep the original dimensionalities, which is consistent with our focus on intrinsic musical content rather than an extensive metadata catalog. The ablation results (Section 3.3.2) indicate that metadata provides the strongest predictive signal in our setting, while the current frozen lyric representations add only limited incremental information. This motivates future work on lyric-domain adaptation and task-specific fine-tuning.
  • Low-Level Audio Features (log-Mel spectrograms)
The first module processes log-Mel spectrograms through a series of convolutional layers. It begins with a 2D convolutional layer (kernel size = 5, stride = 2, padding = 2), followed by batch normalization to stabilize gradients and accelerate convergence. The convolutional weights are initialized using the Kaiming He method, which preserves variance across layers in deep neural networks with ReLU activations [22].
ReLU activation functions introduce the necessary non-linearity after each convolutional layer. Subsequently, three residual blocks are applied to facilitate gradient flow and mitigate vanishing gradients, building on the success of residual networks. Each residual block contains two convolutional layers, each followed by batch normalization and ReLU activation. The first convolution within a residual block applies a stride of 2 to downsample the feature map, reducing spatial dimensions while increasing the depth of the feature maps. If dimension alignment is required, a 1 × 1 convolution followed by batch normalization is applied to match the input and output dimensions.
After the final residual block, adaptive average pooling reduces the 2D feature maps to a fixed 1 × 1 spatial dimension, producing an output vector whose length equals the number of output channels. This vector is then fed into two fully connected layers, each followed by a ReLU activation, to further refine and project the learned audio representations.
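The audio branch described above can be sketched in PyTorch. The channel widths (16/32/64/128) and FC sizes are assumptions for illustration; the exact configuration is deferred to Table A1 in the paper:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: two 3x3 convs (the first with stride 2 to
    downsample), each followed by batch norm; a 1x1 conv with batch
    norm aligns the shortcut dimensions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.short = nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride=2),
                                   nn.BatchNorm2d(c_out))
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return self.relu(h + self.short(x))

class AudioBranch(nn.Module):
    """Sketch of the audio encoder: stem conv (k=5, s=2, p=2) + BN +
    ReLU, three residual blocks, adaptive average pooling, and two FC
    layers producing a 64-D embedding."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 16, 5, stride=2, padding=2),
                                  nn.BatchNorm2d(16), nn.ReLU())
        self.blocks = nn.Sequential(ResBlock(16, 32), ResBlock(32, 64),
                                    ResBlock(64, 128))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(nn.Linear(128, 128), nn.ReLU(),
                                  nn.Linear(128, emb_dim), nn.ReLU())
        for m in self.modules():  # Kaiming He initialization for convs
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")

    def forward(self, x):  # x: (batch, 1, mel bands, frames)
        h = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.head(h)

out = AudioBranch()(torch.randn(2, 1, 128, 1292))
print(out.shape)  # torch.Size([2, 64])
```

Adaptive average pooling makes the 64-D output independent of the number of spectrogram frames, so previews of slightly different lengths pose no problem.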
  • Lyric Embeddings (Multilingual BERT)
The second module processes the lyric embeddings produced by a multilingual BERT model. These embeddings originate from the CLS (classification) token of the final hidden state, providing a compact, context-aware representation of the lyrics. Since this module does not involve convolutional operations, it employs a simpler two-layer feedforward network with ReLU activations. This network reduces the dimensionality of the lyric embeddings and extracts features relevant to song popularity.
  • High-Level Numeric Features and Metadata
The third branch receives high-level audio descriptors and contextual metadata—danceability, energy, release year, artist popularity, etc.—and maps them into a shared latent space. It uses a two-layer multilayer perceptron with ReLU activations, matching the structure of the lyrics branch. This subnetwork transforms the standardized numerical inputs into a compact feature vector that encapsulates the salient information needed for accurate hit prediction.
  • Feature Fusion and Output Layer
After processing the distinct input modalities, we perform early fusion by concatenating the outputs of the three sub-networks into a single feature vector. This unified representation is then passed through a series of fully connected layers with ReLU activations, culminating in an output layer that produces either a continuous value (for regression) or a vector of class scores (softmax outputs) over popularity classes (for classification). These scores are not calibrated probabilities. By maintaining separate processing pathways for audio, textual, and metadata features, but ultimately learning a joint embedding, our architecture effectively captures both modality-specific patterns and their combined influence on song popularity.
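The fusion stage can be sketched as follows; the hidden width (256) of the prediction head is an assumption, as the exact sizes are deferred to Table A1:

```python
import torch
import torch.nn as nn

# Early fusion sketch: concatenate the three branch embeddings
# (audio 64-D, lyrics 256-D, numeric 64-D -> 384-D) and map the fused
# vector to class scores (a single output would serve the regression case).
fusion_head = nn.Sequential(
    nn.Linear(64 + 256 + 64, 256), nn.ReLU(),
    nn.Linear(256, 10),  # ten decile classes
)

audio  = torch.randn(1, 64)
lyric  = torch.randn(1, 256)
numeric = torch.randn(1, 64)
scores = fusion_head(torch.cat([audio, lyric, numeric], dim=1))
print(scores.shape)  # torch.Size([1, 10])
```

Softmax is applied implicitly inside the cross-entropy loss during training; as noted above, the resulting scores are not calibrated probabilities.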

2.2.2. Training Configuration

To obtain a representative training–validation split for the deep multimodal models, we follow the original protocol by randomly shuffling the dataset and partitioning it into 80% training and 20% validation sets. Given the large sample size (N = 92,517), this random partitioning is expected to closely approximate stratification. For the capacity-controlled shallow baselines and the multi-seed robustness check (Appendix A Table A4), we additionally use stratified 80/20 splits to explicitly preserve class ratios under strong imbalance.
Model training employed the Adam optimizer with the following hyperparameters: learning rate (α) = 0.01, exponential decay rate for the first-moment estimates (β1) = 0.9, exponential decay rate for the second-moment estimates (β2) = 0.999, and L2 weight decay (λ) = 0.001. The numerical-stability constant ε was left at its default value (ε = 10⁻⁸). We employed a OneCycleLR scheduler to modulate the learning rate throughout training (updated after each batch) [23]. This schedule begins at a low rate, linearly increases to a maximum learning rate of 0.1, and then decays, helping the optimizer escape local minima and saddle points and accelerating convergence. For regression, we optimized mean squared error (MSE) loss; for classification, we used cross-entropy loss.
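The optimizer and scheduler configuration can be sketched as follows; the stand-in model and the steps-per-epoch value are illustrative placeholders, while the Adam settings and max_lr follow the text (note that OneCycleLR overrides the constructor learning rate once stepping begins):

```python
import torch
import torch.nn as nn

model = nn.Linear(26, 10)  # stand-in for the multimodal network
optimizer = torch.optim.Adam(
    model.parameters(), lr=0.01, betas=(0.9, 0.999),
    eps=1e-8, weight_decay=0.001,
)
# OneCycleLR is stepped after every batch: warm up toward max_lr, then decay.
steps_per_epoch, epochs = 290, 25  # ~74k training samples / batch size 256 (illustrative)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=epochs, steps_per_epoch=steps_per_epoch,
)
loss_fn = nn.CrossEntropyLoss()  # nn.MSELoss() for the regression variant

# One training step on placeholder data
x, y = torch.randn(256, 26), torch.randint(0, 10, (256,))
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()  # scheduler advances once per batch, not per epoch
```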
After each epoch, we evaluated the model on the held-out validation split, saved a checkpoint, and selected the final checkpoint as the epoch with the best validation macro-F1 (accuracy reported as secondary). Unless stated otherwise, all deep-model metrics correspond to this selected checkpoint and are measured on the same held-out validation split. Preliminary experiments indicated convergence around 20 epochs, so we trained for 25 epochs with a batch size of 256.
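The per-epoch checkpoint-selection loop can be sketched as follows; the label and prediction arrays are placeholders for the held-out validation split, and the training step is elided:

```python
from sklearn.metrics import f1_score

best = {"macro_f1": -1.0, "epoch": None}
for epoch in range(25):
    # ... train one epoch, then predict on the held-out validation split ...
    y_true = [0, 1, 2, 1, 0, 2]      # placeholder validation labels
    y_pred = [0, 1, 1, 1, 0, 2]      # placeholder model predictions
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    if macro_f1 > best["macro_f1"]:  # keep the checkpoint with the best macro-F1
        best = {"macro_f1": macro_f1, "epoch": epoch}
        # torch.save(model.state_dict(), f"checkpoint_{epoch}.pt")
```

All reported deep-model metrics then correspond to the single checkpoint stored in `best`.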

2.2.3. Evaluation Metrics and Analysis

Since our task involves both regression and classification, we apply task-specific evaluation metrics accordingly.
  • Regression Models
We use mean squared error (MSE) to assess the performance of regression models, measuring the average squared difference between predicted and actual popularity scores, as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2$$
MSE penalizes larger errors more heavily, making it suitable for detecting severe over- or underestimation of song popularity.
  • Classification Models
For classification, we evaluate overall accuracy, macro-precision, macro-recall, and macro-F1 based on the confusion matrix [24]. We derive hard predictions via argmax over the class-score vector (no fixed 0.5 threshold) and compute per-class metrics from the resulting confusion matrix; macro averages are the unweighted mean across classes. Given the class imbalance across popularity tiers, we prioritize macro-F1 in our interpretation. In addition, we report threshold-independent ranking metrics that are informative under imbalance, namely macro ROC-AUC and macro PR-AUC, computed in a one-vs-rest manner from the class scores and macro-averaged across classes. These metrics complement macro-F1 by summarizing class separability without fixing a decision threshold.
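The metric computation can be sketched with scikit-learn (an assumption about tooling; the class scores here are random stand-ins for model outputs):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_classes = 3
scores = rng.random((200, n_classes))             # stand-in class-score vectors
scores /= scores.sum(axis=1, keepdims=True)       # row-normalize for ROC-AUC
y_true = rng.integers(0, n_classes, size=200)

# Hard predictions via argmax (no fixed threshold)
y_pred = scores.argmax(axis=1)
macro_f1 = f1_score(y_true, y_pred, average="macro")
macro_prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_rec = recall_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

# Threshold-independent, one-vs-rest ranking metrics
macro_roc_auc = roc_auc_score(y_true, scores, multi_class="ovr", average="macro")
y_bin = label_binarize(y_true, classes=list(range(n_classes)))
macro_pr_auc = average_precision_score(y_bin, scores, average="macro")
```

Macro averaging weights every popularity tier equally, which is why it is prioritized under class imbalance.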
  • Regularization and overfitting control
To mitigate overfitting, we employ two complementary strategies. First, a dropout layer (p = 0.5) is inserted after the modality-fusion stage, randomly zeroing half of the activations during training. Second, the Adam optimizer uses weight decay (λ = 0.001), providing L2 regularization on all trainable parameters. Early stopping was evaluated in pilot runs and marginally reduced validation loss; however, we omitted it in the final experiments to ensure that the training schedule was identical across all model variants (fixed 25 epochs), thereby ensuring strict comparability. Together with the deliberately compact network size (~4.8 × 10⁵ parameters for 9.2 × 10⁴ songs), these measures proved sufficient to prevent performance divergence between training and validation sets.

3. Results

3.1. Regression vs. Classification

The popularity labels in the dataset are continuous values ranging from 0 to 100, making the task inherently suited to a regression formulation. Regression models output a continuous popularity score that can be compared directly with the ground-truth labels. Classification models, however, offer additional benefits: they produce interpretable tier assignments and per-class scores. These model outputs (scores) can be used as a relative indication of prediction confidence (e.g., a higher score or larger margin), but they are not calibrated probabilities and do not constitute a formal uncertainty estimate. A principled assessment of predictive uncertainty would require additional calibration or uncertainty-estimation methods (e.g., temperature scaling, ensembles, or Bayesian approaches), which we leave for future work. Classification also permits a broader range of performance metrics, such as precision, recall, and F1-score, which are invaluable for comprehensive model evaluation. We report macro-averaged precision, recall, and F1-score throughout to treat each popularity tier equally.
To bridge the gap between regression and classification, we adopt a hybrid approach in which regression outputs are discretized into classes, following previous studies [10,11].
  • Data split and stratification
For all baseline comparisons reported in Table 6, we use a stratified 80/20 split to preserve the class distribution across partitions under strong imbalance (especially in the ten-class decile setting). In addition, we repeat the stratified baseline evaluation across five random seeds and report mean ± standard deviation in Appendix A Table A4. For the deep multimodal models, we keep the train/validation split fixed (following the original protocol) to ensure comparability across deep-model variants. Because the shallow baselines show low variability across multiple stratified splits (Appendix A Table A4), they provide a stable reference level for contextualizing performance. A strict head-to-head comparison of deep and shallow models would require evaluating all models on identical multi-seed splits, which we leave for future work.
  • Model selection
After each epoch, we evaluated performance on the validation split and saved a checkpoint. For all reported results, we selected the checkpoint with the best validation performance (macro-F1 as the primary criterion; accuracy reported as secondary) and report the corresponding metrics on the same held-out validation split. The same split and selection criteria are used consistently across all variants.

3.2. Number of Classes for Classification

Hit song prediction has traditionally been framed as a binary classification problem, separating songs into “hits” and “non-hits”. While this approach is easy to implement, it may obscure the finer gradations of popularity. For instance, Martín-Gutiérrez et al. [11] introduced a three-class scheme, grouping tracks into low, mid, and high popularity according to predefined thresholds. In our work, we likewise evaluate a three-class formulation (among others), with class boundaries determined using the following formula:
$$y = \begin{cases} 0, & 0 \le p < 25 \\ 1, & 25 \le p < 65 \\ 2, & 65 \le p \le 100 \end{cases}$$
Compared to the binary classification, this method improves granularity, highlighting distinctions between moderately popular and highly popular songs. However, most songs cluster in the mid-popularity range due to the dataset’s near-normal distribution, limiting the effectiveness of the three-class approach.
To address this shortcoming, we introduce a ten-class classification model, dividing the popularity range into deciles, as follows:
$$y = \min\!\left(9, \left\lfloor \frac{p}{10} \right\rfloor\right), \qquad 0 \le p \le 100$$
In addition, we report an intermediate 5-class labeling (0–19, 20–39, 40–59, 60–79, 80–100) and the corresponding capacity-controlled baselines in Appendix A Table A9. This finer granularity provides a more detailed view of popularity levels; however, it also yields a pronounced class imbalance in the 10-class setting (Figure 3), with middle deciles dominating and extreme tiers being comparatively rare. From an application perspective, finer-grained predictions can be valuable because they differentiate intermediate levels of popularity rather than collapsing them into a single “mid” category.
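The three-class and decile labeling rules above can be expressed compactly as follows (a minimal sketch for integer popularity scores in [0, 100]):

```python
def three_class_label(p: int) -> int:
    """Map popularity p to low (0), mid (1), or high (2) tiers."""
    if p < 25:
        return 0
    return 1 if p < 65 else 2

def decile_label(p: int) -> int:
    """Map popularity p to one of ten deciles; the min() clamps p = 100 into tier 9."""
    return min(9, p // 10)

assert [three_class_label(p) for p in (0, 24, 25, 64, 65, 100)] == [0, 0, 1, 1, 2, 2]
assert [decile_label(p) for p in (0, 9, 10, 55, 99, 100)] == [0, 0, 1, 5, 9, 9]
```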

3.3. Empirical Findings

3.3.1. Performance Comparison of Regression and Classification Models

The evaluation of the models is based on four key performance metrics—macro-precision, macro-recall, macro-F1, and accuracy. Given the class imbalance (Figure 3), we primarily interpret performance using macro-F1; macro-precision and macro-recall are reported as complementary measures. The results indicate that classification models generally outperform regression models in macro-F1 (and also in accuracy) across both the three-class and ten-class configurations (Table 5). Accuracy is computed in the standard form, i.e., the fraction of correctly classified validation samples.
  • Capacity-controlled shallow baselines
To rule out the possibility that the ablation effects are primarily driven by differences in model capacity (i.e., varying parameter counts after removing entire branches), we additionally evaluate capacity-controlled shallow baselines. Concretely, we train the same classifier family on fixed train/validation splits while varying only the available feature subset (metadata-only, audio-only, early fusion, late fusion). The results are summarized in Table 6. In particular, Table 6 includes a metadata-only baseline (artist popularity and release year) to quantify how much predictive signal is captured by contextual features alone.
In addition to the early-fusion shallow baseline (feature concatenation), Table 6 also reports a late-fusion baseline that combines modality-specific predictors at the decision level by weighted averaging of class-score vectors (softmax outputs) of audio-only and metadata-only models. Late fusion yields comparable accuracy but does not outperform early fusion in macro-F1, suggesting that early fusion already captures most of the complementary signal between modalities in this capacity-controlled setting. Attention-based fusion inside the deep network remains a promising extension but would require architectural changes and extensive retraining, which we leave for future work.
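Decision-level (late) fusion as described can be sketched as a weighted average of class-score vectors; the fusion weight here (0.3) is an illustrative value, whereas the paper tunes it on the held-out split:

```python
import numpy as np

def late_fusion(audio_scores: np.ndarray, meta_scores: np.ndarray, w: float = 0.3):
    """Weighted average of two models' softmax outputs, then argmax.
    w is the audio weight; (1 - w) goes to the metadata model."""
    fused = w * audio_scores + (1 - w) * meta_scores
    return fused.argmax(axis=1)

# Two placeholder softmax outputs over three classes
audio = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
meta = np.array([[0.1, 0.2, 0.7], [0.3, 0.4, 0.3]])
preds = late_fusion(audio, meta)
print(preds)  # [2 1]
```

By contrast, the early-fusion baseline concatenates the feature sets before a single classifier is trained.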
Table 6. Capacity-controlled shallow baselines for the 10-class popularity prediction task. The same classifier family is trained on an identical stratified split while varying only the available feature subset (metadata-only, audio-only, early fusion, late fusion). Early fusion concatenates feature sets before classification, whereas late fusion combines the class-score vectors (softmax outputs) of unimodal classifiers via weighted averaging (fusion weight tuned on the held-out split). Unless stated otherwise, metrics are macro-averaged.
| Feature Set | Macro-F1 (%) | Macro-Recall (%) | Macro-Precision (%) | Accuracy (%) | Within ±1 Tier (%) | Tier MAE (↓) | ROC-AUC (Macro, %) | PR-AUC (Macro, %) |
|---|---|---|---|---|---|---|---|---|
| Audio-only | 8.99 | 11.96 | 13.53 | 24.48 | 64.45 | 1.265 | 63.60 | 13.00 |
| Metadata-only | 20.92 | 20.50 | 34.00 | 34.22 | 80.74 | 0.912 | 80.60 | 24.80 |
| Early fusion (audio + meta) | 22.96 | 22.07 | 33.20 | 34.47 | 80.79 | 0.901 | 81.10 | 25.20 |
| Late fusion (weighted class-score vectors) | 22.21 | 21.47 | 34.52 | 34.23 | 80.73 | 0.913 | 80.70 | 24.90 |
| Early fusion (class_weight = balanced) | 20.13 | 30.03 | 17.79 | 23.82 | 63.03 | 1.318 | 79.40 | 19.00 |
↓ indicates that lower values are better.
  • Imbalance-mitigation control
To empirically probe imbalance mitigation in a capacity-controlled setting, we additionally trained an early-fusion baseline with class-weighted training (class_weight = balanced). As shown in Table 6, class weighting increases macro-recall (i.e., improves coverage of rare tiers) but reduces overall accuracy and macro-F1, reflecting the expected trade-off in this strongly long-tailed 10-class setting. Across baselines, the ordinal-aware metrics show that many misclassifications are near misses (e.g., early fusion reaches 80.79% within ±1 tier). However, class weighting reduces within-one-tier accuracy and worsens tier MAE, indicating that improved coverage of rare tiers comes at the cost of ordinal proximity.
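The ordinal-aware metrics referenced here (within-±1-tier accuracy and tier MAE) can be computed as follows (a minimal sketch over decile labels):

```python
import numpy as np

def ordinal_metrics(y_true, y_pred):
    """Within-±1-tier accuracy and tier MAE for integer tier labels 0-9."""
    diff = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    within_one = float((diff <= 1).mean())  # fraction of near-miss-or-better predictions
    tier_mae = float(diff.mean())           # mean absolute tier distance
    return within_one, tier_mae

# Placeholder labels: three near misses (distance <= 1) and one distance-3 error
within_one, tier_mae = ordinal_metrics([3, 5, 9, 0], [4, 5, 6, 1])
print(within_one, tier_mae)  # 0.75 1.25
```

Unlike exact-tier accuracy, these metrics credit predictions that land on an adjacent decile, which is what the near-diagonal confusion structure reflects.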
In addition, because our cross-platform Essentia feature configuration omits speechiness and liveness, we report a sensitivity analysis showing that this omission has a negligible impact in the 3-class task and only a small macro-F1 decrease in the 10-class task (Appendix A Table A8).
  • Three-Class Regression Model
The three-class regression model attains a macro-F1 of 48.30% (precision 74.14%, recall 45.09%), indicating moderate effectiveness in distinguishing the popularity tiers; the principal bottleneck is its limited recall, which the much higher precision cannot offset in the harmonic mean. For completeness, its overall accuracy is 81.55%.
  • Ten-Class Regression Model
In the ten-class setting, the model’s precision (29.27%) and recall (16.62%) are markedly lower than in the three-class scenario; consequently, the F1-score drops to 16.30%, indicating that the model struggles to separate closely related popularity tiers. This reduction is unsurprising given the greater challenge of distinguishing among ten distinct popularity levels. For completeness, the model achieves 35.27% accuracy, which remains well above the 10% random-guessing baseline, suggesting it still captures meaningful associations between the input features and song popularity.
  • Three-Class Classification Model
The three-class classification model yields the strongest overall classification quality, achieving a macro-F1 of 52.20% (precision 74.79%, recall 47.86%) and an accuracy of 82.63%, outperforming the regression-based discretization across all reported metrics.
  • Ten-Class Classification Model
For the ten-class classification model, the approach yields a precision of 33.90%, a recall of 21.84%, and a macro-F1 of 23.15%, improving upon the corresponding ten-class regression results. For completeness, overall accuracy is 37.00%, which also exceeds the ten-class regression model (35.27%). Although classification remains more effective than regression in this setting, overall performance is constrained by the task’s increased complexity, as evidenced by the pronounced drop in recall when moving from three to ten classes.
  • Benchmarking against simple baselines
To contextualize the reported ten-class benchmark, we compare the proposed deep multimodal model to capacity-controlled shallow baselines trained on the same 10-class task (Table 6). The metadata-only baseline already achieves 34.22% accuracy and 20.92% macro-F1, highlighting that release year and artist popularity carry substantial predictive signal. The proposed multimodal model improves to 37.00% accuracy and 23.15% macro-F1 (Table 5), corresponding to a gain of +2.78 percentage points in accuracy and +2.23 points in macro-F1 over metadata-only. Compared to a simple early-fusion baseline, the proposed model yields a +2.53 percentage point accuracy gain, while macro-F1 remains similar, suggesting that much of the performance is driven by a strong metadata signal and that further gains likely require more targeted text/audio modeling and imbalance-aware objectives.
  • Model-to-model variance
The deep multimodal results are currently reported for a single train/validation split and initialization. Prior large-scale hit song prediction studies on similarly sized datasets (~100 k tracks) report that random-seed variation is typically below one percentage point in accuracy [6,11]. To provide an empirical robustness check within the revision constraints, we repeated the capacity-controlled shallow baselines (Table 6) across five random seeds (each defining a stratified 80/20 split and classifier initialization) and report mean ± standard deviation in Appendix A Table A4. The observed variability is low (e.g., early fusion: 35.03% ± 0.32 accuracy and 23.46% ± 0.44 macro-F1), increasing confidence that the main baseline trends are not an artifact of a particular random split. Reporting mean ± standard deviation across repeated runs is also common practice in MIR; for example, ref. [25] repeats CNN experiments across five random initializations and reports the mean and standard deviation on the test set. Full multi-seed retraining and confidence intervals for the deep multimodal models remain important future work.

3.3.2. Ablation Study

To assess the contribution of each feature modality, we performed an ablation study by training four models, each excluding one of the primary feature categories—high-level audio descriptors (Spotify audio features), low-level audio (log-Mel spectrogram CNN input), lyric embeddings (BERT), and metadata (artist popularity, release year). For low-level audio and lyrics, we removed the corresponding model branch; for high-level audio descriptors and metadata, we removed the respective feature subset from the structured numeric input vector before fusion.
For instance, when metadata were excluded, both the release year and artist popularity were eliminated from the input vector while retaining the high-level Spotify audio descriptors. This exclusion led to variations in the number of trainable parameters across models, as different sections of the neural network were effectively pruned. Specifically, the lyric extraction module contained the highest number of parameters (262,656), followed by the spectrogram-based low-level audio feature extraction module (92,736), and finally the structured-numeric branch (high-level audio descriptors + metadata) (5888). Parameter counts, therefore, differ mainly when entire branches are removed (e.g., the audio or lyric branch).
Importantly, however, the metadata ablation removes only two scalar inputs (release year and artist popularity) from the 26-D structured-numeric vector and changes model capacity by just 128 parameters in the first FC layer (454,272 → 454,144 trainable parameters; derived from the model definition). Therefore, the pronounced performance drop when removing metadata cannot be explained by reduced model complexity. Conversely, removing the full lyric branch reduces the parameter count substantially, yet performance remains comparable (Table 7), further indicating that parameter-count differences are not the primary driver.
To further disentangle modality signal from potential capacity effects, we report capacity-controlled shallow baselines (Table 6), where the same classifier family is trained on an identical split while varying only the feature subset (metadata-only, audio-only, and simple early/late fusion). These results corroborate the modality trend under fixed model capacity: metadata-only performs substantially better than audio-only, and early fusion improves over single-modality baselines. Together with the negligible capacity change under metadata removal in the deep ablation (Δ128 parameters), this supports that the pronounced degradation observed when removing metadata reflects substantial predictive signal in release year and artist popularity rather than a capacity confound.
All models adhered to the same baseline architecture and training configuration as the reference model. The ten-class classification model was selected as the baseline due to its superior granularity in predictive categorization and its demonstrated advantage over regression-based approaches. To ensure fair comparability, training and validation splits were kept consistent across all ablated models, and hyperparameters—including learning rate, batch size, and number of epochs—were held constant throughout the experimental setup.
Table 7 presents macro-precision, macro-recall, macro-F1, and accuracy for each ablated model alongside the full model. The model trained without metadata exhibited the most pronounced decline in predictive performance, with macro-F1 dropping from 23.15% to 13.65% and macro-recall dropping from 21.84% to 14.21% (Table 7), underscoring the critical role of metadata for class-balanced performance. For completeness, accuracy shows the same overall pattern (37.00% → 27.11%).
Notably, removing the lyric-embedding branch yields comparable performance to the full model (Table 7), suggesting that the current frozen lyric representations provide only a marginal incremental signal in this setting. A plausible explanation is that lyrics are noisy/incomplete for a subset of tracks and are truncated to 512 tokens, while the encoder is not fine-tuned for popularity prediction. We therefore consider lyric-model fine-tuning and improved long-sequence handling as promising directions for future work.
Removing low-level audio or high-level audio descriptors changes the macro-F1/recall only marginally, whereas removing metadata causes a large drop across macro-F1, precision, and recall, indicating that metadata is the dominant driver of robust class-wise performance; accuracy is reported as a secondary metric and shows the same overall pattern.
While macro-F1 and macro-recall exhibited only marginal reductions for most modality removals, variations in macro-precision were more pronounced. Notably, the model excluding high-level audio descriptors shows a higher macro-precision than other variants (Table 7), which may reflect a precision–recall trade-off and class-imbalance effects rather than a consistent gain. We therefore interpret this result cautiously and prioritize macro-F1 and macro-recall as the primary class-balanced indicators.
To corroborate the ten-class ablation, we ran an additional three-class ablation as a robustness check. Because the goal was confirmation rather than exhaustive hyper-parameter tuning, each model was fine-tuned for five epochs—a pragmatic shortcut often used in rapid ablation studies once the convergence profile of the architecture has been established [10,11]. This compact training schedule preserves GPU resources and turnaround time while still capturing the relative ranking of feature modalities.
Table A2 in the Appendix A summarizes the macro-averaged precision, recall and F1-score. The pattern mirrors the ten-class experiment: removing metadata produces by far the steepest decline in performance, whereas dropping high-level audio descriptors, lyrics, or low-level spectrogram features causes only marginal losses. The full multimodal model continues to yield the highest overall scores.
Figure 4 visualizes the effect in terms of accuracy deltas; the same ranking and conclusion are reflected in the class-balanced metrics (macro-F1 and macro-recall) reported in Table 7. The steepest degradation (−9.9 percentage points relative to the full model) appears when metadata are withheld. This behavior is expected for two reasons. First, the artist-popularity score is a strong proxy for an existing fan base; prior work shows that past chart success explains a large share of future streams on Spotify [26,27]. Second, the release-year indicator embeds every track in its historical context and lets the network learn decade-specific production styles and listening fashions, which are known to shift rapidly in pop music [28]. Together, these two variables capture external, market-side forces that audio and lyric content alone cannot convey. Hence, when the model loses access to metadata, it forfeits direct information about both fan-base size and temporal trend alignment, leading to the largest performance drop observed in the ablation study.
Overall, the findings from this ablation study confirm that metadata plays a pivotal role in predictive modeling within the domain of HSS. The strong predictive power of artist popularity and temporal context (release year) aligns with prior research indicating that external contextual factors often exert a stronger influence on commercial success than intrinsic audio or lyrical characteristics alone [26,27]. While other feature types remain valuable, their individual contributions appear less pronounced when metadata is available, highlighting the need for further exploration into feature interactions and potential redundancies in multimodal predictive frameworks.

4. Discussion

This study advances hit song prediction in several ways. First, we introduce an end-to-end model that fuses spectrogram-based CNN embeddings, BERT lyric representations, and contextual metadata in a single early fusion pipeline, bringing together three information sources that are rarely used together at this scale. Second, we provide a ten-class decile benchmark on SpotGenTrack, illustrating how performance scales as label granularity increases. Third, the capacity-controlled baselines indicate that early fusion slightly outperforms late fusion under the same protocol, supporting our early-fusion design choice beyond computational efficiency.
Overall, DeepHits demonstrates strong class-wise robustness in the established three-class setting, achieving a macro-F1 of 52.20% (macro-precision 74.79%, macro-recall 47.86%). For completeness, the corresponding accuracy is 82.63%, which is in the same range as the three-class results reported in related work on SpotGenTrack [11]; however, due to differences in preprocessing and evaluation protocols, this comparison should be interpreted as contextual rather than a strict head-to-head.
In the ten-class setting, DeepHits provides a more fine-grained decile benchmark and attains a macro-F1 of 23.15% (macro-precision 33.90%, macro-recall 21.84%). The overall accuracy is 37.00%, well above the 10% chance level, indicating that the multimodal model captures meaningful structure even when distinguishing among adjacent popularity deciles. Importantly, this performance exceeds strong capacity-controlled baselines within our baseline evaluation framework, which contextualizes the benchmark beyond a single accuracy figure.
While prior studies such as [11] report strong results on related popularity prediction settings, they typically formulate the problem using different label definitions (e.g., three coarse popularity classes) and evaluation protocols. Therefore, we refrain from claiming direct head-to-head comparability for the ten-class decile setup. Instead, we provide capacity-controlled baselines (Table 6) and deep ablations (Table 7) under a fixed protocol to establish a reproducible reference point for future work.
Achieving strong results in the ten-class scenario is inherently harder than in the three-class task, because the network must recognize fine-grained differences in popularity instead of simply separating hits from non-hits. A steep drop in accuracy is therefore expected; neighboring deciles have almost identical audio and lyric profiles, so the decision boundaries blur as the number of classes grows. In addition, the popularity distribution is long-tailed, which is a well-known pattern in MIR datasets. The extreme deciles (below the 10th and above the 90th percentile) contain only a few hundred tracks, whereas the middle deciles dominate the training set. This imbalance pulls the optimizer toward the dense center and lowers recall on the rare edge classes. Taken together, feature overlap and class-size imbalance explain most of the 45-percentage-point gap between the three- and ten-class accuracies. A similar degradation is observed in the class-balanced metric macro-F1 (52.20% → 23.15%), reinforcing that fine-grained tier prediction is challenging even when evaluated beyond accuracy. These observations suggest that future work should explore ordinal-aware or cost-sensitive training schemes to soften this effect.
To visualize where the ten-class classifier succeeds and where it still struggles, Figure 5 plots the normalized confusion matrix on the validation set.
The confusion matrix exhibits a clear near-diagonal structure: for each true tier, the highest mass lies on the correct tier or adjacent tiers, and large jumps across many tiers are rare. This indicates that the classifier largely respects the ordinal structure of popularity and that many errors are near-misses rather than gross misclassifications. Consistent with this pattern, the ordinal-aware metrics (within ±1 tier accuracy and tier MAE) suggest that predictions are often close to the correct tier even when exact-tier accuracy remains modest. The outermost tiers remain challenging, as the very lowest and very highest groups contain comparatively few training examples and differ only subtly from neighboring tiers. Overall, the class-by-class view highlights the value of the ten-tier scheme: unlike a coarse three-tier setup, it preserves intermediate levels of success and reveals where misclassifications concentrate along the ordinal axis.
Figure 6 and Figure 7 complement the analysis by disentangling the optimization dynamics of the ten-class classifier into loss- and accuracy-specific trends. As shown in Figure 6, both training and validation loss decrease sharply within the first ten epochs before reaching a clear plateau, indicating that most performance gains are realized early in training. Figure 7 mirrors this behavior: accuracy rises quickly and then stabilizes slightly below 0.38, consistent with the observed loss floor. The persistent gap between the training and validation curves, together with the relatively high loss floor and early plateau, suggests that performance is constrained by task difficulty and optimization limits (e.g., strong class imbalance and fine-grained tier separation) rather than by severe overfitting. Overall, the curves indicate early saturation, motivating future work on imbalance-aware objectives and/or richer text/audio modeling. Accordingly, we did not rely on early stopping in the final experiments and instead trained for a fixed number of epochs to ensure strict comparability across model variants. Future iterations may reintroduce early stopping primarily for computational efficiency, alongside additional capacity or stronger class-balancing strategies.
Table 8 compares our DeepHits model with the most directly comparable recent systems. Martín-Gutiérrez et al. [11] report 83.02% accuracy for the three-class SpotGenTrack setting but do not report macro-F1; we obtain a comparable accuracy (82.63%) and additionally report macro-F1 (52.20%). Because preprocessing and evaluation protocols differ across studies, this comparison should be interpreted as contextual rather than a direct head-to-head. To our knowledge, no prior work reports a ten-class decile formulation on SpotGenTrack; we therefore provide a reproducible ten-class benchmark under a fixed protocol.

5. Conclusions

5.1. Summary

Our findings underscore the value of a multimodal approach to hit song prediction. Unlike earlier studies that treated audio, lyrics, or metadata in isolation, DeepHits fuses (i) CNN-derived log-Mel spectrogram features, (ii) multilingual BERT lyric embeddings, and (iii) high-level audio descriptors together with compact contextual metadata (release year and artist popularity) in a single end-to-end early-fusion network.
Evaluated on 92,517 tracks from the SpotGenTrack dataset, DeepHits achieves a macro-F1 of 52.20% (accuracy 82.63%) in the established three-class setting. In addition, we provide a ten-class decile benchmark on this dataset, reaching a macro-F1 of 23.15% (accuracy 37.00%), well above the 10% random-guessing baseline. To contextualize this benchmark beyond a single accuracy figure, we report a suite of capacity-controlled baselines (Table 6); for example, a metadata-only baseline reaches 20.92% macro-F1 (34.22% accuracy), and simple fusion baselines remain below the deep multimodal model.
A key empirical insight comes from the ablation study: removing metadata (artist popularity and release year) produces the steepest performance drop, with ten-class macro-F1 decreasing from 23.15% to 13.65% (and accuracy from 37.00% to 27.11%). This result supports the view that commercial outcomes are strongly shaped by extrinsic context and dissemination dynamics, including social influence and network effects [26] as well as broader structural patterns observed in chart success [27]. At the same time, the relevance of release-year information is consistent with evidence that musical trends—and the predictability of success—shift over time [28].

5.2. Limitations

Despite encouraging performance, several limitations remain. First, popularity is time-dependent; training on a fixed historical window (ending in 2020) risks temporal drift as production styles and listener preferences evolve, as documented in large-scale analyses of musical trends [28] and discussions of changing aesthetic cultures [29]. Second, the ten-class task is affected by class imbalance and feature overlap between adjacent popularity tiers, which can reduce recall for rare extreme classes. As a lightweight check, we additionally evaluated class weighting in a capacity-controlled shallow baseline (Table 6), which improved recall for the rare extreme tiers but decreased overall accuracy and macro-F1. This suggests that imbalance-aware objectives (e.g., focal loss, ordinal objectives, or cost-sensitive training) are promising directions for future work. Third, our inputs are constrained by 30 s audio previews and by frozen (non-fine-tuned) lyric embeddings with a fixed 512-token context window (and imperfect lyric availability), which may miss longer-range structure and task-specific linguistic nuance. Finally, our deep multimodal results are reported for a single split/seed, so we do not yet provide confidence intervals for the metrics.
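The class-weighting check mentioned above follows the common inverse-frequency ("balanced") heuristic; a minimal sketch, assuming scikit-learn-style weights w_c = n_samples / (n_classes · count_c):

```python
def balanced_class_weights(labels, n_classes):
    """Inverse-frequency ('balanced') class weights:
    w_c = n_samples / (n_classes * count_c), so rare tiers are up-weighted."""
    n = len(labels)
    counts = [labels.count(c) for c in range(n_classes)]
    return [n / (n_classes * cnt) if cnt else 0.0 for cnt in counts]

# Toy long-tailed distribution over 4 popularity tiers:
labels = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
weights = balanced_class_weights(labels, 4)
# The rarest tier receives the largest loss weight: 100 / (4 * 5) = 5.0
```

In a PyTorch setup such weights would typically be passed to the loss, e.g. `CrossEntropyLoss(weight=...)`, which produces exactly the accuracy-versus-rare-tier-recall trade-off observed in the baseline.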
Our preprocessing filters remove highly speech-dominated content and very long/very short tracks to focus the task on songs with consistent 30 s audio inputs. While appropriate for hit-song prediction, this may limit direct applicability to spoken-word content (e.g., podcasts/audiobooks) and certain long-form genres; future work should relax these constraints and/or use longer audio segments for such domains.
Our results are currently reported on a fixed train/validation split, which may not be fully representative under strong class imbalance and long-tail effects. Future work should therefore include split-robust evaluation, e.g., k-fold cross-validation or repeated stratified train/validation splits, to report confidence intervals and quantify variability across data partitions. This would provide a more comprehensive estimate of generalization performance beyond a single split.
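The proposed split-robust protocol can be sketched as follows (illustrative only; `evaluate` is a hypothetical callback that trains and scores a model on the given index sets):

```python
import random
import statistics

def repeated_stratified_scores(labels, evaluate, n_repeats=5,
                               val_frac=0.2, seed=0):
    """Repeated stratified train/validation splits: each class is shuffled
    and split separately so tier proportions are preserved; the metric is
    collected per repeat so it can be reported as mean +/- std."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train_idx, val_idx = [], []
        for c in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == c]
            rng.shuffle(idx)
            cut = int(round(len(idx) * (1 - val_frac)))
            train_idx += idx[:cut]
            val_idx += idx[cut:]
        scores.append(evaluate(train_idx, val_idx))
    return statistics.mean(scores), statistics.stdev(scores)
```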

5.3. Future Work

Future work will (i) run multi-seed replications to report confidence intervals; (ii) explore ordinal- and imbalance-aware training (e.g., cost-sensitive or ordinal losses) to better reflect the ordered nature of popularity tiers; and (iii) add post hoc explainability (e.g., SHAP values and integrated gradients) to quantify how time–frequency regions, lyric segments, and metadata fields drive predictions.
Beyond this, we plan to incorporate additional pre-release contextual signals that are not captured by our current metadata. This includes social influence and network-related factors that have been shown to matter for popularity prediction [26] and complementary market-facing signals such as social media indicators [30] or neuro-/physiological responses [31]. Finally, we will evaluate adaptation strategies (online/transfer learning) to reduce temporal drift and test on more culturally diverse benchmarks, motivated by evidence that aesthetic cultures evolve and differ across contexts [29].

Author Contributions

Conceptualization: M.N. and V.N.; methodology: V.N.; software: V.N.; validation: V.N.; formal analysis: V.N.; investigation: V.N.; resources: M.N. and O.H.; data curation: V.N.; writing—original draft preparation: M.N.; writing—review and editing: M.N., V.N., and O.H.; visualization: M.N. and V.N.; supervision: M.N.; project administration: M.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Dataset available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Per-tier recall for the early-fusion shallow baseline in the 10-class setting, comparing unweighted training and class-weighted training (balanced). Class weighting improves recall in the rare extreme tiers (8–9) at the cost of mid-tier recall, illustrating the accuracy–equity trade-off under strong class imbalance.
Figure A2. Distribution of lyric lengths (word counts) for tracks with available lyrics in the filtered dataset. The dashed vertical line indicates 512 words as a conservative proxy for the 512-token context limit used by BERT; values above 1500 words are clipped for visualization.
Table A1. Detailed architecture of DeepHits.
| Number | Layer (Branch) | Kernel/Units | Stride | Padding | Output Shape * | Number of Parameters |
|---|---|---|---|---|---|---|
| | Low-level-Audio-CNN (Mel-Spec 1 × 128 × 1292) | | | | | |
| 1 | Conv2d 1→8 | 5 × 5 | 2 | 2 | 8 × 64 × 646 | 208 |
| 2 | BN + ReLU | | | | 8 × 64 × 646 | 16 |
| | Residual Block 1 | | | | | |
| 3 | Conv2d 8→16 | 3 × 3 | 2 | 1 | 16 × 32 × 323 | 1152 |
| 4 | BN + ReLU | | | | 16 × 32 × 323 | 32 |
| 5 | Conv2d 16→16 | 3 × 3 | 1 | 1 | 16 × 32 × 323 | 2304 |
| 6 | BN | | | | 16 × 32 × 323 | 32 |
| 7 | Shortcut 1 × 1 Conv 8→16 | 1 × 1 | 2 | 0 | 16 × 32 × 323 | 128 |
| 8 | Shortcut BN | | | | 16 × 32 × 323 | 32 |
| | Residual Block 2 | | | | | |
| 9 | Conv2d 16→32 | 3 × 3 | 2 | 1 | 32 × 16 × 162 | 4608 |
| 10 | BN + ReLU | | | | 32 × 16 × 162 | 64 |
| 11 | Conv2d 32→32 | 3 × 3 | 1 | 1 | 32 × 16 × 162 | 9216 |
| 12 | BN | | | | 32 × 16 × 162 | 64 |
| 13 | Shortcut 1 × 1 Conv 16→32 | 1 × 1 | 2 | 0 | 32 × 16 × 162 | 512 |
| 14 | Shortcut BN | | | | 32 × 16 × 162 | 64 |
| | Residual Block 3 | | | | | |
| 15 | Conv2d 32→64 | 3 × 3 | 2 | 1 | 64 × 8 × 81 | 18,432 |
| 16 | BN + ReLU | | | | 64 × 8 × 81 | 128 |
| 17 | Conv2d 64→64 | 3 × 3 | 1 | 1 | 64 × 8 × 81 | 36,864 |
| 18 | BN | | | | 64 × 8 × 81 | 128 |
| 19 | Shortcut 1 × 1 Conv 32→64 | 1 × 1 | 2 | 0 | 64 × 8 × 81 | 2048 |
| 20 | Shortcut BN | | | | 64 × 8 × 81 | 128 |
| 21 | AdaptiveAvgPool2d | | | | 64 × 1 × 1 | |
| 22 | FC 64→128 | | | | 128 | 8320 |
| 23 | ReLU | | | | 128 | |
| 24 | FC 128→64 | | | | 64 | 8256 |
| | High-level + Metadata Branch (26-D) | | | | | |
| 25 | FC 26→64 | | | | 64 | 1728 |
| 26 | ReLU | | | | 64 | |
| 27 | FC 64→64 | | | | 64 | 4160 |
| | Lyrics-Branch (BERT CLS 768) | | | | | |
| 28 | FC 768→256 | | | | 256 | 196,864 |
| 29 | ReLU | | | | 256 | |
| 30 | FC 256→256 | | | | 256 | 65,792 |
| | Fusion and Prediction Head (64 + 64 + 256 = 384) | | | | | |
| 31 | Dropout (p = 0.5) | | | | 384 | |
| 32 | FC 384→256 | | | | 256 | 98,560 |
| 33 | FC 256→64 | | | | 64 | 16,448 |
| 34 | Output layer (3-Class) | 64→3 | | | 3 | 195 |
* Conv-Tensor-Form: Channels × Height × Width; Vector: Feature length; all convolutions use Kaiming He initialization and ReLU activation; BN = batch normalization. Activation layers after the last FC in each branch are omitted for brevity.
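The parameter counts in Table A1 follow the standard layer formulas and can be spot-checked directly; in the sketch below, residual-block convolutions are assumed bias-free, which is consistent with the reported counts:

```python
def conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """Conv2d: one k_h x k_w kernel per (input, output) channel pair, plus biases."""
    return c_out * c_in * k_h * k_w + (c_out if bias else 0)

def bn_params(channels):
    """BatchNorm2d learns a scale (gamma) and a shift (beta) per channel."""
    return 2 * channels

def fc_params(n_in, n_out):
    """Fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

# Spot-checks against Table A1:
assert conv2d_params(1, 8, 5, 5) == 208                # layer 1 (stem conv)
assert conv2d_params(8, 16, 3, 3, bias=False) == 1152  # layer 3 (residual conv)
assert bn_params(8) == 16                              # layer 2 (BN)
assert fc_params(384, 256) == 98560                    # layer 32 (fusion FC)
assert fc_params(64, 3) == 195                         # layer 34 (output head)
```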
Table A2. Macro-precision, macro-recall, and macro-F1 of ablated 3-class models and the base model using all features.
| Model | Macro-Precision (%) | Macro-Recall (%) | Macro-F1 (%) | Accuracy (%) |
|---|---|---|---|---|
| Without high-level audio descriptors (Spotify features) | 74.05 | 48.21 | 52.58 | 82.51 |
| Without lyric embeddings (BERT) | 75.25 | 47.79 | 50.79 | 82.26 |
| Without low-level audio (Mel-spectrogram CNN) | 77.37 | 47.51 | 50.83 | 82.54 |
| Without metadata (artist popularity, release year) | 50.09 | 35.17 | 33.06 | 79.07 |
| All features | 74.79 | 47.86 | 52.20 | 82.63 |
Table A3. Lyrics availability and length statistics in the filtered SpotGenTrack dataset used for training and evaluation. Word counts are computed via whitespace tokenization and serve as a conservative proxy for BERT’s 512-token context limit.
| Statistic | Value |
|---|---|
| Tracks after filtering (speechiness ≤ 0.66 and duration ≤ 7 min) | 92,517 |
| Tracks with missing lyrics (placeholder) | 13,620 |
| Missing lyrics (%) | 14.68 |
| Median lyric length (words; non-missing) | 254 |
| 95th percentile lyric length (words; non-missing) | 583 |
| 99th percentile lyric length (words; non-missing) | 841 |
| Lyrics exceeding 512 words (%) (word-count proxy; non-missing) | 7.81 |
| Lyrics exceeding 512 words (%) in the top two tiers (8–9; word-count proxy; non-missing) | 19.83 |
Table A4. Multi-seed stability of capacity-controlled shallow baselines for the 10-class task (mean ± std over 5 stratified 80/20 splits; same classifier family and feature sets as in Table 5).
| Model | Macro-F1 (%) | Macro-Recall (%) | Macro-Precision (%) | Accuracy (%) | Within ±1 Tier (%) | Tier MAE (↓) |
|---|---|---|---|---|---|---|
| Audio-only | 7.30 ± 0.21 | 10.67 ± 0.32 | 9.17 ± 0.25 | 24.11 ± 0.36 | 63.45 ± 0.67 | 1.351 ± 0.009 |
| Metadata-only | 22.21 ± 0.72 | 21.10 ± 0.69 | 34.45 ± 0.56 | 34.16 ± 0.31 | 79.74 ± 0.40 | 0.927 ± 0.006 |
| Early fusion (audio+meta) | 23.46 ± 0.44 | 21.83 ± 0.38 | 30.68 ± 0.37 | 35.03 ± 0.32 | 82.33 ± 0.42 | 0.864 ± 0.007 |
| Late fusion (w_meta = 0.9) | 22.30 ± 0.65 | 20.95 ± 0.63 | 32.50 ± 0.50 | 35.12 ± 0.37 | 80.05 ± 0.47 | 0.917 ± 0.007 |
↓ indicates that lower values are better.
Table A5. Dataset filtering criteria and impact on the SpotGenTrack corpus. Removed percentages are reported w.r.t. the raw corpus size (N = 101,939).
| Step | Filter | Removed n | Removed % | Remaining n |
|---|---|---|---|---|
| Raw dataset | – | 0 | 0.00 | 101,939 |
| (1) Speech filter | speechiness > 0.66 | 5362 | 5.26 | 96,577 |
| (2) Long-track filter | duration_ms > 420,000 (7 min) | 3792 | 3.72 | 92,785 |
| (3) Short-track filter | duration_ms < 30,000 (30 s) | 268 | 0.26 | 92,517 |
Table A6. Most frequent Spotify artist genres among tracks removed by the speechiness filter (speechiness > 0.66). Percentages denote the share of removed tracks tagged with the respective genre.
| Genre | Count | Share of Tracks (%) |
|---|---|---|
| Radio play | 1218 | 22.7 |
| Guidance | 270 | 5.0 |
| Children's audio stories | 161 | 3.0 |
| Drama | 159 | 3.0 |
| Poetry | 138 | 2.6 |
| Comedy | 126 | 2.3 |
| Comic | 104 | 1.9 |
| Reading | 100 | 1.9 |
| Children's music | 91 | 1.7 |
| German poetry | 72 | 1.3 |
Table A7. Sensitivity of the speechiness threshold. We report the number of tracks removed by the speech filter and the resulting final dataset size for alternative cutoffs.
| Speechiness Threshold | Removed by Speech n | Removed by Speech % | Final Dataset | Final Dataset % | Rap Tracks Removed by Speech n | Rap Share Within Speech-Removed % |
|---|---|---|---|---|---|---|
| 0.6 | 5539 | 5.43 | 92,340 | 90.58 | 68 | 1.23 |
| 0.66 | 5362 | 5.26 | 92,517 | 90.76 | 52 | 0.97 |
| 0.7 | 5245 | 5.15 | 92,634 | 90.87 | 43 | 0.82 |
Table A8. Sensitivity analysis for missing Spotify descriptors that are not reproduced in our Essentia-based cross-platform feature configuration (speechiness and liveness). We report changes (Δ) in percentage points relative to the corresponding early-fusion shallow baseline using all descriptors.
| Task | Removed Descriptors | Δ Accuracy (pp) | Δ Macro-F1 (pp) |
|---|---|---|---|
| 3-class | w/o speechiness | −0.05 | −0.06 |
| 3-class | w/o liveness | −0.06 | −0.04 |
| 3-class | w/o speechiness & liveness | −0.05 | −0.04 |
| 10-class | w/o speechiness | 0.01 | −1.09 |
| 10-class | w/o liveness | 0.04 | 0.02 |
| 10-class | w/o speechiness & liveness | −0.05 | −1.20 |
Table A9. Shallow baselines for an intermediate 5-class popularity labeling (bins: 0–19, 20–39, 40–59, 60–79, 80–100). We report metadata-only, audio-only, early fusion, and late fusion baselines; metrics are macro-averaged (AUCs: one-vs-rest).
| Model | Accuracy (%) | Macro-Precision (%) | Macro-Recall (%) | Macro-F1 (%) | ROC-AUC (OVR, Macro, %) | PR-AUC (Macro, %) |
|---|---|---|---|---|---|---|
| Metadata-only | 58.80 | 50.11 | 38.14 | 40.73 | 83.08 | 45.84 |
| Audio-only | 43.48 | 27.43 | 21.27 | 16.63 | 60.74 | 23.70 |
| Early fusion (audio+meta) | 58.69 | 49.00 | 37.71 | 40.10 | 83.45 | 46.20 |
| Late fusion (weighted class-score vectors) | 57.32 | 51.81 | 31.19 | 30.56 | 82.61 | 45.97 |

References

  1. Seufitelli, D.B.; Oliveira, G.P.; Silva, M.O.; Scofield, C.; Moro, M.M. Hit song science: A comprehensive survey and research directions. J. New Music Res. 2023, 52, 41–72. [Google Scholar] [CrossRef]
  2. Pachet, F.; Roy, P. Hit Song Science Is Not Yet a Science. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008), Philadelphia, PA, USA, 14–18 September 2008; pp. 355–360. [Google Scholar]
  3. Bayley, J. IFPI Global Music Report: Global Recorded Music Revenues Grew 10.2% in 2023—IFPI. IFPI Report 2024. Available online: https://www.ifpi.org/ifpi-global-music-report-global-recorded-music-revenues-grew-10-2-in-2023 (accessed on 25 March 2025).
  4. Dhanaraj, R.; Logan, B. Automatic prediction of hit songs. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR), London, UK, 11–15 September 2005; pp. 488–491. [Google Scholar] [CrossRef]
  5. Zhao, M.; Harvey, M.; Cameron, D.; Hopfgartner, F.; Gillet, V.J. An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features. In Proceedings of the International Conference on Information (iConference 2023), Barcelona, Spain, 13–29 March 2023; pp. 303–311. [Google Scholar] [CrossRef]
  6. Yang, L.-C.; Chou, S.-Y.; Liu, J.-Y.; Yang, Y.-H.; Chen, Y.-A. Revisiting the problem of audio-based hit song prediction using convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 621–625. [Google Scholar] [CrossRef]
  7. Choi, K.; Fazekas, G.; Sandler, M.; Cho, K. Convolutional Recurrent Neural Networks for Music Classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2392–2396. [Google Scholar]
  8. Oramas, S.; Nieto, O.; Sordo, M.; Serra, X. A deep multimodal approach for cold-start music recommendation. In Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017), Como, Italy, 27 August 2017; pp. 32–37. [Google Scholar] [CrossRef]
  9. Zangerle, E.; Vötter, M.; Huber, R.; Yang, Y.-H. Hit Song Prediction: Leveraging Low- and High-Level Audio Features. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR 2019), Delft, The Netherlands, 4–8 November 2019; pp. 319–326. [Google Scholar] [CrossRef]
  10. Vavaroutsos, P.; Vikatos, P. HSP-TL: A Deep Metric Learning Model with Triplet Loss for Hit Song Prediction. In Proceedings of the European Signal Processing Conference (EUSIPCO), Helsinki, Finland, 4–8 September 2023; pp. 146–150. [Google Scholar] [CrossRef]
  11. Martín-Gutiérrez, D.; Hernández Peñaloza, G.; Belmonte-Hernández, A.; Álvarez García, F. A multimodal end-to-end deep learning architecture for music popularity prediction. IEEE Access 2020, 8, 39361–39374. [Google Scholar] [CrossRef]
  12. Yu, L.-C.; Yang, Y.-H.; Hung, Y.-N.; Chen, Y.-A. Hit song prediction for pop music by siamese CNN with ranking loss. arXiv 2017, arXiv:1710.10814. [Google Scholar] [CrossRef]
  13. Delbouys, R.; Hennequin, R.; Piccoli, F.; Royo-Letelier, J.; Moussallam, M. Music mood detection based on audio and lyrics with deep neural net. arXiv 2018, arXiv:1809.07276. [Google Scholar] [CrossRef]
  14. Martín-Gutiérrez, D.; Hernández Peñaloza, G.; Belmonte-Hernández, A.; Álvarez García, F. SpotGenTrack Popularity Dataset. Available online: https://data.mendeley.com/datasets/4m2x4zngny/1 (accessed on 25 June 2025).
  15. Alonso-Jiménez, P.; Bogdanov, D.; Pons, J.; Serra, X. Tensorflow audio models in Essentia. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar] [CrossRef]
  16. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  17. Yang, D.; Lee, W.-S. Music emotion identification from lyrics. In Proceedings of the 11th IEEE International Symposium on Multimedia (ISM 2009), San Diego, CA, USA, 14–16 December 2009; pp. 624–629. [Google Scholar] [CrossRef]
  18. McVicar, M.; Di Giorgi, B.; Dundar, B.; Mauch, M. Lyric document embeddings for music tagging. arXiv 2021, arXiv:2112.11436. [Google Scholar] [CrossRef]
  19. Akalp, H.; Cigdem, E.F.; Yilmaz, S.; Bolucu, N.; Can, B. Language representation models for music genre classification using lyrics. In Proceedings of the 2021 International Symposium on Electrical, Electronics and Information Engineering (ISEEIE 2021), Seoul, Republic of Korea, 19–21 February 2021; pp. 408–414. [Google Scholar] [CrossRef]
  20. Spotify for Developers. Get Track’s Audio Features (Web API Reference). Available online: https://developer.spotify.com/documentation/web-api (accessed on 9 February 2026).
  21. Oramas, S.; Barbieri, F.; Nieto, O.; Serra, X. Multimodal deep learning for music genre classification. Trans. Int. Soc. Music Inf. Retr. 2018, 1, 4–21. [Google Scholar] [CrossRef]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015; pp. 1026–1034. [Google Scholar] [CrossRef]
  23. Smith, L.N.; Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. In Proceedings of the SPIE Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Baltimore, MD, USA, 15–17 April 2019; Volume 11006, pp. 369–386. [Google Scholar] [CrossRef]
  24. Vujović, Ž. Classification model evaluation metrics. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 599–606. [Google Scholar] [CrossRef]
  25. Prinz, K.; Flexer, A.; Widmer, G. The Impact of Label Noise on a Music Tagger. arXiv 2020, arXiv:2008.06273. [Google Scholar] [CrossRef]
  26. Reisz, N.; Servedio, V.D.; Thurner, S. Quantifying the impact of homophily and influencer networks on song popularity prediction. Sci. Rep. 2024, 14, 8929. [Google Scholar] [CrossRef] [PubMed]
  27. Gourévitch, B. Billboard 200: The lessons of musical success in the US. Music Sci. 2023, 6, 1–13. [Google Scholar] [CrossRef]
  28. Interiano, M.; Kazemi, K.; Wang, L.; Yang, J.; Yu, Z.; Komarova, N.L. Musical trends and predictability of success in contemporary songs in and out of the top charts. R. Soc. Open Sci. 2018, 5, 171274. [Google Scholar] [CrossRef] [PubMed]
  29. Sinclair, N.C.; Ursell, J.; South, A.; Rendell, L. From Beethoven to Beyoncé: Do changing aesthetic cultures amount to “cumulative cultural evolution”? Front. Psychol. 2022, 12, 663397. [Google Scholar] [CrossRef] [PubMed]
  30. Tsiara, E.; Tjortjis, C. Using Twitter to predict chart position for songs. In Artificial Intelligence Applications and Innovations, IFIP AIAI 2020; Springer: Cham, Switzerland, 2020; Volume 583, pp. 62–72. [Google Scholar] [CrossRef]
  31. Merritt, S.H.; Gaffuri, K.; Zak, P.J. Accurately predicting hit songs using neurophysiology and machine learning. Front. Artif. Intell. 2023, 6, 1154663. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Distribution of tracks per artist and per album in the SpotGenTrack dataset (bucketed).
Figure 2. From raw audio, lyrics, and metadata to prediction.
Figure 3. Class distributions of the popularity labels after filtering (speechiness ≤ 0.66; duration ≤ 7 min) on SpotGenTrack. (a) 3-class labeling (low/medium/high). (b) 10-class labeling (deciles). The 10-class setting is highly imbalanced (long-tailed), motivating macro-averaged metrics and imbalance-aware evaluation.
Figure 4. Accuracy loss when each modality is omitted.
Figure 5. Confusion matrix for the classification task on the validation split. Rows denote the ground-truth popularity classes and columns the predicted classes. Diagonal cells indicate correct predictions, while off-diagonal cells represent misclassifications (often between adjacent popularity tiers).
Figure 6. Ten-Class Loss Curves.
Figure 7. Ten-Class Accuracy Curves.
Table 2. Dataset at a glance. Dataset size: 92,517 tracks. We evaluate 3–class and 10–class classification, and additionally a continuous 0–100 regression on the same dataset.
| Modality | Raw Representation | Number of Features per Track |
|---|---|---|
| Low-level audio (log-Mel spectrograms) | 30 s preview → log-Mel spectrogram (22,050 Hz, 1024-pt FFT, 128 Mel bands, hop = 512) | 128 bins × 1292 frames = 165,376 coefficients |
| Lyrics | Full lyrics → multilingual BERT, CLS embedding | 768-dimensional CLS vector |
| High-level audio descriptors (Spotify audio features) | Spotify descriptors (danceability, energy, valence, etc.) | 12 descriptors |
| Metadata | Artist popularity, release year | 2 features |
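The 128 × 1292 input shape in Table 2 follows from the stated analysis settings; a quick sanity check, assuming librosa-style STFT framing with center padding (frames = n_samples // hop + 1):

```python
# Assumed settings from Table 2; the frame count assumes a center-padded STFT.
sample_rate = 22050      # Hz
duration_s = 30          # seconds of preview audio
hop_length = 512         # samples between STFT frames
n_mels = 128             # Mel bands

n_samples = sample_rate * duration_s       # 661,500 samples
n_frames = n_samples // hop_length + 1     # 1292 frames
n_coefficients = n_mels * n_frames         # 165,376 coefficients per track
```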
Table 3. High-level audio descriptors (Spotify audio features) and their descriptions from the Spotify API.
| Audio Feature | Description |
|---|---|
| Acousticness | Confidence measure (0.0–1.0) of whether the track is acoustic; 1.0 indicates high confidence. |
| Danceability | How suitable a track is for dancing (0.0–1.0). |
| Duration | Duration of the track in milliseconds. |
| Energy | Perceptual measure of intensity and activity (0.0–1.0). |
| Instrumentalness | Likelihood that the track contains no vocals (0.0–1.0). |
| Key | The key the track is in; integers map to pitches using standard pitch class notation. If no key was detected, the value is −1 (range −1 to 11). |
| Liveness | Detects the presence of an audience (0.0–1.0); values above 0.8 suggest live recordings. |
| Loudness | Overall loudness in decibels (dB), typically between −60 and 0 dB. |
| Mode | Modality: 1 = major, 0 = minor. |
| Speechiness | Presence of spoken words; higher values indicate more speech-like content (e.g., talk shows, audiobooks). |
| Tempo | Overall estimated tempo in beats per minute (BPM). |
| Valence | Musical positiveness conveyed (0.0–1.0); higher values sound more positive. |
Table 4. Data filtering criteria and their impact on the usable dataset size.
| Filtering Step | Criterion | Removed (N) | Removed (%) | Remaining (N) |
|---|---|---|---|---|
| Raw dataset | – | – | – | 101,939 |
| Speech-dominant removal | speechiness > 0.66 | 5362 | 5.26 | 96,577 |
| Very long track removal | duration > 7 min | 3792 | 3.72 | 92,785 |
| Short track removal | duration < 30 s | 268 | 0.26 | 92,517 |
Table 5. Precision, recall, and F1-score of regression and classification models.
| Model | Task | Macro-Precision (%) | Macro-Recall (%) | Macro-F1 (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Regression | 3-Class | 74.14 | 45.09 | 48.30 | 81.55 |
| Regression | 10-Class | 29.27 | 16.62 | 16.30 | 35.27 |
| Classification | 3-Class | 74.79 | 47.86 | 52.20 | 82.63 |
| Classification | 10-Class | 33.90 | 21.84 | 23.15 | 37.00 |
Table 7. Precision, recall, and F1-Score of ablated 10-class models and the base model using all features.
| Model | Macro-Precision (%) | Macro-Recall (%) | Macro-F1 (%) | Accuracy (%) |
|---|---|---|---|---|
| Without high-level audio descriptors (Spotify features) | 42.95 | 22.52 | 23.80 | 36.70 |
| Without lyric embeddings (BERT) | 37.50 | 23.94 | 25.98 | 36.56 |
| Without low-level audio (Mel-spectrogram CNN) | 35.64 | 23.03 | 24.90 | 37.08 |
| Without metadata (artist popularity, release year) | 27.11 | 14.21 | 13.65 | 27.11 |
| All features | 33.90 | 21.84 | 23.15 | 37.00 |
Table 8. Comparison with most directly comparable recent systems.
| Study (Year) | Modalities 1 | Dataset (Size) | Task | Number of Classes | Accuracy (%) | Macro-F1 (%) |
|---|---|---|---|---|---|---|
| Zangerle et al. (2019) [9] | A hi+lo, M | MSD + Billboard (11,000) | Hit vs. non-hit | 2 | 75.0 | – |
| Martín-Gutiérrez et al. (2020) [11] | A, L, M | SpotGenTrack (101,939) | Popularity | 3 | 83.0 | – |
| Vavaroutsos and Vikatos (2023) [10] | A, L, M | HSP + Genius (11,600) | Hit vs. non-hit | 2 | 80.0 | – |
| Our study (2025) | A (Mel), L, M | SpotGenTrack (92,517) | Popularity | 3/10 | 82.63/37.00 | 52.20/23.15 |
1 A = audio; L = lyrics; M = metadata; “hi+lo” = separate high-/low-level audio features.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
