1. Introduction
The digital revolution in music consumption has fundamentally transformed how listeners discover, access, and engage with musical content, generating high-dimensional datasets with millions of tracks and billions of user interactions that pose significant computational and statistical challenges. Music streaming platforms such as Spotify, Apple Music, and YouTube Music have become dominant channels for music consumption, collectively serving billions of users worldwide and accumulating streaming event logs at scales exceeding petabytes [1,2]. This transformation has generated large-scale multivariate time-series data regarding listener preferences, high-dimensional audio feature vectors, and complex engagement patterns, creating new opportunities for data-driven mathematical modeling of music popularity dynamics. Personalized music recommendation systems, formulated as large-scale optimization problems, require sophisticated machine learning algorithms to analyze user–item interaction matrices, emotional response embeddings, and temporal listening behaviors [3,4,5,6]. The integration of deep learning architectures, including convolutional transformers and neural networks with millions of trainable parameters, has significantly advanced music recommender systems by enabling more accurate hierarchical feature extraction from audio spectrograms and improved genre classification through non-linear function approximation [4,7]. These technological advances underscore the critical importance of developing rigorous mathematical frameworks for understanding the quantitative relationships between audio features and music popularity in streaming environments. Predicting music popularity represents a fundamental challenge in this ecosystem, with substantial implications for multiple stakeholders. For artists and composers, understanding the quantitative relationships between audio features and popularity metrics enables informed creative decisions and strategic positioning of musical releases through data-driven optimization [8,9]. Record labels and music producers utilize popularity prediction models as decision support systems to identify promising talent, optimize marketing resource allocation through multi-objective optimization, and maximize expected returns [10,11,12]. From broader perspectives, music popularity influences adolescent listening decisions through neural mechanisms associated with social influence [13], correlates with consumer sentiment and cryptocurrency market returns [14,15], and affects the competitive dynamics of live music venues and industry sustainability [16,17,18].
Empirical research on music popularity prediction has evolved substantially over the past decade, transitioning from simple linear regression models to sophisticated ensemble learning and deep neural network architectures capable of capturing complex non-linear mappings between feature spaces and target variables. Early studies focused on computing correlation coefficients between specific acoustic features and chart performance, demonstrating that characteristics such as danceability, happiness, and timbral qualities carry statistical predictive power for top-hit identification through hypothesis testing and feature selection methods [10,19]. Multimodal approaches combining audio analysis, lyrical content, and metadata through feature fusion and joint embedding spaces have shown superior performance compared to single-modality methods, with deep learning architectures achieving significant improvements in dimensionality reduction and popularity estimation through hierarchical representation learning [20,21]. Recent studies have investigated classification versus regression formulations, with clustering-based methods employing unsupervised learning algorithms and genre-specific approaches demonstrating enhanced accuracy through domain-specific model calibration and stratified modeling strategies [8,22]. The application of convolutional neural networks for proactive caching strategies in music video platforms has revealed the temporal variability of genre and mood preferences throughout the day, informing both popularity prediction and content delivery optimization [23]. Comprehensive systems integrating data collection, feature extraction, trend prediction, and personalized recommendation have demonstrated prediction accuracies exceeding 90.5%, providing technical support for music platform operations [24]. Large-scale empirical analyses utilizing Spotify’s global chart data have revealed that song popularity exhibits strong temporal dependencies, with streaming consumption patterns showing significant behavioral changes during events such as the COVID-19 pandemic [1,2]. Cross-cultural analyses have identified that musical attribute preferences (key, length, tempo) correlate with national cultural dimensions as defined by Hofstede’s framework, suggesting that culturally responsive marketing strategies should account for regional variations in musical taste [11]. Music mobility pattern studies have uncovered systematic propagation paths for songs across countries, reflecting migratory flows and sociocultural similarities that can inform predictive models for track dissemination [1]. Feature-level investigations consistently identify loudness, energy, and danceability as important predictors, though the magnitude and direction of these effects vary across genres and time periods [19,25,26].
Machine learning methodologies for music popularity prediction encompass diverse algorithmic approaches, each offering distinct advantages for handling the complex, high-dimensional nature of audio feature spaces through different inductive biases and optimization strategies. Traditional ensemble methods including Random Forest with bootstrap aggregation, XGBoost with gradient-based boosting, and stochastic gradient boosting have demonstrated robust performance across multiple music datasets, with Random Forest classifiers leveraging variance reduction through ensemble averaging to achieve up to 90% classification accuracy in large-scale genre classification tasks [27,28,29]. Comparative analyses between logistic regression with maximum likelihood estimation and support vector machines with kernel-based non-linear decision boundaries for predicting hit rates in music videos have shown that SVMs consistently outperform linear classifiers through implicit feature space mapping, achieving accuracies exceeding 87% [22]. Neural network architectures, particularly multilayer perceptrons with universal approximation properties and recurrent networks with temporal memory mechanisms, have been successfully applied to music skip prediction as sequential decision problems, where late fusion architectures effectively handle user-interaction noise through probabilistic modeling [25]. Network analysis approaches utilizing context-aware graphs and node embedding algorithms have enabled artist popularity estimation in music streaming services by capturing latent relationships between artists through social metadata and multimodal features [30,31]. Content-based approaches in music information retrieval have further enriched popularity modeling by establishing standardized track popularity datasets and multimodal feature representations that integrate audio signals with contextual metadata, enabling more nuanced characterization of audio-popularity relationships and informing content-driven recommendation frameworks [32,33,34]. Collaborative filtering methods enhanced with social trust information and behavioral features have improved recommendation accuracy while addressing neighborhood bias in music recommender systems [35,36,37], while ranking-based collaborative filtering in online music radios has demonstrated effectiveness in addressing the one-class recommendation problem where only listening records are available without explicit user ratings [38].
Despite significant progress in predictive accuracy and methodological sophistication, several fundamental challenges persist in music popularity prediction research that limit both theoretical understanding and practical applicability. Popularity bias represents a critical issue in music recommendation systems, where highly popular tracks receive disproportionate exposure at the expense of less mainstream content [39]. Counterfactual inference approaches have been proposed to mitigate both track-level and artist-level popularity biases through causal graph constructions and matrix factorization techniques [39]. Data sparsity and cold-start problems for new users and tracks remain persistent challenges, motivating hybrid recommendation strategies that combine collaborative filtering, content-based filtering, and emotion-driven approaches [5,6,40]. Model interpretability has emerged as a particularly critical concern, as the “black box” nature of deep learning models limits their practical applicability for music industry stakeholders requiring actionable insights into which specific audio characteristics drive popularity [41]. Feature extraction from symbolic music representations presents additional complexities, requiring specialized architectures such as type-based mixture of experts with semi-supervised multi-task pre-training to capture hierarchical and polyphonic musical structures [7]. Additionally, hyperparameter optimization in music prediction presents practical challenges: grid search becomes computationally intractable for high-dimensional hyperparameter spaces (6–8 parameters with continuous ranges); random search lacks principled exploration strategies for limited evaluation budgets; and manual tuning suffers from subjectivity and non-reproducibility. Bayesian optimization addresses these limitations by building probabilistic surrogate models of the objective function and using principled acquisition functions (such as Expected Improvement) to balance exploration and exploitation within constrained evaluation budgets, enabling sample-efficient search for well-performing model configurations. These challenges underscore the need for approaches that balance predictive performance with interpretability, enabling stakeholders to understand not only what popularity scores are predicted but also why specific predictions are made.
Emerging research directions are actively addressing these limitations by integrating multimodal data sources, advanced computational frameworks, and interdisciplinary perspectives to enhance both predictive accuracy and interpretability. Emotion-aware music generation systems combine facial expressions, speech prosody, and physiological signals (ECG) to create continuous valence-arousal-dominance representations that enable fine-grained emotional alignment between listeners and musical content [42]. Neurophysiological studies employing electroencephalography (EEG) have investigated the coherence between perceived and induced emotional responses during music video consumption, revealing over 80% similarity between EEG-derived and user-comment sentiment curves [43], while spontaneous visual imagery during extended music listening has been associated with reliable alpha suppression in brain activity [44]. Advanced signal processing techniques utilizing quaternion representations and vector-sensor arrays have demonstrated enhanced performance in music information retrieval tasks [45]. Virtual reality technologies are being integrated into music education and interactive experiences, with simulation systems demonstrating significant improvements in student learning efficiency and musical expression mastery [46]. Collaboration discovery frameworks based on heterogeneous knowledge graphs and neural networks facilitate identification of potential artist partnerships, addressing limitations in data accessibility and supporting cultural exchange in the global music landscape [47]. Additionally, innovations in aligned music notation and lyrics transcription employ end-to-end deep learning to preserve critical synchronization between musical elements and textual content [48]. These developments highlight the trend toward comprehensive, interpretable, and multimodal approaches in music information retrieval.
This study addresses an important challenge in interpretable yet accurate music popularity prediction by developing and evaluating comprehensive machine learning models using large-scale data from Spotify’s streaming platform. The music popularity prediction task is formulated as a supervised regression problem, where the objective is to learn a mapping from high-dimensional audio feature vectors to popularity scores by minimizing expected prediction error through empirical risk minimization on training data. Tree ensemble models are chosen for their balance of predictive performance and interpretability on tabular data: they provide inherent feature importance metrics, are compatible with post hoc explanation methods (SHAP), naturally handle mixed-scale features without extensive preprocessing, and demonstrate strong generalization on moderate-sized datasets, whereas deep neural networks may achieve marginal gains but lack transparency in feature–popularity relationships and require substantially larger training data. 
The specific research objectives are: (1) systematically comparing six machine learning regression algorithms (Random Forest, XGBoost, CatBoost, LightGBM, Extra Trees, and Decision Tree) to identify the most effective approach for popularity score prediction; (2) implementing Bayesian hyperparameter optimization with a Tree-structured Parzen Estimator to maximize predictive performance while ensuring robust model selection through cross-validation; (3) conducting comprehensive feature importance analysis to identify critical audio characteristics and metadata attributes that drive popularity; (4) applying Shapley Additive Explanations (SHAP) analysis to provide theoretically grounded, instance-level explanations of model predictions, addressing the interpretability challenge; and (5) analyzing temporal effects and content classification impacts on popularity through statistical visualizations to uncover non-linear relationships and interaction effects. By integrating Bayesian-tuned ensemble learning with game-theoretic interpretability methods, this work provides a well-validated benchmark and an end-to-end pipeline—from feature extraction through Bayesian-optimized model selection to SHAP-based interpretation—for tabular music popularity prediction, offering empirical observations on the audio and temporal factors associated with track popularity within the scope of the current dataset.
2. Materials and Methods
The overall methodology framework for this study comprises four sequential stages: (1) data acquisition from Spotify API and datasets; (2) data preprocessing and feature engineering; (3) model training with Bayesian hyperparameter optimization; and (4) model interpretability and analysis through feature importance and SHAP methods.
Figure 1 illustrates the complete workflow, detailing the data flow from raw music tracks through preprocessing steps, ensemble model training, to final interpretability analysis and performance evaluation.
2.1. Problem Formulation
The task is formulated as regression (rather than ranking or time-series prediction) based on the continuous nature of the target variable (Spotify popularity score: 0–100) and the cross-sectional structure of the dataset (single time-point snapshot without longitudinal repeated measurements). Regression enables prediction of absolute popularity values useful for recommendation thresholding and A/B testing, whereas ranking would provide only ordinal comparisons, and time-series methods would require temporal sequences unavailable in our data structure.
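The music popularity prediction task is mathematically formulated as a supervised regression problem. Given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ represents the $d$-dimensional feature vector (audio features and metadata) for track $i$ and $y_i \in [0, 100]$ represents the corresponding popularity score, the objective is to learn a function $f: \mathbb{R}^{d} \to \mathbb{R}$ that minimizes the expected prediction error:
$$f^{*} = \arg\min_{f \in \mathcal{F}} \; \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[ L\big(y, f(\mathbf{x})\big) \right]$$
where $\mathcal{F}$ is the hypothesis space (ensemble of decision trees in this study), $L$ is a loss function (squared error for regression), and $P(\mathbf{x}, y)$ is the unknown joint distribution of features and popularity scores. In practice, the true distribution $P$ is unknown, so empirical risk minimization is performed on the training set $\mathcal{D}_{\text{train}}$:
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n_{\text{train}}} \sum_{i=1}^{n_{\text{train}}} L\big(y_i, f(\mathbf{x}_i)\big)$$
where $n_{\text{train}}$ is the number of training samples. The learned function $\hat{f}$ approximates the optimal function $f^{*}$ and is evaluated on an independent test set to assess generalization performance.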
2.2. Data Collection and Preprocessing
Music popularity data for this study were obtained from Spotify’s comprehensive music database, which provides standardized audio feature measurements and popularity metrics for millions of tracks across various genres and time periods. The dataset was collected via the Spotify API with comprehensive audio characteristics and metadata. The target variable for prediction is the popularity score, a numerical value ranging from 0 to 100 that reflects a track’s current streaming performance and listener engagement on the Spotify platform.
The raw dataset includes the following audio features and metadata attributes: track ID, track name, popularity score (0–100), duration (ms), explicit content flag (binary), artist names, artist IDs, release date, danceability (0–1), energy (0–1), key (0–11), loudness (dB), mode (binary), speechiness (0–1), acousticness (0–1), instrumentalness (0–1), liveness (0–1), valence (0–1), tempo (BPM), time signature (3–7), and days since latest release. Once the five identifier and date fields (track ID, track name, artist names, artist IDs, and release date) are set aside, 16 variables remain: the popularity target and 15 predictors. Together, these attributes describe each track through acoustic, rhythmic, and tonal indicators. The temporal feature days_since_latest_release was computed as the integer number of days between each track’s release date (extracted from Spotify’s release_date field in the YYYY-MM-DD format) and the data collection snapshot date (16 April 2021), representing the time elapsed since release at the moment of data acquisition.
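As an illustration of the temporal feature, the following minimal sketch computes the day difference using only the Python standard library. The function name `days_since_release` is hypothetical, and the sketch assumes every release date is a full YYYY-MM-DD string as stated above; partially specified dates would need separate handling.

```python
from datetime import date

# Snapshot date stated in the text (16 April 2021).
SNAPSHOT_DATE = date(2021, 4, 16)

def days_since_release(release_date: str, snapshot: date = SNAPSHOT_DATE) -> int:
    """Return the whole number of days between a YYYY-MM-DD release date and the snapshot."""
    released = date.fromisoformat(release_date)  # raises ValueError for other formats
    return (snapshot - released).days

# Illustrative release dates (not taken from the dataset).
example_days = [days_since_release(d) for d in ("2021-04-16", "2021-04-01", "2020-04-16")]
print(example_days)  # [0, 15, 365]
```

A track released on the snapshot date therefore receives the value 0, and older tracks receive larger positive values.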
Comprehensive data preprocessing was conducted to ensure data quality and model reliability. Non-predictive features including track ID, track name, artist names, artist IDs, and release date were removed from the dataset, as these categorical and identifier variables do not contribute meaningful information for popularity prediction modeling. Missing values were addressed through complete case deletion, removing samples with any missing values to maintain data integrity.
All tracks including those with zero popularity scores were retained in the final dataset. Zero-popularity tracks may represent newly released content with insufficient streaming history, long-tail music with low exposure, or tracks from emerging artists, all of which constitute legitimate components of the streaming music ecosystem that prediction models should handle. This approach provides comprehensive coverage of the full popularity distribution (0–100).
After preprocessing, the final dataset comprises 170,366 tracks with 15 audio feature variables and covers the complete popularity spectrum (0–100), including zero-popularity tracks. All features were standardized using z-score normalization (StandardScaler) to ensure equal contribution scales across parameters with different measurement units. For each feature $x_j$, the standardized feature $z_j$ is computed as:
$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$
where $\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ represents the mean of feature $j$ across the $n$ samples used to fit the scaler, and $\sigma_j$ represents the corresponding standard deviation. On the data used for fitting, this transformation gives each standardized feature zero mean ($\mathbb{E}[z_j] = 0$) and unit variance ($\mathrm{Var}(z_j) = 1$), improving model convergence and prediction stability. The dataset was first split into training (70%) and testing (30%) sets using stratified random sampling. The StandardScaler was then fitted on the training set to compute $\mu_j$ and $\sigma_j$, and these parameters were applied to transform both training and testing sets. During 5-fold cross-validation within the training set, the scaler was independently re-fitted on each training fold and applied to the corresponding validation fold.
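The split-then-scale protocol described above can be sketched with scikit-learn as follows. This is an illustrative sketch on synthetic data, not the study's code: the quartile binning used to stratify a continuous target, the random seeds, and the forest size are all assumptions introduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(500, 4))  # synthetic stand-in features
y = np.clip(50 + 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 5, 500), 0, 100)

# Assumption: stratify the continuous target by quantile bins (quartiles here).
bins = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=bins, random_state=42
)

# Fit the scaler on the training set only, then reuse its mean/std on the test set.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Inside cross-validation, the pipeline re-fits the scaler on each training fold.
model = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=50, random_state=42)
)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_rmse = cross_val_score(
    model, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
)
cv_rmse = -neg_rmse.mean()
```

Wrapping the scaler and the model in one pipeline is what keeps validation-fold statistics out of the scaler during cross-validation; the standardized test features will generally not have exactly zero mean, which is expected.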
In terms of the four-stage framework above, data acquisition and preprocessing (stages 1 and 2) cover missing value handling and feature standardization under a leakage-free train–test split protocol; model development (stage 3) involves 70:30 train–test splitting, Bayesian hyperparameter optimization with 5-fold cross-validation, and model evaluation; and interpretability analysis (stage 4) uses feature importance and SHAP methods to clarify audio feature contributions to popularity prediction. All analyses were implemented in Python 3.8 with scikit-learn 1.0, XGBoost 1.5, CatBoost 1.0, LightGBM 3.3, and shap 0.41.0 libraries.
2.3. Machine Learning Modeling
To predict music popularity scores, six distinct machine learning regression models were utilized in this study: Random Forest, XGBoost, CatBoost, LightGBM, Extra Trees, and Decision Tree. These algorithms span a range of computational paradigms, from a single decision tree through bagging-style ensembles to gradient boosting. For regression tasks, the squared error loss function is employed:
$$L(y, \hat{y}) = (y - \hat{y})^{2}$$
which quantifies the discrepancy between the predicted popularity $\hat{y}$ and the true popularity $y$. The overall training objective minimizes the empirical risk with mean squared error (MSE):
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}$$
Ensemble methods are particularly well-suited for music popularity prediction due to their ability to capture complex non-linear relationships between audio features and popularity metrics. The expected prediction error of a model can be decomposed into bias, variance, and irreducible error components:
$$\mathbb{E}\left[\big(y - \hat{f}(\mathbf{x})\big)^{2}\right] = \mathrm{Bias}\big[\hat{f}(\mathbf{x})\big]^{2} + \mathrm{Var}\big[\hat{f}(\mathbf{x})\big] + \sigma_{\varepsilon}^{2}$$
where $\mathrm{Bias}[\hat{f}(\mathbf{x})]$ measures systematic prediction errors, $\mathrm{Var}[\hat{f}(\mathbf{x})]$ captures model sensitivity to training data variations, and $\sigma_{\varepsilon}^{2}$ represents irreducible noise. Averaging ensembles such as Random Forest and Extra Trees primarily reduce variance while maintaining reasonable bias levels, whereas boosting ensembles primarily reduce bias by fitting successive trees to the remaining error.
Random Forest (RF) constructs multiple decision trees during training and outputs the mean prediction of individual trees for regression tasks. Each tree is trained on a bootstrap sample of the data, and random feature subsets are considered at each split, reducing overfitting and improving generalization. Formally, Random Forest constructs an ensemble of $B$ decision trees $\{T_b\}_{b=1}^{B}$, where each tree is trained on a bootstrap sample $\mathcal{D}_b$ drawn with replacement from the training set $\mathcal{D}_{\text{train}}$, leaving approximately 36.8% of samples as out-of-bag (OOB) instances for internal validation. At each split node, a random subset of $d' \le d$ features is considered, introducing additional randomness beyond bagging. The final prediction aggregates individual tree predictions through averaging:
$$\hat{f}_{\text{RF}}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B} T_b(\mathbf{x})$$
The ensemble averaging reduces prediction variance, where random feature selection at each split decorrelates the trees, thereby decreasing overall ensemble variance. Extra Trees extends the Random Forest approach by introducing additional randomness in both the sample selection and split threshold selection, often leading to reduced variance at the cost of slightly increased bias. Instead of searching for the optimal split threshold, Extra Trees randomly selects split thresholds from a uniform distribution over the feature range at each node [49]. This randomization further decorrelates trees and reduces computational complexity.
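The 36.8% out-of-bag figure follows from the probability that a given sample is never drawn in $n$ draws with replacement, $(1 - 1/n)^{n} \to e^{-1} \approx 0.368$. The short simulation below checks this empirically; the sample size and number of bootstrap rounds are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees = 2000, 200  # arbitrary illustrative sizes

oob_fractions = []
for _ in range(n_trees):
    boot_idx = rng.integers(0, n_samples, size=n_samples)  # draw with replacement
    n_unique = np.unique(boot_idx).size                    # samples that made it in-bag
    oob_fractions.append(1.0 - n_unique / n_samples)

empirical_oob = float(np.mean(oob_fractions))
theoretical_oob = (1.0 - 1.0 / n_samples) ** n_samples     # approaches exp(-1)
print(round(empirical_oob, 3), round(theoretical_oob, 3))
```

Both quantities should land close to $e^{-1}$, which is why each tree has roughly a third of the training set available as an internal validation sample.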
XGBoost (Extreme Gradient Boosting) implements an optimized distributed gradient boosting framework that builds trees sequentially, with each new tree correcting errors made by the previous ensemble. The algorithm incorporates advanced regularization techniques (L1 and L2) to prevent overfitting and uses second-order Taylor expansion for loss function optimization, resulting in faster convergence and superior predictive performance. Gradient boosting methods construct the ensemble sequentially by fitting each new tree to the negative gradient of the loss function. The model at iteration $t$ is expressed as:
$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \eta\, f_t(\mathbf{x})$$
where $F_{t-1}$ is the ensemble from the previous iteration, $f_t$ is the new tree, and $\eta$ is the learning rate. For XGBoost, the objective function at iteration $t$ incorporates a second-order Taylor expansion with regularization:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\left[ g_i\, f_t(\mathbf{x}_i) + \frac{1}{2} h_i\, f_t^{2}(\mathbf{x}_i) \right] + \Omega(f_t)$$
where $g_i = \partial L\big(y_i, \hat{y}_i^{(t-1)}\big) / \partial \hat{y}_i^{(t-1)}$ and $h_i = \partial^{2} L\big(y_i, \hat{y}_i^{(t-1)}\big) / \partial \big(\hat{y}_i^{(t-1)}\big)^{2}$ are the first and second-order gradients of the loss function with respect to the predictions. The regularization term is defined as:
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
where $T$ is the number of leaves, $w_j$ is the weight of leaf $j$, and $\gamma, \lambda$ are regularization parameters controlling model complexity through both tree structure and leaf weights, preventing overfitting to training data.
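To make the second-order objective concrete, the sketch below evaluates the standard closed-form results for a fixed tree structure, assuming a regularizer of the form $\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_j w_j^2$ and the squared error loss $(y - \hat{y})^2$: the optimal leaf weight is $w^{*} = -G/(H + \lambda)$, with $G$ and $H$ the sums of first and second derivatives in the leaf. The function names and toy numbers are hypothetical and are not taken from the study.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf weight w* = -G / (H + lambda) for the instances in one leaf."""
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g, h, left_mask, lam, gamma):
    """Objective reduction from splitting one leaf into left/right children."""
    def score(gs, hs):
        return np.sum(gs) ** 2 / (np.sum(hs) + lam)
    right_mask = ~left_mask
    children = score(g[left_mask], h[left_mask]) + score(g[right_mask], h[right_mask])
    return 0.5 * (children - score(g, h)) - gamma

y = np.array([10.0, 12.0, 30.0, 34.0])  # toy popularity targets
y_prev = np.full_like(y, 20.0)          # prediction of the previous ensemble
g = 2.0 * (y_prev - y)                  # first derivative of (y - y_hat)^2
h = np.full_like(y, 2.0)                # second derivative of (y - y_hat)^2

w_no_reg = leaf_weight(g, h, lam=0.0)   # 1.5, the mean residual
w_reg = leaf_weight(g, h, lam=2.0)      # 1.2, shrunk toward zero by lambda
gain = split_gain(g, h, np.array([True, True, False, False]), lam=0.0, gamma=0.0)
print(w_no_reg, w_reg, gain)
```

Without regularization the leaf weight reduces to the mean residual; increasing $\lambda$ shrinks it, and $\gamma$ acts as a minimum gain a split must exceed to be worthwhile.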
CatBoost (Categorical Boosting) is a gradient boosting algorithm specifically designed to handle categorical features efficiently through ordered target statistics. Although the music dataset primarily contains continuous audio features, CatBoost’s symmetric tree structure and ordered boosting approach provide robustness against overfitting and effective handling of prediction noise. The ordered target statistic scheme prevents target leakage while utilizing label information through a random permutation-based approach with smoothing parameters.
LightGBM (Light Gradient Boosting Machine) employs a novel gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) approach to achieve faster training speed and lower memory consumption while maintaining high accuracy. GOSS retains all instances with large gradients and performs random sampling among instances with small gradients, with an amplification factor compensating for the sampling bias in information gain estimation. The algorithm grows trees leaf-wise rather than level-wise, enabling more efficient exploration of the feature space.
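The GOSS step can be sketched in a few lines of NumPy. This is a simplified illustration of the sampling idea rather than LightGBM's internal implementation; the function name `goss_sample` and the rates 0.2 and 0.1 are assumptions chosen for the example.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling."""
    rng = np.random.default_rng(seed)
    n = gradients.size
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top_idx = order[:n_top]                  # large-gradient instances: always kept
    rest_idx = order[n_top:]
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    # Amplify the sampled small-gradient instances to offset the subsampling.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, sampled_idx]), weights

grads = np.random.default_rng(1).normal(size=1000)  # synthetic gradients
idx, w = goss_sample(grads)
print(idx.size, w.sum())
```

With these rates, 300 of 1000 instances are retained, and the amplification factor of 8 makes the weighted count of the retained set equal to the original 1000, which is the sense in which the sampling bias is compensated.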
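Decision Tree serves as a baseline single-tree model for comparison with ensemble methods. While less robust than ensemble approaches, decision trees provide interpretable decision rules and help quantify the performance gains achieved through ensemble strategies. Mathematically, a decision tree partitions the feature space $\mathcal{X} \subseteq \mathbb{R}^{d}$ into $M$ disjoint regions $R_1, \dots, R_M$ through recursive binary splitting. The prediction function can be expressed as:
$$f(\mathbf{x}) = \sum_{m=1}^{M} c_m\, \mathbb{1}(\mathbf{x} \in R_m)$$
where $c_m$ is the predicted value for region $R_m$, $\mathbb{1}(\cdot)$ is the indicator function, and for regression tasks, $c_m = \frac{1}{|R_m|}\sum_{\mathbf{x}_i \in R_m} y_i$ represents the mean target value in region $R_m$. The optimal split at each node is determined by minimizing the sum of squared residuals:
$$\min_{j,\, s}\left[ \sum_{\mathbf{x}_i \in R_1(j, s)} (y_i - c_1)^{2} + \sum_{\mathbf{x}_i \in R_2(j, s)} (y_i - c_2)^{2} \right]$$
where $j$ denotes the splitting feature, $s$ denotes the splitting threshold, and $R_1(j, s) = \{\mathbf{x} \mid x_j \le s\}$, $R_2(j, s) = \{\mathbf{x} \mid x_j > s\}$ are the resulting child regions.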
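The split criterion can be illustrated with an exhaustive threshold search over a single feature. The sketch below is a minimal version for one node and one feature, with a hypothetical function name and toy values; production implementations use incremental sums rather than recomputing each side from scratch.

```python
import numpy as np

def best_split(x, y):
    """Search thresholds on one feature; return (threshold, sum of squared residuals)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_s, best_sse = None, np.inf
    for i in range(1, x_sorted.size):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no valid threshold between equal feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # Each child predicts its own mean, so the cost is the within-child SSE.
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_s = 0.5 * (x_sorted[i - 1] + x_sorted[i])  # midpoint threshold
            best_sse = sse
    return best_s, best_sse

x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])         # toy danceability-like feature
y = np.array([10.0, 12.0, 11.0, 40.0, 42.0, 41.0])   # toy popularity scores
threshold, sse = best_split(x, y)
print(threshold, sse)
```

On this toy data the search places the threshold midway between 0.3 and 0.7, separating the low-popularity and high-popularity groups; a full tree repeats the same search over every feature and recurses on the two child regions.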
For machine learning modeling, dataset partitioning into training and testing subsets is essential. The preprocessed dataset was split into 70% training data (for model learning) and 30% testing data (for independent performance evaluation) using stratified sampling to maintain popularity distribution consistency. Formally, the complete dataset $\mathcal{D}$ is partitioned into disjoint subsets:
$$\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}}, \qquad \mathcal{D}_{\text{train}} \cap \mathcal{D}_{\text{test}} = \emptyset$$
where $|\mathcal{D}_{\text{train}}| \approx 0.7\,|\mathcal{D}|$ and $|\mathcal{D}_{\text{test}}| \approx 0.3\,|\mathcal{D}|$. Models derive patterns from training samples, capturing input–output feature relationships to build predictive models for popularity assessment. The testing subset then evaluates each model’s performance and generalization ability on unseen data. The evaluation protocol strictly separates model selection from final testing: hyperparameter optimization uses 5-fold cross-validation exclusively within the training set to prevent data leakage, and the test set remains completely untouched until final evaluation to ensure unbiased performance estimates. To ensure reproducibility, all experiments use consistent random seeds for dataset splitting, hyperparameter optimization, cross-validation fold generation, and interpretability analysis sampling.
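Hyperparameter tuning was performed to find optimal hyperparameters for all six models. Bayesian optimization with Tree-structured Parzen Estimator (TPE) was implemented using the Optuna framework (version 3.1.0), conducting 50 optimization trials per model. The optimization objective was to minimize the validation Root Mean Squared Error (RMSE). Bayesian optimization treats hyperparameter tuning as a sequential decision problem over the objective function $J(\boldsymbol{\theta})$ (RMSE on validation data for hyperparameter configuration $\boldsymbol{\theta}$). Unlike Gaussian-process-based Bayesian optimization, which models the objective value given the configuration, TPE models the configuration given the objective value. At each iteration $t$, the Tree-structured Parzen Estimator constructs two density models:
$$p(\boldsymbol{\theta} \mid J) = \begin{cases} \ell(\boldsymbol{\theta}), & J < J^{*} \\ g(\boldsymbol{\theta}), & J \ge J^{*} \end{cases}$$
where $J^{*}$ is a quantile threshold of the objective values observed so far, $\ell(\boldsymbol{\theta})$ models configurations yielding good performance, and $g(\boldsymbol{\theta})$ models poor configurations. The next hyperparameter configuration is selected by maximizing the Expected Improvement (EI) acquisition function:
$$\mathrm{EI}_{J^{*}}(\boldsymbol{\theta}) = \int_{-\infty}^{J^{*}} (J^{*} - J)\, p(J \mid \boldsymbol{\theta})\, dJ$$
Equivalently, TPE selects $\boldsymbol{\theta}_{t+1} = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) / g(\boldsymbol{\theta})$, favoring regions with high probability under $\ell(\boldsymbol{\theta})$ and low probability under $g(\boldsymbol{\theta})$. Five-fold cross-validation was incorporated into the optimization search to prevent overfitting and ensure robust hyperparameter selection. Cross-validation partitions the training set $\mathcal{D}_{\text{train}}$ into five disjoint subsets $\{\mathcal{D}_k\}_{k=1}^{5}$ of approximately equal size, where for each fold $k$, the model is trained on $\mathcal{D}_{\text{train}} \setminus \mathcal{D}_k$ and validated on $\mathcal{D}_k$. The cross-validated performance is computed as:
$$\mathrm{RMSE}_{\text{CV}} = \frac{1}{5}\sum_{k=1}^{5} \mathrm{RMSE}_k$$
where $\mathrm{RMSE}_k$ is the root mean squared error on fold $k$. The hyperparameter search ranges and optimal values for each model are detailed in Table 1.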
2.4. Model Evaluation
To evaluate the performance of the developed machine learning regression models, three widely used metrics were adopted: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination ($R^{2}$). The mathematical formulations for these metrics are given as follows.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}$$
Here, $n$ denotes the number of samples in the test set, $y_i$ represents the actual popularity score, $\hat{y}_i$ represents the predicted popularity score, and $\bar{y}$ represents the mean of actual popularity scores. RMSE measures the square root of the average squared differences between predicted and actual values, with higher penalty for larger errors. MAE measures the average absolute differences between predictions and actual values, providing a more interpretable error metric. $R^{2}$ measures the proportion of variance in the popularity scores that is predictable from the audio features; values typically fall between 0 and 1, with higher values indicating better model fit, although $R^{2}$ can become negative on held-out data when a model predicts worse than the mean of the actual scores.
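The three metrics can be computed directly with NumPy, as in the sketch below. The helper name `regression_metrics` and the toy scores are illustrative; the second call shows a case where predictions are worse than simply predicting the mean.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for paired actual and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    return rmse, mae, r2

y_true = [20, 40, 60, 80]                             # toy popularity scores
good = regression_metrics(y_true, [25, 35, 65, 75])   # close predictions
bad = regression_metrics(y_true, [80, 60, 40, 20])    # reversed ordering
print(good)  # RMSE 5.0, MAE 5.0, R^2 approximately 0.95
print(bad)   # R^2 is approximately -3 for the reversed predictions
```

The reversed predictions illustrate why a bound of 0 on $R^{2}$ holds only for models that do at least as well as the mean predictor on the evaluated data.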
2.5. Feature Importance Analysis
Understanding which audio features contribute most significantly to popularity prediction is crucial for both model interpretation and practical music industry applications. Two complementary approaches were employed to analyze feature importance: model-specific importance measures and unified SHAP analysis.
2.5.1. Model-Specific Feature Importance
Feature importance was extracted using model-specific methods tailored to each algorithm’s internal structure. Tree models (Random Forest, XGBoost, CatBoost, LightGBM, Extra Trees, and Decision Tree) provide intrinsic importance measures based on the frequency and effectiveness of feature splits across all trees in the ensemble. For Random Forest and Extra Trees, the Gini importance (mean decrease in impurity) for feature $j$ is computed as:
$$\mathrm{Imp}(j) = \frac{1}{B}\sum_{b=1}^{B} \sum_{v \in V_b(j)} \frac{N_v}{N}\, \Delta G(v)$$
where $B$ is the number of trees, $V_b(j)$ is the set of nodes in tree $b$ that split on feature $j$, $N_v$ is the number of samples reaching node $v$, $N$ is the total number of samples, and $\Delta G(v)$ is the decrease in impurity produced by the split at node $v$, that is, the impurity $G(v)$ minus the sample-weighted impurity of its two children. Here $G(v) = 1 - \sum_{k} p_{vk}^{2}$ is the Gini impurity at node $v$ with $p_{vk}$ being the proportion of class $k$ samples. For regression tasks, the variance reduction replaces Gini impurity. The importance score for each feature quantifies its contribution to reducing prediction error across all decision nodes. For ensemble models, importance values are averaged across all trees in the forest or boosting sequence, providing a robust measure of feature relevance.
Feature importance rankings were normalized to sum to 1.0 for comparison across models, enabling direct assessment of relative feature contributions. It is important to note that different tree models use distinct importance measures: Random Forest uses mean decrease in Gini impurity, XGBoost uses total gain, LightGBM uses split-weighted gain, CatBoost uses PredictionValuesChange, Extra Trees uses Gini-based importance, and Decision Tree uses feature contribution to node impurity reduction. Our analysis focuses on the relative ranking of features rather than comparing absolute importance scores across models, as the ordinal rankings provide more robust insights into which features are most predictive regardless of the specific computation method. Consistent rankings across multiple algorithmic approaches provide stronger evidence for genuine feature relevance, while divergent rankings may indicate model-specific biases or algorithmic differences in handling feature interactions.
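The normalization and rank comparison described above can be sketched as follows. The importance scores are hypothetical values on two different native scales, invented for illustration, and are not results from the study.

```python
import numpy as np

features = ["loudness", "energy", "danceability", "tempo"]
# Hypothetical raw importance scores on each model's native scale.
raw = {
    "random_forest": np.array([0.40, 0.30, 0.20, 0.10]),
    "xgboost_gain": np.array([120.0, 60.0, 90.0, 30.0]),
}

def normalize(scores):
    """Rescale importance scores so that they sum to 1.0."""
    return scores / scores.sum()

def ranks(scores):
    """Rank 1 = most important (ties are not handled in this sketch)."""
    order = np.argsort(-scores)
    r = np.empty_like(order)
    r[order] = np.arange(1, scores.size + 1)
    return r

normalized = {name: normalize(s) for name, s in raw.items()}
rankings = {name: ranks(s) for name, s in normalized.items()}
mean_rank = np.mean(list(rankings.values()), axis=0)

for name, value in zip(features, mean_rank):
    print(name, value)
```

In this toy case both models agree on the most and least important feature while disagreeing on the middle two, which is the kind of pattern the rank-based comparison is meant to expose without comparing raw scores across incompatible scales.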
2.5.2. SHAP Analysis for Model Interpretability
To provide unified and theoretically grounded feature importance analysis, SHAP (SHapley Additive exPlanations) was applied to the Random Forest model. Random Forest was selected for SHAP analysis because it achieved the best predictive performance among all evaluated models (RMSE = 6.79, MAE = 5.10, and $R^2$ = 0.6658), providing the most reliable basis for understanding feature contributions to successful popularity predictions. SHAP values are based on cooperative game theory, specifically the Shapley value concept from coalition games, which fairly distributes the contribution of each feature to individual predictions.
The SHAP framework decomposes each prediction into additive feature contributions:
$$f(x) = \phi_0 + \sum_{i \in F} \phi_i$$
where $f(x)$ is the model prediction (popularity score), $\phi_0$ is the base value (mean model prediction over the held-out test set), and $\phi_i$ represents the SHAP value (contribution) of audio feature $i$. The Shapley value $\phi_i$ for feature $i$ is rigorously defined as:
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}\left[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)\right]$$
where $F$ is the set of all features, $S$ represents a subset of features (coalition), $f_S(x_S)$ is the expected model output conditioned on features in $S$, and $|S|$ denotes the cardinality of set $S$. This formulation computes the weighted average of marginal contributions across all possible feature coalitions. The SHAP values satisfy three fundamental axioms: efficiency (contributions sum to the difference between prediction and base value), symmetry (features with identical contributions receive equal SHAP values), and dummy (irrelevant features receive zero contribution).
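The conditional expectation $f_S(x_S)$ is computed as:
$$f_S(x_S) = \mathbb{E}\left[f(x) \mid x_S\right] = \int f(x_S, x_{\bar{S}})\, p(x_{\bar{S}} \mid x_S)\, dx_{\bar{S}}$$
where $x_{\bar{S}}$ represents features not in $S$ and $p(x_{\bar{S}} \mid x_S)$ is the conditional distribution.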
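To make the coalition-weighted definition concrete, the following sketch computes exact Shapley values by brute-force enumeration for a three-feature toy model and checks the efficiency axiom. It is illustrative only: the toy model and background rows are hypothetical, and the conditional expectation is approximated by averaging over background rows with the features in the coalition held fixed, which assumes feature independence and is not how TreeExplainer operates internally.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values by enumerating every coalition S of F minus {i}."""
    phi = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def make_value(model, x, background):
    """Approximate f_S(x_S): fix features in S to x, average the rest over rows."""
    def value(s):
        outputs = []
        for row in background:
            mixed = [x[j] if j in s else row[j] for j in range(len(x))]
            outputs.append(model(mixed))
        return sum(outputs) / len(outputs)
    return value


def model(z):
    """Hypothetical toy model with an interaction term (not the Random Forest)."""
    return 2.0 * z[0] + z[1] * z[2]


background = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
x = [1.0, 2.0, 3.0]
value = make_value(model, x, background)
phi = shapley_values(value, n_features=3)
base = value(frozenset())

# Efficiency: base value plus all contributions recovers the prediction.
print([round(p, 4) for p in phi], round(base + sum(phi), 4), model(x))
```

The enumeration visits every coalition, so its cost grows exponentially with the number of features; this is the cost that tree-specific algorithms avoid.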
For tree models like Random Forest, SHAP values are computed using TreeExplainer, which efficiently calculates exact Shapley values by leveraging the tree structure without explicit enumeration of all coalitions. The algorithm traces all possible paths through the decision trees and computes the marginal contribution of each audio feature across all possible feature coalitions. For a tree ensemble $f(x) = \frac{1}{T}\sum_{t=1}^{T} f_t(x)$, the SHAP value decomposes additively across trees: $\phi_i = \frac{1}{T}\sum_{t=1}^{T} \phi_i^{(t)}$, where $\phi_i^{(t)}$ is the SHAP value for feature $i$ in tree $t$, enabling efficient parallel computation. In this study, shap.TreeExplainer from the Python shap library was applied directly to the fitted Random Forest ensemble. Since TreeExplainer derives Shapley values analytically from the tree structure, no separate background dataset is required. SHAP values were computed on the held-out test set, so that the feature attribution results reflect the model’s behavior on unseen data rather than on samples seen during training. SHAP analysis was conducted exclusively for the best-performing Random Forest model; the deep learning and baseline models included in the comparison table were not subjected to SHAP analysis.
SHAP analysis provides both global feature importance (aggregated across all test set predictions) and local explanations (individual track predictions). Summary plots visualize the distribution of SHAP values for each feature, revealing not only feature importance but also the direction and magnitude of feature effects. Dependence plots illustrate the relationship between feature values and their SHAP contributions, enabling identification of non-linear patterns and interaction effects between audio characteristics.
3. Results and Discussion
3.1. Data Statistical Results and Visualization
The comprehensive statistical analysis of music popularity and audio features is presented in
Figure 2, which provides critical insights into the dataset characteristics and relationships between audio parameters and popularity metrics. The analysis encompasses popularity distribution patterns, feature correlation structures, and temporal trends that inform model development and interpretation.
Figure 2a illustrates the popularity score distribution across the entire dataset, revealing an approximately normal distribution with a mean popularity of 39.4 and a median of 39.0. The kernel density estimation (KDE) curve demonstrates that most tracks concentrate in the 20–60 popularity range, with a relatively symmetric distribution around the central tendency. Because the complete popularity spectrum is retained, including zero-popularity tracks (3.7% of the dataset), the distribution captures the full landscape of streaming music, from newly released content with minimal exposure to highly popular tracks achieving widespread engagement. This comprehensive distribution pattern reflects the actual heterogeneity of music streaming platforms, where tracks span from long-tail content with limited reach to mainstream releases with substantial listener engagement.
The feature correlation matrix presented in
Figure 2b reveals the relationships between popularity and key audio features. Notably, the correlation coefficients between individual audio features and popularity are generally modest, with loudness showing the strongest positive correlation with popularity, followed by energy and danceability. Energy demonstrates a strong positive correlation with loudness, reflecting the physical relationship between sound intensity and perceived energy. Danceability shows a moderate positive correlation with valence and energy, indicating that more energetic and positive-valenced tracks tend to be more danceable. The generally modest correlations between individual features and popularity underscore the necessity of machine learning approaches, as popularity prediction requires capturing complex non-linear interactions between multiple audio characteristics rather than relying on single-feature relationships.
Figure 2c examines the relationship between loudness and popularity, colored by energy levels. The scatter plot reveals a clear positive trend, with louder tracks generally achieving higher popularity scores. The trend line demonstrates that popularity increases approximately from 20 to 50 as loudness increases from −40 dB to 0 dB. The color gradient indicates that higher-energy tracks (yellow-green) tend to cluster at higher loudness levels, suggesting that energetic, loud tracks may have advantages in capturing listener attention and achieving streaming success. This finding aligns with the “loudness war” phenomenon in music production, where increased loudness is often pursued to enhance perceived impact and competitiveness.
Figure 2d presents violin plots for five key audio features, revealing their distribution characteristics. Danceability, energy, and valence all show relatively wide distributions spanning most of the 0–1 range, indicating diverse musical styles in the dataset. Acousticness exhibits a strongly right-skewed distribution with most tracks having low acousticness (median ≈ 0.2), reflecting the dominance of electronically produced or amplified music in modern streaming platforms. Liveness shows the most concentrated distribution with very low values (median ≈ 0.1), indicating that most tracks in the dataset are studio recordings rather than live performances.
The temporal analysis in Figure 2e reveals substantial increases in mean popularity over time, rising from approximately 26 in 1990 to nearly 48 in 2020. This trend reflects multiple factors including the growth of streaming platforms, changing listener demographics, and potential selection biases where older tracks in the dataset may represent only the most enduringly popular classics while recent releases include a broader range of popularity levels. The 95% confidence interval band, computed as
$$\bar{y}_t \pm 1.96\,\frac{s_t}{\sqrt{n_t}}$$
where $\bar{y}_t$ is the mean popularity, $s_t$ the within-year standard deviation, and $n_t$ the number of tracks released in year $t$, widens in recent years, indicating increased variability in track popularity as the music industry becomes more diverse and fragmented.
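A per-year confidence band of this form can be sketched as below. The multiplier 1.96 is an assumption (the usual normal-approximation value for a 95% interval), and the popularity scores are hypothetical placeholders rather than values from the dataset.

```python
from math import sqrt
from statistics import mean, stdev


def yearly_confidence_band(popularity_by_year, z=1.96):
    """Return {year: (mean, lower, upper)} using mean +/- z * s / sqrt(n)."""
    band = {}
    for year, scores in popularity_by_year.items():
        n = len(scores)
        centre = mean(scores)
        # statistics.stdev is the sample (n - 1) standard deviation.
        half_width = z * stdev(scores) / sqrt(n)
        band[year] = (centre, centre - half_width, centre + half_width)
    return band


# Hypothetical popularity scores for two release years.
toy = {1990: [20, 24, 28, 32], 2020: [30, 40, 50, 60, 70]}
band = yearly_confidence_band(toy)
for year, (centre, lower, upper) in band.items():
    print(year, round(centre, 2), round(lower, 2), round(upper, 2))
```

The half-width shrinks with the square root of the yearly track count and grows with the within-year spread, so a band that widens despite more tracks points to greater dispersion in popularity.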
Figure 2f analyzes the relationship between audio activity levels (composite score of danceability, energy, and valence) and explicit content. The results demonstrate that explicit tracks consistently achieve higher mean popularity across all activity levels, with the gap most pronounced for medium to high-activity tracks. Very High-activity tracks show a mean popularity of 45.3 for explicit content versus 36.3 for non-explicit, a difference of 9 popularity points. This pattern suggests that explicit content may appeal to specific demographic segments that engage more actively with streaming platforms or that explicit tracks receive preferential algorithmic treatment due to higher engagement metrics.
These statistical patterns provide crucial insights for machine learning model development, indicating that temporal features (days since release), acoustic intensity features (loudness, energy), and content classification (explicit flag) are likely to emerge as important predictors of popularity, while traditional musical structure features (key, time signature, mode) may have limited discriminative power.
3.2. Comparison of Model Performance
Six machine learning regression models were trained and optimized using Bayesian hyperparameter tuning on the training set, with the optimal hyperparameters for each model detailed in Table 1. Subsequently, these optimal hyperparameters were utilized in their respective models to predict music popularity scores on the independent testing set. The comprehensive performance comparison across RMSE, MAE, and $R^2$ metrics is presented in Figure 3.
Figure 3a presents the RMSE comparison across all six models, where lower values indicate better performance. Random Forest achieved the lowest RMSE of 6.79, demonstrating superior predictive accuracy among all tested algorithms. Extra Trees (7.07) and Decision Tree (7.12) also showed competitive performance, followed by XGBoost (7.40), LightGBM (7.75), and CatBoost (7.93). All models maintained RMSE values below 8.0, indicating reasonable prediction accuracy across the popularity spectrum.
Figure 3b illustrates the MAE comparison, which provides a more interpretable measure of average prediction error. Random Forest again demonstrated the best performance with an MAE of 5.10, indicating that on average, predictions deviate by approximately 5.1 popularity points from actual values. Decision Tree (5.30) and Extra Trees (5.46) showed comparable performance, while XGBoost (5.53), LightGBM (5.76), and CatBoost (5.87) exhibited slightly higher average errors but remained within acceptable ranges for practical applications.
Figure 3c displays the $R^2$ score comparison, where higher values indicate better model fit. Random Forest achieved the highest $R^2$ of 0.6658, explaining approximately 66.58% of the variance in popularity scores. Extra Trees ($R^2$ = 0.6378) and Decision Tree ($R^2$ = 0.6328) demonstrated comparable explanatory power, while XGBoost ($R^2$ = 0.6027), LightGBM ($R^2$ = 0.5643), and CatBoost ($R^2$ = 0.5438) showed progressively lower predictive capability. The $R^2$ values above 0.54 for all models indicate that audio features combined with temporal and content metadata provide meaningful predictive signals for music popularity. The remaining 33–46% of unexplained variance is likely attributable to external factors such as marketing efforts, artist reputation, playlist placements, social media trends, and cultural dynamics not captured in acoustic features alone.
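For reference, the three metrics compared above can be computed as in the following sketch, which assumes their standard definitions. The target and prediction vectors are hypothetical values chosen for easy checking, not model outputs from the study.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical popularity scores and predictions.
y_true = [10.0, 30.0, 50.0, 70.0]
y_pred = [14.0, 28.0, 47.0, 65.0]
print(round(rmse(y_true, y_pred), 4), mae(y_true, y_pred), round(r2(y_true, y_pred), 4))
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions, consistent with the reported values (for example, 6.79 versus 5.10 for Random Forest).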
Figure 4 presents the scatter plots of predicted versus actual popularity scores for all six models on the testing set. Each subplot displays the relationship between model predictions and ground truth values, with the red dashed diagonal line representing perfect predictions and the green solid line showing the actual linear fit. Random Forest (Figure 4c) demonstrates the tightest clustering around the perfect prediction line, consistent with its lowest RMSE and highest $R^2$ values. The scatter patterns reveal that all models tend to perform better for mid-range popularity scores (30–50) compared to extreme values, suggesting that tracks with very low or very high popularity may be influenced by factors beyond acoustic features, such as viral marketing campaigns, artist fame, or algorithmic playlist placements. The prediction residuals show relatively homoscedastic patterns across most popularity ranges, indicating that model errors do not systematically increase with popularity magnitude. However, slight underprediction tendencies are observable for highly popular tracks (popularity > 60), where predictions cluster slightly below the perfect prediction line. This systematic bias suggests that exceptional popularity achievements involve non-acoustic factors (artist reputation, social media trends, and playlist curation) that are not captured in the audio feature space.
Figure 5 compares the distributions of actual versus predicted popularity scores for each model using histograms with kernel density estimation curves. The close overlap between actual (blue) and predicted (colored) distributions indicates successful model calibration, where models not only predict individual samples accurately but also preserve the overall population distribution characteristics. All models show good distributional alignment with actual popularity scores, demonstrating that the models successfully learned the underlying popularity distribution across the complete spectrum including zero-popularity and long-tail tracks. The slightly narrower predicted distributions across all models reflect regression toward the mean, a common characteristic where models predict more conservative values for extreme cases, particularly for very low and very high popularity tracks.
3.3. Comparison with Baseline and Deep Learning Models
To assess whether the complexity of tree ensemble models is justified for music popularity prediction, we compared their performance against simpler baseline models and alternative deep learning architectures, all implemented by us under identical experimental conditions on the same dataset and feature set. This within-study comparison provides insights into relative model complexity trade-offs for this specific prediction task, and should not be interpreted as a head-to-head comparison with results published in prior literature, given the differences in datasets, feature sets, and evaluation protocols across studies.
Table 2 presents a comprehensive comparison of all evaluated models, including traditional statistical methods, deep learning approaches, and tree ensembles.
The results reveal several important findings regarding model complexity and performance trade-offs. Tree ensemble models substantially outperform the baseline Logistic Regression model, which achieves only $R^2$ = 0.1455 (RMSE = 10.85). For Random Forest, this corresponds to a 60.9% relative reduction in unexplained variance compared to the baseline (from 85.45% unexplained to 33.42% unexplained), demonstrating that the prediction task exhibits strong nonlinear relationships and feature interactions that cannot be adequately captured by linear decision boundaries. The poor performance of Logistic Regression confirms that music popularity prediction requires models capable of capturing complex, hierarchical patterns in the feature space.
Deep learning architectures (Transformer, LSTM, GRU) achieve intermediate performance between tree ensembles and the baseline model, with $R^2$ values ranging from 0.4919 to 0.5107. The Transformer architecture performs best among deep learning models ($R^2$ = 0.5107), leveraging self-attention mechanisms to capture feature dependencies. However, within this experimental setup, all deep learning models show lower performance than tree ensembles, with Random Forest achieving 30.4% higher $R^2$ than Transformer and 35.4% higher $R^2$ than GRU under the current dataset and protocol. This performance gap can be attributed to several factors: deep learning models typically require substantially larger training datasets to achieve optimal performance, whereas tree models demonstrate strong generalization with moderate-sized datasets. The tabular nature of audio feature data (15 numerical features) is particularly well-suited to tree splitting and aggregation, whereas deep learning excels with high-dimensional sequential or spatial data. Tree models naturally handle mixed-scale features and do not require extensive hyperparameter tuning for feature preprocessing, whereas deep learning models are sensitive to normalization schemes and architectural choices.
Among tree models, the bagging-based ensembles (Random Forest and Extra Trees) outperform the single Decision Tree ($R^2$ = 0.6328), though the performance gap is relatively modest (5.2% improvement for Random Forest), whereas the boosting models (XGBoost, LightGBM, and CatBoost) fall below the single tree in this setup. This indicates that while ensemble aggregation can provide measurable benefits through variance reduction and better generalization, the underlying tree structure itself captures the majority of the predictive signal. The strong performance of Decision Trees relative to deep learning models ($R^2$ = 0.6328 vs. 0.4919–0.5107) further confirms that hierarchical feature partitioning is fundamentally well-aligned with the structure of music popularity prediction.
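The relative comparisons quoted in this subsection follow directly from the reported scores, as the short sketch below shows. Assigning 0.4919 to GRU is an inference from the reported range and the 35.4% figure, not a value stated explicitly in the text; the helper names are illustrative.

```python
# Reported R^2 values; the GRU entry is inferred from the quoted range.
r2_scores = {
    "Random Forest": 0.6658,
    "Decision Tree": 0.6328,
    "Transformer": 0.5107,
    "GRU": 0.4919,
    "Logistic Regression": 0.1455,
}


def relative_gain(r2_model, r2_reference):
    """Percentage by which r2_model exceeds r2_reference."""
    return 100.0 * (r2_model - r2_reference) / r2_reference


def unexplained_share(r2_value):
    """Percentage of variance left unexplained, 100 * (1 - R^2)."""
    return 100.0 * (1.0 - r2_value)


best = r2_scores["Random Forest"]
for name in ("Transformer", "GRU", "Decision Tree"):
    print(name, round(relative_gain(best, r2_scores[name]), 1))
for name in ("Logistic Regression", "Random Forest"):
    print(name, round(unexplained_share(r2_scores[name]), 2))
```

Relative gains in this sense are ratios of $R^2$ values, so they grow quickly when the reference score is small and should be read alongside the absolute differences.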
These findings support the use of tree ensemble models for this prediction task within the current experimental setup. The consistently strong performance over baseline and deep learning alternatives, combined with inherent interpretability through feature importance and SHAP analysis, indicates that tree ensembles are a competitive methodological choice for this type of tabular audio feature prediction task. The comparison also suggests that model complexity is warranted in this setting—simpler baseline models appear unable to capture the nonlinear relationships and feature interactions governing music popularity, while more complex deep learning models do not provide sufficient performance gains to justify their computational costs and reduced interpretability for tabular audio feature data. These observations are specific to the present dataset and protocol and may not generalize to all music popularity prediction contexts.
3.4. Feature Importance
Understanding the relative contribution of audio features to popularity prediction provides critical insights for both model interpretability and music industry applications. Feature importance analysis reveals which audio characteristics and metadata serve as the most discriminative indicators for popularity assessment across different machine learning algorithms. The most striking finding across all models is the dominant importance of days_since_latest_release, which consistently ranks as the most influential predictor of popularity (
Figure 6). This temporal feature captures the recency of track releases relative to the dataset collection timepoint, reflecting the strong bias toward contemporary music in streaming platforms where recent releases benefit from algorithmic promotion, playlist placements, and current listener preferences. The overwhelming importance of this feature (importance scores 2–3× higher than any other feature) suggests that popularity is as much a temporal phenomenon as an acoustic one, with older tracks facing systematic disadvantages in streaming engagement regardless of their sonic qualities. Among acoustic features, loudness emerges as the second most important predictor across all tree models. This finding aligns with the “loudness war” in music production, where commercially oriented tracks are mastered to maximize perceived intensity. The consistent importance of loudness suggests that streaming platform listeners preferentially engage with tracks that sound subjectively louder, potentially due to attention-grabbing effects when played alongside quieter tracks in playlists or radio modes. The explicit content flag demonstrates notable importance across models, particularly in Random Forest and Decision Tree algorithms. This binary indicator captures linguistic content restrictions and may serve as a proxy for genre preferences (explicit content is more common in hip-hop, rap, and certain electronic subgenres) or demographic targeting (younger audiences may preferentially engage with explicit content). The moderate-to-high importance of this feature suggests that content classification plays a meaningful role in popularity beyond pure acoustic characteristics.
Energy, duration, and acousticness show moderate importance across models, indicating secondary but meaningful contributions to popularity prediction. Energy’s importance likely reflects listener preferences for high-intensity tracks in workout, party, and commuting contexts. Duration’s relevance suggests optimal track length effects, where excessively short or long tracks may be disadvantaged in algorithmic recommendations or listener completion rates. Acousticness importance may reflect genre effects, as highly acoustic tracks (folk, singer-songwriter, and classical) typically occupy different popularity distributions than electronic or heavily produced music. Traditional musical structure features including key, mode, time_signature, danceability, speechiness, tempo, valence, instrumentalness, and liveness consistently show low importance across all models. The minimal predictive value of these features suggests that popularity is largely independent of fundamental musical characteristics such as tonality, rhythmic structure, or emotional valence. This finding challenges simplistic notions that certain keys (e.g., C major) or tempos (e.g., 120 BPM) inherently produce more popular music, instead indicating that popularity arises from complex interactions between production quality, marketing, artist reputation, and cultural context that are not captured by these acoustic descriptors alone. The algorithmic differences in feature importance rankings reveal complementary perspectives on parameter significance. Tree ensemble methods (Random Forest, Extra Trees, XGBoost) show highly consistent rankings, emphasizing days_since_latest_release and loudness. Decision Tree demonstrates slightly more distributed importance across features, potentially reflecting its single-tree structure’s limited ability to capture feature interactions compared to ensemble methods.
Within the scope of this dataset and experimental context, these feature importance results are suggestive empirical patterns that may provide directional reference for music information retrieval researchers and practitioners. The dominance of temporal features suggests that recommendation algorithms may benefit from accounting for release recency when comparing tracks across different time periods, though this observation is specific to the current Spotify snapshot and may not generalize to other platforms or time frames. The secondary importance of loudness and explicit content suggests that production decisions and content classification have measurable impacts on streaming success within this dataset. The minimal importance of traditional musical characteristics implies that popularity prediction requires features beyond conventional music theory descriptors, potentially including artist reputation metrics, playlist inclusion data, social media engagement, and marketing expenditure.
3.5. Shapley Additive Explanations (SHAP)
To provide deeper insights into the decision-making mechanisms of the best-performing Random Forest model, Shapley Additive Explanations (SHAP) analysis was conducted to quantify individual feature contributions to specific predictions and reveal the direction and magnitude of feature effects on popularity. SHAP analysis offers a unified framework for model interpretability by decomposing each prediction into additive contributions from individual features, providing both global feature importance and local explanations of individual predictions.
Figure 7a presents the global SHAP feature importance analysis based on mean absolute SHAP values, confirming the dominance of days_since_latest_release with importance exceeding 5.0, substantially higher than any other feature. Loudness ranks second with importance around 1.0, followed by explicit content, energy, and duration with moderate importance values between 0.5 and 1.0. The remaining features (acousticness, valence, instrumentalness, tempo, mode, danceability, speechiness, liveness, key, time_signature) show minimal importance below 0.5, consistent with the model-specific feature importance analysis.
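The global ranking described above aggregates per-track SHAP values by their mean absolute value. A minimal sketch of that aggregation is given below; the SHAP matrix is a hypothetical three-track example and the function name is an assumption, not part of the shap library.

```python
def global_importance(shap_matrix, feature_names):
    """Mean absolute SHAP value per feature, sorted from most to least important."""
    n_rows = len(shap_matrix)
    scores = {
        name: sum(abs(row[k]) for row in shap_matrix) / n_rows
        for k, name in enumerate(feature_names)
    }
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical SHAP values: one row per track, one column per feature.
names = ["days_since_latest_release", "loudness", "explicit"]
toy_shap = [
    [6.0, 1.0, -0.5],
    [-4.0, -1.5, 0.0],
    [5.0, 0.5, 1.0],
]
importance = global_importance(toy_shap, names)
print(importance)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions still ranks as important.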
Figure 7b displays the SHAP summary plot revealing the distribution of feature impacts on model predictions. Each point represents a single track, with position along the x-axis indicating the SHAP value (impact on predicted popularity) and color representing feature value magnitude. The days_since_latest_release plot shows a clear pattern where low values (blue, representing recent releases) concentrate on the positive SHAP side, indicating that recent releases strongly increase predicted popularity, while high values (red, older releases) concentrate on the negative side, strongly decreasing predicted popularity. This asymmetric distribution confirms the substantial temporal bias in popularity prediction. The loudness summary plot demonstrates that higher loudness values (red points) generally produce positive SHAP contributions, while lower loudness (blue points) tend toward negative contributions. This relationship validates the importance of perceived intensity in popularity achievement. Interestingly, the SHAP value distribution shows some overlap, indicating that loudness effects are modulated by interactions with other features rather than operating as a simple linear predictor. The explicit content feature shows a binary distribution (0 or 1 values), with explicit tracks (red points concentrated at feature value = 1) predominantly exhibiting positive SHAP contributions, confirming the popularity advantage for explicit content observed in the descriptive statistics. Energy displays a dispersed pattern where higher energy values tend toward positive SHAP contributions, though with substantial variability suggesting complex interactions with other acoustic features. Duration shows an inverted U-shaped pattern, where both very short and very long tracks receive negative SHAP contributions, while moderate durations receive positive contributions. 
This non-linear relationship suggests optimal track length effects that may reflect listener attention spans, algorithmic playlist preferences, or genre-specific conventions.
Figure 8 presents SHAP dependence plots for the top five most important features, revealing the functional relationships between feature values and their contributions to popularity predictions. The days_since_latest_release plot (
Figure 8a) shows a clear non-linear declining pattern, with SHAP values decreasing sharply from approximately +15 for newly released tracks to −10 for tracks released over 10,000 days ago. This dramatic decline quantifies the temporal decay in popularity potential, where tracks lose on average approximately 2.5 SHAP points (equivalent to 2.5 popularity score points) per 1000-day increase in age. The relationship exhibits steepest decline in the 0–5000 day range, followed by gradual flattening for older releases, suggesting that temporal disadvantage accumulates most rapidly in roughly the first 14 years after release. The loudness dependence plot (
Figure 8b) demonstrates a U-shaped relationship with inflection point around −10 dB. Tracks with loudness below −30 dB receive negative SHAP contributions around −3, while tracks approaching 0 dB receive positive contributions up to +4. This non-linear pattern indicates that loudness effects are non-monotonic, with optimal loudness in the −5 to 0 dB range, while both very quiet (<−20 dB) and moderately quiet (−20 to −10 dB) tracks suffer popularity penalties. The explicit content dependence plot (
Figure 8c) shows a simple binary pattern where explicit = 1 receives positive SHAP contributions (median ≈ +3) while explicit = 0 receives near-zero contributions. This clean separation confirms that explicit content classification provides a consistent popularity boost across the dataset, though with substantial variance (SHAP values ranging from −5 to +10), indicating that explicit content effects are modulated by interactions with other features such as genre, artist, or release year. The instrumentalness and duration dependence plots (
Figure 8d,e) reveal more complex patterns. Instrumentalness (
Figure 8d) displays a predominantly negative relationship, where most tracks cluster near zero instrumentalness with near-zero SHAP contributions, while tracks with high instrumentalness values (approaching 1.0) exhibit strongly negative SHAP contributions (as low as −15), indicating that highly instrumental tracks face systematic popularity penalties on streaming platforms dominated by vocal-centric content. Duration (
Figure 8e) shows a declining pattern where tracks with moderate durations (below 200,000 ms) maintain near-zero SHAP contributions, while very long tracks (exceeding 500,000 ms) receive substantial negative SHAP contributions, suggesting that excessively long tracks are disadvantaged in streaming popularity due to listener attention constraints and algorithmic preferences.
Figure 9 presents waterfall plots for two representative tracks illustrating individual prediction decomposition.
Figure 9a shows a high-popularity prediction (f(x) = 49.894) driven primarily by a strong positive contribution from days_since_latest_release (+11.08), indicating a recently released track. Additional positive contributions from loudness (feature value −4.51 dB, SHAP value +1.21), instrumentalness (+1.2), and acousticness (+0.47) further boost the prediction, while negative contributions from duration_ms (−2.23), explicit (−0.6), and energy (−0.45) partially offset the temporal advantage. The base value (E[f(x)] = 39.404) represents the model’s average prediction across all samples, with individual feature contributions summing to the final prediction.
Figure 9b illustrates an above-medium prediction (f(x) = 42.208) with more balanced contributions across features. Loudness (feature value −5.979 dB, SHAP value +1.23) and days_since_latest_release (+1.07) provide moderate positive effects, along with duration_ms (+0.46) and instrumentalness (+0.35), while explicit content (−0.39), danceability (−0.2), and valence (−0.18) exert small negative effects. This example demonstrates how the model integrates multiple feature effects to arrive at nuanced predictions, with temporal recency providing important baseline signal that is modulated by acoustic characteristics, production features, and content attributes.
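The additive structure of the two waterfall examples can be checked with simple arithmetic. The sketch below adds the contributions quoted in the text to the base value and reports what is left over; that remainder is interpreted here, as an assumption, as the combined contribution of features not listed in the text plus rounding in the displayed values.

```python
BASE_VALUE = 39.404  # E[f(x)] reported for Figure 9

tracks = {
    "Figure 9a": {
        "prediction": 49.894,
        "listed": {
            "days_since_latest_release": 11.08, "loudness": 1.21,
            "instrumentalness": 1.2, "acousticness": 0.47,
            "duration_ms": -2.23, "explicit": -0.6, "energy": -0.45,
        },
    },
    "Figure 9b": {
        "prediction": 42.208,
        "listed": {
            "loudness": 1.23, "days_since_latest_release": 1.07,
            "duration_ms": 0.46, "instrumentalness": 0.35,
            "explicit": -0.39, "danceability": -0.2, "valence": -0.18,
        },
    },
}


def unlisted_contribution(track, base_value=BASE_VALUE):
    """Prediction minus base value minus the SHAP values listed in the text."""
    return track["prediction"] - base_value - sum(track["listed"].values())


for name, track in tracks.items():
    listed_total = sum(track["listed"].values())
    print(name, round(listed_total, 3), round(unlisted_contribution(track), 3))
```

In both cases the listed features account for most of the gap between the base value and the prediction, which is what the efficiency property of SHAP values implies once all features are included.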
The SHAP analysis provides several empirical observations that may offer directional reference for music information retrieval researchers and recommendation system developers. The pronounced dominance of temporal features indicates that any fair comparison of track popularity across different release periods requires explicit temporal normalization or cohort-specific analysis. The non-linear loudness relationship suggests that there may exist production patterns associated with higher popularity within this dataset, with diminishing returns beyond certain loudness thresholds. The complex interaction patterns revealed in dependence plots indicate that popularity cannot be optimized through single-feature manipulation but requires holistic consideration of production, content, and timing decisions. From a model interpretability perspective, the SHAP analysis confirms that Random Forest’s strong performance stems from its ability to capture complex non-linear relationships and feature interactions that are not apparent in linear models. The varying importance patterns and functional relationships provide interpretable signals within the scope of the present study while highlighting its limitations in capturing factors beyond acoustic and temporal features.
4. Conclusions
This study systematically evaluated six tree-based models for music popularity prediction on large-scale Spotify data, with Random Forest achieving the best performance ($R^2$ = 0.6658, RMSE = 6.79, and MAE = 5.10), explaining 66.58% of popularity variance. Within our experimental setup, tree ensembles demonstrated substantially stronger performance than baseline logistic regression (60.9% relative reduction in unexplained variance) and the deep learning architectures evaluated here (30.4–35.4% higher $R^2$ than Transformer/LSTM/GRU), indicating their competitive suitability for this type of tabular music feature prediction task under the current dataset and protocol. Interpretability analysis through SHAP and feature importance revealed that temporal recency (days_since_latest_release) dominates popularity prediction with contributions 2–3× higher than any acoustic feature, indicating that streaming popularity is primarily governed by temporal exposure dynamics and algorithmic promotion favoring recent releases. Among acoustic attributes, loudness exhibited the strongest predictive power, with a non-linear U-shaped relationship peaking at −5 to 0 dB, while traditional music-theoretic descriptors (key, mode, and time signature) provided minimal predictive value. Within this dataset and experimental context, these findings suggest that tree ensemble methods are well-suited for interpretable music popularity modeling and indicate a prominent role of temporal and production factors over intrinsic musical characteristics.
Several limitations warrant consideration. The dominance of temporal recency creates confounding between release age and intrinsic musical characteristics, limiting isolation of stable acoustic–popularity relationships across different release periods. The cross-sectional snapshot (16 April 2021) may not generalize to future periods as platform algorithms and listener preferences evolve. The dataset excludes major real-world success drivers including marketing investment, playlist placement, artist reputation, and social media engagement, and the 33–44% of variance left unexplained likely reflects these unmeasured factors. The evaluation protocol using stratified random sampling does not control for artist-level dependencies or temporal evolution, potentially yielding optimistic estimates if models exploit artist-specific patterns present in both training and test sets. Because the dataset is drawn from a large pool of artists, it is highly likely that tracks from the same artist appear in both the training and test partitions under random splitting. Artist IDs were removed as input features before model training, so the model cannot directly memorize artist identity, but correlated acoustic patterns within an artist’s catalogue may still be learned, providing an indirect source of train–test leakage. An artist-grouped split, which would ensure no artist appears in both partitions, was not implemented in this study and represents an important direction for future work. Finally, all findings derive from a single platform source (Spotify) and a cross-sectional sampling frame; their generalizability to other streaming platforms or data collection designs has not been validated.
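To make the artist-grouped split described above concrete, the sketch below assigns whole artists, rather than individual tracks, to the training or test partition. It illustrates the idea only and was not part of this study's protocol: the function name `artist_grouped_split`, its parameters, and the toy artist labels are hypothetical, and artist IDs are used solely to form the groups, not as model inputs.

```python
import random


def artist_grouped_split(artist_ids, test_fraction=0.2, seed=0):
    """Split row indices so that no artist appears in both partitions.

    `artist_ids` holds one (hypothetical) artist label per track. Note that
    `test_fraction` is a fraction of artists, not of tracks.
    """
    # Shuffle the unique artists reproducibly.
    artists = sorted(set(artist_ids))
    rng = random.Random(seed)
    rng.shuffle(artists)

    # Hold out a fixed share of artists (at least one) for testing.
    n_test = max(1, round(len(artists) * test_fraction))
    test_artists = set(artists[:n_test])

    # Every track follows its artist into the corresponding partition.
    train_idx = [i for i, a in enumerate(artist_ids) if a not in test_artists]
    test_idx = [i for i, a in enumerate(artist_ids) if a in test_artists]
    return train_idx, test_idx


# Toy example: eight tracks by five artists.
artist_ids = ["a1", "a1", "a2", "a3", "a3", "a3", "a4", "a5"]
train_idx, test_idx = artist_grouped_split(artist_ids, test_fraction=0.4, seed=42)

train_artists = {artist_ids[i] for i in train_idx}
test_artists = {artist_ids[i] for i in test_idx}
print(sorted(train_artists), sorted(test_artists))
```

Because artists differ in catalogue size, the resulting share of tracks in the test set can deviate from `test_fraction`. Libraries such as scikit-learn provide group-aware splitters (for example, `GroupShuffleSplit` and `GroupKFold`) that implement the same constraint.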
Future research should address these limitations through controlled experiments isolating temporal effects via ablation studies or age-stratified evaluation, longitudinal studies assessing temporal generalization across multiple time points, and multimodal frameworks incorporating artist attributes, social media dynamics, and playlist context. Advanced evaluation protocols including artist-grouped splits and time-based validation would provide deployment-realistic performance estimates. Temporal modeling approaches such as time-series methods or survival analysis could better capture dynamic popularity evolution, while causal inference frameworks would enable principled assessment of feature effects and mitigation of algorithmic biases in recommendation systems.