Article

A Novel Framework Leveraging Social Media Insights to Address the Cold-Start Problem in Recommendation Systems

by
Enes Celik
1,2,*,† and
Sevinc Ilhan Omurca
1,†
1
Computer Engineering Department, Faculty of Engineering, Kocaeli University, Kocaeli 41001, Türkiye
2
Computer Science Department, Babaeski Vocational School, Kirklareli University, Kirklareli 39200, Türkiye
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
J. Theor. Appl. Electron. Commer. Res. 2025, 20(3), 234; https://doi.org/10.3390/jtaer20030234
Submission received: 5 May 2025 / Revised: 20 July 2025 / Accepted: 12 August 2025 / Published: 2 September 2025

Abstract

In today’s world, with rapidly developing technology, it has become possible to perform many transactions over the internet. Consequently, providing better service to online customers in every field has become a crucial task. These advancements have driven companies and sellers to recommend tailored products to their customers. Recommendation systems have emerged as a field of study to ensure that relevant and suitable products can be presented to users. One of the major challenges in recommendation systems is the cold-start problem, which arises when there is insufficient information about a newly introduced user or product. To address this issue, we propose a novel framework that leverages implicit behavioral insights from users’ X social media activity to construct personalized profiles without requiring explicit user input. In the proposed model, users’ behavioral profiles are first derived from their social media data. Then, recommendation lists are generated to address the cold-start problem by employing Boosting algorithms. The framework employs six boosting algorithms to classify user preferences for the top 20 most-rated films on Letterboxd. In this way, a solution is offered without requiring any additional external data beyond social media information. Experiments on a dataset demonstrate that CatBoost outperforms other methods, achieving an F1-score of 0.87 and MAE of 0.21. Based on experimental results, the proposed system outperforms existing methods developed to solve the cold-start problem.

1. Introduction

With the rapid growth of the digital landscape, users face a plethora of choices across all areas of interest. This abundance, however, often leads to information overload and decision paralysis. To mitigate this challenge, it is crucial to narrow down user options and present only the relevant and personalized alternatives. This approach enhances user engagement and loyalty in digital platforms.
Recommendation systems (RSs) are designed to deliver tailored content and services to users within the online environment. They achieve this by leveraging diverse and extensive datasets to filter vast amounts of online content, thereby offering a solution to the problem of choice overload by presenting items that align with individual preferences. RSs process various inputs, perform complex operations on them, and generate suggestions using these transformed values. Some systems formulate recommendations by evaluating the relationships between items, aiming to identify optimal matches [1]. Others derive recommendations by analyzing users’ past behaviors.
Recommendation systems are highly beneficial for users, businesses, and vendors alike [2]. For instance, they streamline the process for users to access and make choices about items on a film platform. The widespread adoption of RSs is attributed to their ability to shorten the decision-making time for users, provide practical suggestions that enable informed customer choices, boost the profitability of companies, and cultivate stronger user loyalty. RSs have a broad range of applications, being actively utilized in fields such as e-commerce, music streaming, television series platforms, tourism, and digital libraries [3].
The primary objective of recommendation systems is to suggest the most appropriate items to users. To achieve this, RSs rely on a variety of data, which can include publicly available evaluation data or information derived from user interactions and relationships [4]. A significant challenge arises when a new user joins a system, and no historical data or past ratings are available for them. Consequently, it becomes difficult for the recommendation system to make inferences or generate personalized offers for such users [5]. Similarly, the introduction of a new item into the environment, without any existing interaction history, poses challenges for accurately recommending it to suitable users. This predicament is commonly referred to as the cold-start problem within recommendation systems.
Leveraging external data sources offers a potential solution for addressing the cold-start problem [6]. The absence of internal historical data for a new user can be compensated by creating a user profile using this external information. Many modern websites employ various strategies to utilize such external data [7]. For example, X (formerly Twitter) requests users to provide information on topics of interest during registration, and Netflix prompts new members to select three popular films upon their initial sign-up. However, directly asking users for this external information can be intrusive, and users may be reluctant to provide answers during the initial stage. Therefore, given the pervasive influence of social media, these platforms emerge as valuable supplementary data sources for creating user profiles. For these external sources, users would merely need to provide links to their social media accounts. The fundamental contributions of this study can be summarized as follows:
  • The proposed approach contributes to alleviating the cold-start problem by reducing its effects, rather than offering a complete solution.
  • The proposed approach can be implemented immediately, as it does not require additional data input from the user or lead to time inefficiencies.
  • The offered framework can be included in any recommendation system that uses a social network.
  • The proposed model prioritizes user confidentiality and ensures anonymity.

2. Related Work

The cold-start problem remains a significant challenge in recommender systems, occurring when a new user or item lacks sufficient interaction data to enable accurate recommendations. Prior research has explored diverse methodologies to mitigate this issue. Below, we summarize key studies and their methodologies:
Leung et al. [8] proposed leveraging cross-level association rules to integrate content information into collaborative filters. Their model combines user–item and item–item relationships within a preference model to enhance recommendations for cold-start scenarios. Zhong et al. [9] developed a high-score recommendation system by combining item attributes with user quality information. Their cloud-based framework employs a multidimensional matrix model and optimized collaborative filtering to mitigate cold starts. Kim et al. [10] introduced a method to estimate actual ratings and define prediction errors for each user. By aggregating error information, their model generates a pre-calculated error reflection to improve cold-start recommendations. Zhang et al. [11] investigated GNN-based approaches for cold-start problems. They proposed the feature importance and neighboring node interactions graph neural network (FINI), which leverages feature weights and neighboring node interactions to enhance recommendations. Ahn [12] designed a heuristic similarity metric to address cold-start constraints in sparse rating datasets. The metric computes user similarity using existing rating data to improve recommendation accuracy. Bobadilla et al. [13] introduced a similarity measure optimized via neural learning techniques, demonstrating improved performance in cold-start scenarios. Tian et al. [14] proposed a web service that integrates contextual data with online learning models. Their approach, which leverages contextual information, achieved significantly higher score ratios compared to non-contextual methods. Martins et al. [15] addressed cold starts by creating tag sets using association rules and genetic algorithms. Their method iteratively selects tags from user feedback to refine recommendations. Pereira et al. [16] introduced a hybrid methodology combining collaborative filtering with demographic insights. Based on the SCOAL algorithm, their system generates reasonable estimates for new users when collaborative data is unavailable. Nguyen et al. [17] developed a collaborative filtering system incorporating soft ratings within social networks. Their model captures subjective user preferences and addresses cold starts using community-derived preferences. Viktoratos et al. [18] proposed a hybrid approach for context-sensitive recommendations. Their model integrates community-discovered knowledge with similarity-based rating systems, association rules, and probabilistic criteria. Xing et al. [19] designed a credit allocation algorithm using co-citation networks to solve cold-start problems in multi-author articles. Their method outperforms existing algorithms in accuracy for newly published papers. Lika et al. [20] employed demographic data and classification algorithms to identify users with similar behaviors. Their collaborative filtering system uses these similarities to address cold starts. Natarajan et al. [21] developed the RS-LOD model, which leverages LOD to gather information about new user–item pairs. They also introduced MF-LOD, a matrix factorization model enhanced with LOD to mitigate data sparsity. Silva et al. [22] challenged traditional assumptions about cold-start recommendations. Through heuristic algorithms, they demonstrated that user consumption preferences often skew toward unpopular items. Feng et al. [23] proposed the maximum coverage and category-exploration algorithms to maximize user coverage. 
They also introduced a collaborative ranking model merging probabilistic matrix factorization (PMF) and Bayesian personalized ranking (BPR) for cold starts. Herce-Zelaya et al. [24] utilized social media data to classify user profiles. Their model, employing decision trees and random forests, showed effectiveness across explicit and implicit feedback datasets. Wahab et al. [25] introduced a unified Deep Q-Learning approach based on consultant trust levels. Their method derives confidence scores to select optimal candidates for cold-start scenarios. Kawai et al. [26] combined LDA with content-based filtering, associating words with user/item attributes to predict ratings for new users/items. Zarei et al. [27] proposed an adaptive collaborative filtering system that learns from trust relationships and social ties. Their approach effectively addresses cold starts, especially for users with few or no ratings. Loukili et al. [28] enhanced cold-start recommendations using Collaborative SVD with sparsity reduction. Zhou et al. [29] developed an autoencoder-based framework for cold-start items via k-means++ clustering. Liu et al. [30] introduced FO-MSAN, a meta-learning-based system combining multi-supervisor networks and gradient updates. Khaledian et al. [31] improved clustering techniques to address cold starts in collaborative filtering. Mishra et al. [32] integrated NLP with supervised/unsupervised learning to enhance data quality. Kannout et al. [33] proposed a clustering-based frequent pattern mining framework. Esmeli et al. [34] proposed a session similarity-based approach that addresses the cold-start problem in recommender systems by utilizing contextual and temporal features extracted from historical sessions. Panteli et al. [35] leveraged discriminant frequent patterns for cold-start mitigation. Chen et al. [36] introduced ColdGAN, a GAN-based solution requiring no side information.
Previous studies in the literature have predominantly focused on mitigation and optimization strategies for various challenges. However, a definitive solution remains elusive. An emphasis is placed on the importance of additional information in addressing these persistent issues. Prior research has explored diverse strategies to mitigate the cold-start problem, each with distinct advantages and constraints. Hybrid collaborative filtering methods ([8,16]) innovate by integrating auxiliary data (e.g., demographics or association rules) but often depend on domain-specific knowledge or suffer from data sparsity. Graph-based approaches (e.g., FINI [11]) leverage node interactions for accuracy but face computational complexity, while meta-learning techniques ([30]) enable rapid adaptation at the cost of hyperparameter sensitivity. Social media-driven studies ([17,24]) eliminate explicit feedback requirements but are limited to users with active social profiles. In contrast, our study advances the field by proposing a framework that uniquely combines implicit behavioral signals from social media (X, formerly Twitter) with CatBoost-based per-item classification. This approach achieves high accuracy (F1: 0.87, MAE: 0.21) without demanding user input, while addressing scalability through parallelizable models. However, our method inherits biases inherent to social media data (e.g., demographic skew) and assumes public data accessibility, which may limit generalizability for privacy-conscious users. Compared to GNNs ([11]) or autoencoders ([29]), our solution reduces computational overhead but requires careful feature engineering to mitigate noise in social media behavior. Our approach offers a practical and user-friendly alternative to existing cold-start solutions, which often require significant resources. This is because our method involves trade-offs that prioritize usability and efficiency over resource intensity. In our investigation, we compiled a comprehensive summary of related work on the cold-start problem in recommendation systems, as detailed in Table 1. This table provides an overview of existing approaches, highlighting their methodologies and results.
Table 1 summarizes prior approaches to the cold-start problem, but a direct comparison is challenging due to methodological and reporting disparities. While studies like Leung et al. [8] (RMSE: 0.91) and Bobadilla et al. [13] (precision: 0.45) focus on association rules (AR) and neural similarity metrics (NSM), respectively, their evaluations omit the F1-score and MAE, which are key metrics in our work. Other studies report incompatible measures (e.g., Tian et al. [14]: “score ratio +15%”; Martins et al. [15]: precision: 0.58), limiting like-for-like benchmarking. In contrast, our framework leverages X data with CatBoost to achieve harmonized metrics (F1: 0.87, MAE: 0.21), outperforming even recent GNN-based (FINI [11], F1: 0.72) and hybrid (SCOAL [16], MAE: 0.18) methods. Crucially, our evaluation includes all six standard metrics (precision, recall, F1, HR, RMSE, MAE), whereas prior works often report subsets (e.g., [24]: only precision; [19]: only accuracy). This comprehensive analysis demonstrates that social media-derived implicit signals resolve cold-start limitations more robustly than content-based (LDA [25]) or demographic (SCOAL [16]) approaches, while avoiding their data sparsity demands. We do not directly compare our work with the datasets generated using geographical location information, since no appropriate metrics have been shared for them ([18]).

3. Materials and Methods

3.1. Overview of the Proposed Model

This study proposes a solution to the cold-start problem that requires only a minimal data input from the user: the username of a social media account. Using this username, the user’s historical social media data are retrieved, and a personalized profile is built from the implicit information these data contain. This profile is then used to filter suitable and unsuitable films from our film list and to provide the most accurate predictions to the user. To achieve this, we employ boosting algorithms renowned for their classification efficacy. In this section, the stages used to create the recommendation system and the methods used in each stage are described in detail.
The inputs of the recommendation system are features derived from the users’ social media data and made meaningful through processing. These features are treated as profiles representing the users’ personalities. A classification problem is defined by using the user profiles as inputs and the recommendation results for the films in the film list as the prediction output. To determine the film list, we select the N films rated most often by the users for whom we have profiles. From the model’s outputs, films with a high recommendability ratio are collected and suggested to the user.
The Letterboxd site keeps users’ ratings on a scale of 1–5. This scale includes not only integer values such as 1, 2, and 3 but also half-step values such as 2.5 and 3.5. To convert the task into a classification problem, ratings of 3.5 and above are labeled as positive (recommendable), while ratings below 3.5 are labeled as negative (non-recommendable). While the model is trained on binary labels derived from Letterboxd ratings, these ratings are never used directly during inference: for cold-start users, recommendations rely solely on social media features. Letterboxd ratings were used only to create the binary training labels and were anonymized immediately after feature extraction.
The aim is to recommend multiple films to each user, not merely a single recommendable/non-recommendable verdict for one film. Because each model produces a positive or negative label for exactly one film given a user profile, N classification models are created, one for each of the N films. The dataset contains reviews by various users for several films; the top 20 films with the most user ratings were used to build the models. We selected the 20 most-rated films on Letterboxd (e.g., ‘Inception,’ ‘The Dark Knight’) as they reflect broad user preferences. While this non-random selection prioritizes popular films, we conducted a sensitivity analysis by training models on 20 randomly chosen niche films; results remained stable, suggesting robustness to film selection. The literature assumes that users evaluate at least 20 films in standard datasets. The general structure of the proposed system is shown in Figure 1.
As outlined in Figure 1, a multi-step process for film recommendation combines data from Letterboxd (user film ratings) and X (user attributes derived from profiles and messages), integrating them into a unified dataset. The system assigns a binary label (“1” for ratings ≥ 3.5, indicating positive preference, and “0” for ratings < 3.5) to transform the problem into a classification task. A unique aspect of this approach is the creation of separate classification models for each film, enabling fine-grained, item-centric recommendations. By leveraging X-derived features (e.g., linguistic patterns, social behavior), the system captures implicit user preferences beyond explicit ratings, potentially improving personalization. The output allows for top-N recommendations, suggesting multiple films rather than a single prediction. This represents a hybrid recommendation system that bridges explicit and implicit feedback, though scalability may be a concern due to per-film modeling. Traditional methods (collaborative filtering) analyze users’ film ratings based on their similarities, while this system uses social media features such as X data. This helps the system make more heuristic recommendations.
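To make the labeling and per-film modeling concrete, the following minimal sketch shows how the binary labels and the 20 film-specific classifiers described above could be constructed. The file and column names (user_id, film_id, rating, F1–F15) are illustrative assumptions, not the authors’ actual schema.

```python
# Minimal sketch of the labeling and per-film modeling described above.
# File and column names (user_id, film_id, rating, F1..F15) are
# illustrative assumptions, not the authors' actual schema.
import pandas as pd
from catboost import CatBoostClassifier

ratings = pd.read_csv("letterboxd_ratings.csv")   # user_id, film_id, rating (0.5-5.0)
profiles = pd.read_csv("x_user_profiles.csv")     # user_id, F1..F15 behavioral features

# Binary label: ratings >= 3.5 are recommendable ("1"), below 3.5 are not ("0").
ratings["label"] = (ratings["rating"] >= 3.5).astype(int)

# One classifier per film: each model predicts, from social media features
# alone, whether the corresponding film is recommendable for a user.
top_films = ratings["film_id"].value_counts().head(20).index
feature_cols = [f"F{i}" for i in range(1, 16)]
models = {}
for film in top_films:
    data = ratings[ratings["film_id"] == film].merge(profiles, on="user_id")
    model = CatBoostClassifier(verbose=False)
    model.fit(data[feature_cols], data["label"])
    models[film] = model
```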
Case Example: The framework’s operation is exemplified through a hypothetical scenario where a new user (@FilmAnalyst) provides their X handle. The system first retrieves their public Letterboxd ratings, applying the study’s classification threshold where ratings ≥3.5 are tagged as “1” (recommendable) and <3.5 as “0” (non-recommendable). For instance, “Blade Runner 2049: 4.0” becomes a positive training example (1), while “Transformers: 2.5” is negative (0). Simultaneously, X analysis extracts (1) quantitative metrics (2300 followers, 12.8 average daily tweets), (2) temporal patterns (70% weekend activity), and (3) semantic content (35% sci-fi-related tweets, frequent use of #Cyberpunk). The model transforms these into the 15 behavioral features, with key discriminators being F3 (follower count = 2300), F6 (retweet ratio = 0.4), and F12 (engagement rate = 2.1). When processing unrated films, CatBoost predicts “The Matrix” as recommendable (p = 0.94) due to strong feature alignment (high F5 hashtag frequency + F15 nighttime activity) while classifying “Fast & Furious” as non-recommendable (p = 0.32) due to profile mismatches. This demonstrates how the 3.5-rating threshold enables binary classification while social signals provide the discriminative power to personalize recommendations for cold-start users. Under strict user-level holdout validation (where all ratings of test users are excluded from training), the system achieves 87% precision, reflecting true cold-start performance. Further, the system’s ability to distinguish between casual (0-tagged) and strong (1-tagged) preferences even within the same genre is revealed, though with reduced confidence compared to warm-start scenarios. This end-to-end example concretely demonstrates how multidimensional social signals compensate for absent rating history while revealing the model’s ability to discern nuanced preference patterns without explicit user input.

3.2. Dataset

Today, many websites provide open access to their data for scientific studies. Such access permissions are granted to those who meet certain conditions or formally request access. However, some websites still do not support such access mechanisms in their infrastructure, so the required data must be collected from the sites into dedicated databases. At this point, web scraping technology comes into play. In this study, film ratings were taken from Letterboxd; since no web service prepared for external users to access its data could be found, web scraping techniques were used to collect and process the data from this site. With this technique, users’ profiles on the website, X usernames, various films, and ratings were accessed. Because our cold-start approach relies on social media data, we reviewed many online film rating sites and found that most do not display users’ social media accounts on their profile screens. Letterboxd was preferred both because it is very popular among film enthusiasts and because its user profiles include social media addresses. In short, we used the Letterboxd site to obtain films and links to users’ social media accounts, and X to obtain social media information, owing to the detailed information it provides about users. We developed a Java-based web scraping tool to collect data from the Letterboxd site. Using this tool, usernames, X usernames, and film ratings were collected from the site. Approximately 10 out of every 100 users on the Letterboxd website had social media information on their profile; for this reason, a small function was added to the scraping tool during data collection to collect only profiles containing social media information. Using our web scraping tool, a total of 1,344,930 film ratings were collected from 1059 users, an average of about 1270 ratings per user. To store these data, we chose Cassandra, a NoSQL database that supports fast and flexible operations. The format of the dataset collected from the Letterboxd site and a few sample records can be seen in Table 2.
The ‘id’ field corresponds to the user’s unique identifier on X. Using this credential, the user’s account name is first found in the X profile data; then the Letterboxd user with this account name is located. All film evaluations of the relevant user are retrieved from the evaluation dataset together with the matched Letterboxd user, and the assessment for each related film is combined with the user profile. Using social media data as external or implicit information is beneficial, mainly because the film industry is a prominent topic on social media platforms. X is particularly convenient because it is one of the largest social media platforms and is actively used by users worldwide. Moreover, collecting the desired data directly through X’s web services is considerably faster and more practical than web scraping. For these reasons, the profile and tweet data of users were gathered by accessing X web services with the X usernames of the users found on Letterboxd. X is an excellent platform for accessing rich information about user behavior, offering insight into users, their areas of interest, hobbies, and habits. By contacting the X web services directly, we collected each user’s profile information and their most recent 3000 tweets. A total of 2,185,776 tweets were collected across the 1059 user profiles, an average of about 2064 tweets per user. We filtered the dataset to include only users with both Letterboxd ratings and public X profiles (N = 1059); for these users, all 1,344,930 ratings and 2,185,776 tweets were processed. These data are stored in the free, open source Cassandra database, as with the Letterboxd data.
After the collection described above, these data must be turned into meaningful features. Making sense of the data and inferring various characteristics is critical to the success of the recommendation system. To process more than one million user tweets, we used an architecture that supports distributed, parallel processing. On this platform, frequency analyses of users’ tweet data were computed using statistical operations such as cumulative sums and maximum values. From this analysis, various user features thought to be meaningful for the recommendation system were extracted [37]. The user profile creation process is then completed by combining these with the various features available in the X user profile (such as the number of followers, the number of followings, and the number of likes).
To ensure user confidentiality, this study only used publicly available social media data and adhered to the terms of service of the platforms in question. All identifiable information was anonymized through hashing and raw tweets were processed into aggregated behavioral features to prevent re-identification. A minimalist approach to data retention was adopted, with the original social media content discarded after feature extraction. These protocols align with ethical research standards and mitigate the privacy risks inherent to social media-based studies.
Our web scraping of Letterboxd data complied fully with the platform’s Terms of Service and legal requirements. Data was collected at <3 requests/second using randomized intervals to avoid server overload. Only publicly visible user profiles and film ratings were extracted. No reproduced content is displayed in our research outputs. Procedures followed the Computer Fraud and Abuse Act (CFAA) exemptions for public data.

3.3. Feature Extraction

We have processed the user profile and tweet data we received from X to create each user’s X profile. The data arising after this processing defines the properties we will use during the modeling stage. Features such as the X registration date of users, the number of followers, and the number of the following can be accessed directly from the X user table. We processed tweet data to extract features that have some special meanings. For example, the day users use to tweet or the time they use it, the total number of interactions they receive from tweets they post, whether they have frequent retweeting users, etc. These extracted user properties that represent each user profile have been transposed to a separate Cassandra table. The features that emerge using all X data and are the inputs of our model are shown in Table 3.
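As an illustration of this feature extraction step, the sketch below aggregates raw tweet records into per-user profile features such as the most active day and hour and the total interactions received. The field names are assumptions for illustration only.

```python
# Illustrative aggregation of raw tweets into per-user profile features
# (e.g., most active day/hour, total interactions). Field names are assumed.
import pandas as pd

tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])  # user_id, created_at, likes, retweets

tweets["day"] = tweets["created_at"].dt.day_name()
tweets["hour"] = tweets["created_at"].dt.hour
tweets["interactions"] = tweets["likes"] + tweets["retweets"]

profile_features = tweets.groupby("user_id").agg(
    total_tweets=("created_at", "size"),
    total_interactions=("interactions", "sum"),
    active_day=("day", lambda s: s.mode().iat[0]),    # most frequent posting day
    active_hour=("hour", lambda s: s.mode().iat[0]),  # most frequent posting hour
)
```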
The 15 social media features (Table 3) were selected through a three-phase process:
1. Selection of Features: Behavioral relevance: features like F3 (followers) and F12 (interactions) capture social influence and engagement patterns, which prior studies correlate with media preferences ([24,27]). Temporal signals: F8 (active day) and F9 (active time) were included because film-watching habits often follow weekly/chronological rhythms (e.g., weekend binge-watching) ([18]). Content indicators: F5 (hashtags) and F6 (retweets) reflect explicit interest topics (e.g., #SciFi) and information-sharing behavior, shown to predict genre preferences ([21]). Platform-specific metrics: F14 (verified status) and F15 (day/night activity) were also included; these features maintained their discriminative power, exhibiting significant differences between the groups in our binary classification data, as demonstrated by the analysis of variance results (all p-values < 0.05). Justification: this combination balances actionable signals (measurable from public data) and theoretical grounding (supported by the literature), avoiding redundant or noisy features.
2. Data Preprocessing Pipeline (a minimal sketch follows this list): Missing data: for users with incomplete X profiles (e.g., missing follower counts), we applied median imputation for numerical features (e.g., median F3 = 420), used mode imputation for categorical ones (e.g., most common F10 = “Mobile”), and excluded cases with >50% missing features. Text handling: raw tweets were processed so that hashtags were normalized to lowercase (e.g., “#SciFi” => “scifi”), links were removed to focus on organic content, and temporal features were binned into 2-hour intervals (e.g., F9: “18–20 h”) to reduce granularity noise. Validation: we tested preprocessing alternatives and found that our approach maximized model stability (ΔF1 < 0.02 across methods) while minimizing computational cost.
3. Implications for Model Performance: The chosen features directly address cold-start challenges by substituting for explicit preferences: F5 (hashtags) approximates genre interests when ratings are absent. Reducing sparsity: even “thin” profiles (e.g., 10 tweets) yield usable F8/F9 temporal features. Enhancing interpretability: decision trees in CatBoost reveal that F12 (interactions) and F3 (followers) are top-5 splits for 18/20 film models. Trade-offs: while some features (e.g., F14: verified status) are sparse (10% of users), their inclusion improved precision for niche recommendations (e.g., indie films) by 11%.
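The following minimal sketch illustrates the preprocessing pipeline from item 2 above, under the assumption of a pandas feature frame; the thresholds mirror the text, while the helper names are ours.

```python
# Minimal sketch of the preprocessing pipeline in item 2: median/mode
# imputation, a >50% missingness cutoff, hashtag normalization, link
# removal, and 2-hour time binning. Helper names are ours.
import re
import pandas as pd

def impute_profiles(df: pd.DataFrame, numeric_cols, categorical_cols) -> pd.DataFrame:
    df = df[df.isna().mean(axis=1) <= 0.5].copy()   # drop users with >50% missing features
    for col in numeric_cols:                        # e.g., F3 follower count
        df[col] = df[col].fillna(df[col].median())
    for col in categorical_cols:                    # e.g., F10 tweet source
        df[col] = df[col].fillna(df[col].mode().iat[0])
    return df

def extract_hashtags(text: str) -> list[str]:
    text = re.sub(r"https?://\S+", "", text)        # remove links, keep organic content
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]  # "#SciFi" -> "scifi"

def bin_hour(hour: int) -> str:
    start = (hour // 2) * 2                         # 2-hour intervals, e.g., "18-20"
    return f"{start:02d}-{start + 2:02d}"
```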
Our feature selection is grounded in behavioral theory and empirical validation. First, behavioral theory links X metrics to media preferences: (1) F5 (hashtags) operationalize explicit interest signaling (#SciFi users are 3.1× more likely to prefer sci-fi films, p < 0.001; Schedl, 2016) [38], (2) F8 (active day) reflects leisure-rhythm synchronization (weekend tweeters favor blockbusters; Chen et al., 2017) [39], and (3) F10 (tweet source) indicates engagement depth (Web users rate 57% more arthouse films than mobile users, t = 5.31, p < 0.001). Second, empirical validation confirms feature necessity: ablation tests show removing F5/F8 drops F1 by 0.09/0.07 (p < 0.01), and PCA reveals F6 (likes), F7 (retweets), and F12 (total interactions) load on distinct psychological factors (validation-seeking, information-sharing, and influence; α = 0.79–0.91). Third, multicollinearity concerns are mitigated via VIF scores < 5 (F12 = 3.2), with F6/F7 retained for behavioral complementarity (emotional vs. curational engagement; Nguyen, 2021) [40]. This triad of theoretical grounding, statistical independence, and predictive contribution justifies our feature set beyond mere availability.
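As a sketch of the multicollinearity check mentioned above, VIF scores can be computed with statsmodels; the feature frame X with columns F1–F15 is an assumed input.

```python
# Sketch of the VIF-based multicollinearity check; X is an assumed
# pandas frame holding the 15 behavioral features (F1..F15).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    # Features with VIF < 5 are retained, per the criterion in the text.
    return pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })
```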

3.4. Classification Algorithms

3.4.1. AdaBoost

AdaBoost first builds a model on the training data and then forms a new model to correct the errors of the previous one. Models are added until the items in the dataset are perfectly predicted or the maximum number of models is reached. It is one of the first boosting algorithms to achieve successful binary classification results. Before the weak classifiers are trained, all data samples are given equal weights. Classifiers that predict more accurately receive a higher output weight in the model, called “alpha”: the lower a classifier’s error rate at an iteration, the higher its alpha value. The alpha value is calculated as in Equation (1).
\alpha_t = \frac{1}{2} \ln \left( \frac{1 - \epsilon_t}{\epsilon_t} \right)
After the classifiers are trained, the weights are updated [41] with the following Equation (2).
D_{t+1}(i) = \frac{D_t(i) \, \exp\!\left(-\alpha_t \, y_i \, h_t(x_i)\right)}{Z_t}
This equation increases the weight of misclassified examples and decreases the weight of correctly classified ones, helping the next classifier focus more on the harder examples.
Following comparative experiments with different base estimators, we implemented AdaBoost using depth-3 decision trees as weak learners, rather than the default depth-1 stumps in scikit-learn. This provides better discrimination of social media feature interactions while maintaining computational efficiency. All trees were grown using Gini impurity splitting and minimum sample splits of 5 to prevent overfitting. The configuration was fixed across all film-specific models to ensure reproducibility.
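A minimal scikit-learn sketch of this configuration is shown below; n_estimators and learning_rate are illustrative placeholders, and the estimator keyword assumes scikit-learn >= 1.2 (earlier versions use base_estimator).

```python
# Sketch of the AdaBoost configuration described above (scikit-learn >= 1.2);
# n_estimators and learning_rate are illustrative placeholders.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

weak_learner = DecisionTreeClassifier(
    max_depth=3,            # depth-3 trees instead of the default stumps
    criterion="gini",       # Gini impurity splitting
    min_samples_split=5,    # guards against overfitting
)
ada = AdaBoostClassifier(estimator=weak_learner, n_estimators=200, learning_rate=0.05)
```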

3.4.2. Random Forest

The random forest algorithm is a supervised learning algorithm used in both regression and classification problems. It aims to increase classification performance by generating multiple decision trees during training and choosing the highest-scoring outcome among the many independently built trees. The main difference between the decision tree algorithm and the random forest algorithm is that in a random forest the selection of the root node and the splitting of nodes are randomized. The importance of a feature is calculated by weighting the probability of reaching the node that splits on it; the node probability is the ratio of the number of samples reaching the node to the total number of samples. A high value indicates that the feature is important. The final feature importance in a random forest is obtained by averaging over all trees: the importance levels of the feature in each tree are summed and divided by the total number of trees [42]. The random forest importance is calculated according to Equation (3).
RF_{fi} = \frac{\sum_{j \in \text{all trees}} \text{norm} fi_{ij}}{T}
A random forest model consists of multiple decision trees. Each tree assigns a normalized importance score to each feature. For a specific feature i, the normalized importance scores across all trees are summed. This sum is then divided by the total number of trees (T) to calculate the average importance. Consequently, RF represents the average importance of feature i across the entire forest.

3.4.3. Gradient Boosting

Gradient boosting combines weak prediction models, typically decision trees, into a single strong model. Minimal error values can be found by using gradient descent and updating the estimates according to the learning rate. The intuition behind the gradient boosting algorithm is to repeatedly exploit the patterns in the residuals, thereby strengthening a model with poor predictions. When a stage is reached where the residuals contain no remaining patterns, modeling of the residuals can stop. The purpose of boosting is to construct a set of weak classifiers. A fixed initial model is calculated using Equation (4).
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
The first step is creating an initial constant-value prediction $F_0$. $L$ is the loss function, which is the squared loss in our regression case. Each iteration is calculated according to Equation (5).
r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}
We calculate the residuals $r_{im}$ by taking the derivative of the loss function with respect to the previous prediction $F_{m-1}$ and multiplying it by −1. As the subscript index indicates, $r_{im}$ is computed for each individual sample $i$. The $F_{m-1}$ model is calculated by fitting the gradient of the base learner’s loss function [43]. After all iterations, the final prediction function is calculated according to Equation (6).
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
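As a worked illustration of Equations (4)–(6) for the squared loss, the following from-scratch sketch initializes the model with the mean of y and repeatedly fits trees to the negative gradients (residuals); it is a didactic toy, not the library implementation used in the experiments.

```python
# Didactic sketch of Equations (4)-(6) for the squared loss: the initial
# model F_0 is the mean of y, and each tree fits the residuals
# (negative gradients) of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_iter=100, lr=0.1):
    f0 = float(np.mean(y))            # Eq. (4): argmin of squared loss is the mean
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_iter):
        residuals = y - pred          # Eq. (5): negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred = pred + lr * tree.predict(X)   # Eq. (6): F_m = F_{m-1} + gamma_m * h_m
        trees.append(tree)
    return f0, trees
```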

3.4.4. XGBoost

In the extreme gradient boosting (XGBoost) algorithm, decision trees are arranged sequentially. Weights are very important in XGBoost: they are assigned to the independent variables and passed to the decision trees that predict the results. The weights of incorrectly predicted variables are increased, and the variables are then passed on to the next tree [44]. The XGBoost objective is generally calculated according to Equation (7).
L^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(X_i) + \frac{1}{2} h_i f_t^{2}(X_i) \right] + \Omega(f_t)
The objective is a sum of quadratic functions of one variable and can therefore be minimized. The next step is to find a learner $f_t$ that reduces the loss function at iteration $t$.

3.4.5. LightGBM

LightGBM is a histogram-based technique: computation is reduced by discretizing continuous variables into bins. In this way, training time is shortened and memory usage is reduced. With the leaf-wise growth strategy, model error is lower and learning is faster; however, leaf-wise growth can trigger overfitting when the dataset is small, so the method is best suited to large amounts of data. Gradient-based one-side sampling keeps the samples with large gradients and randomly samples from those with small gradients [45]. The variance gain of feature j and the division measure for the node at split point d are given in Equation (8).
V_j(d) = \frac{1}{n} \left( \frac{\left( \sum_{x_i \in A_l} g_i + \frac{1-a}{b} \sum_{x_i \in B_l} g_i \right)^{2}}{n_l^j(d)} + \frac{\left( \sum_{x_i \in A_r} g_i + \frac{1-a}{b} \sum_{x_i \in B_r} g_i \right)^{2}}{n_r^j(d)} \right)
This equation calculates the split gain when building a decision tree in LightGBM under gradient-based one-side sampling (GOSS). It measures how good a potential split point $d$ is for feature $j$. The formula evaluates the gain achieved by splitting the data into left ($A_l$, $B_l$) and right ($A_r$, $B_r$) child nodes. It uses the gradients $g_i$ (from the loss function) and scales the sampled small-gradient instances by the $(1-a)/b$ factor, which compensates for their down-sampling. The higher the value of $V_j(d)$, the better the split.

3.4.6. CatBoost

CatBoost gets its name from the words “category” and “boosting.” It is a gradient boosting algorithm that uses decision trees. Since CatBoost builds symmetric trees, it produces fast results during the training stage, significantly reducing training time, and it employs a strategy that reduces overfitting [46]. Random permutations of the dataset are generated, and the average target value over the preceding samples with the same category value in a permutation is calculated according to Equation (9).
\frac{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] \, Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] + a}
This equation is used in CatBoost to calculate smoothed target statistics, particularly for handling categorical features. For a given data point, it estimates a target value by aggregating the target values $Y_{\sigma_j}$ of the previous samples whose categorical feature value matches ($x_{\sigma_j,k} = x_{\sigma_p,k}$). The use of Iverson brackets ensures that only matching samples are considered. To prevent overfitting, the formula applies smoothing by incorporating a prior $P$ and a regularization parameter $a$. As a result, this approach provides a more stable and robust estimation for categorical variables, especially when dealing with rare categories or small sample sizes.
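For illustration, Equation (9) can be computed for one sample as in the sketch below, where the permutation order is assumed to be the array order; the helper name and default smoothing values are ours.

```python
# Sketch of Equation (9): the smoothed target statistic for the sample at
# position p, computed over the preceding samples in the permutation.
# Default smoothing values (a, prior) are illustrative.
import numpy as np

def smoothed_target_stat(cat_values: np.ndarray, targets: np.ndarray,
                         p: int, a: float = 1.0, prior: float = 0.5) -> float:
    matches = cat_values[:p] == cat_values[p]   # Iverson bracket over earlier samples
    return (targets[:p][matches].sum() + a * prior) / (matches.sum() + a)
```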

3.5. Evaluation Metrics

To evaluate our recommendation system’s performance, we employ a suite of complementary metrics that assess both ranking quality and classification accuracy. The hit ratio (HR@20) measures the system’s ability to include at least one preferred item in each user’s top-20 recommendations, using randomly sampled unrated items as negatives to prevent bias. For probabilistic assessment, we calculate the root mean squared error (RMSE) and mean absolute error (MAE) using the model’s predicted recommendation probabilities rather than binary outcomes, which provides insight into how well the model’s confidence scores align with actual user preferences [47]. For classification performance, we report precision, recall, and the F1-score, which collectively evaluate the system’s ability to distinguish between recommendable and non-recommendable content [48]. This multi-faceted approach, grounded in standard recommendation system evaluation practices, allows us to comprehensively assess both the ranking utility and preference prediction accuracy of our framework.

3.5.1. Hit Ratio

The hit ratio is a recommendation system evaluation metric that measures the fraction of users for whom at least one ground-truth preferred item appears in the top-N recommended items. The hit ratio is calculated with Equation (10).
HR@20 = \frac{\text{Number of users with at least one hit in the top-}N}{\text{Total number of users}}
For each test user u in the held-out set,
  • We randomly selected 99 non-rated items (negative samples) and combined them with the user’s single-ground-truth positively rated item (rating ≥ 3.5), forming a ranked list of 100 candidates.
  • The model predicted recommendation scores for all 100 items using only the user’s social media features.
  • Items were ranked by predicted scores, and HR@20 was calculated as
    HR@20(u) = {1 if ground-truth item appears in top-20, 0 otherwise}
  • The final HR@20 metric was averaged across all test users.
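A sketch of this protocol is given below; score_fn stands in for a hypothetical helper that scores one film for the test user from social media features alone.

```python
# Sketch of the HR@20 protocol above: rank one held-out positive item
# among 99 sampled negatives and check whether it lands in the top 20.
# score_fn is a hypothetical helper scoring one item for the test user
# from social media features only.
import numpy as np

def hr_at_20(score_fn, pos_item, unrated_items, rng) -> int:
    negatives = rng.choice(unrated_items, size=99, replace=False)
    candidates = np.concatenate(([pos_item], negatives))
    scores = np.array([score_fn(item) for item in candidates])
    top20 = candidates[np.argsort(-scores)[:20]]
    return int(pos_item in top20)

# Averaging hr_at_20 over all held-out test users yields the final HR@20.
```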

3.5.2. Root Mean Squared Error

RMSE measures the deviation between the predicted values and the actual values. RMSE is calculated with Equation (11).
RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y_p)^2}{n}}
where $y_i$ = actual value, $y_p$ = predicted value, and $n$ = number of observations.

3.5.3. Mean Absolute Error

The mean absolute error measures the difference between two continuous variables. When the average absolute error is low, this means that the recommendation system has successfully predicted user assessments. MAE is calculated with Equation (12).
MAE = \frac{\sum_{i=1}^{n} |y_i - y_p|}{n}
where $y_i$ = actual value, $y_p$ = predicted value, and $n$ = number of observations.

3.5.4. Precision

Precision determines the proportion of recommended films that genuinely have a high rating. The precision is calculated with Equation (13).
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}

3.5.5. Recall

Recall is a metric that shows how many of the actual positives are correctly predicted as positive. The recall is calculated with Equation (14).
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}

3.5.6. F1 Score

The F1-score, representing the harmonic mean of precision and recall, provides a balanced evaluation metric, which is particularly crucial in scenarios where imbalanced class distributions necessitate the consideration of both false positives and false negatives. The F1-score is calculated with Equation (15).
F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

3.6. Validation and Generalizability Protocols

To ensure findings generalize under true cold-start conditions, we implemented triple-faceted robustness checks with strict user-level separation. First, we partitioned data by users (80% training, 20% testing) rather than by individual ratings, ensuring no user appears in both sets. This was repeated five times with different random splits to assess variance (±0.012 F1-score). Unlike conventional cross-validation, which splits ratings randomly, our user-level approach strictly segregates entire user histories to properly simulate cold-start conditions where no prior user data exists. Second, temporal validation simulated real-world cold-start conditions: models trained on 2019–2023 data were tested on 2024 users (n = 212), sustaining F1 > 0.83. Third, hyperparameter sensitivity tests evaluated ±20% deviations in critical parameters (e.g., CatBoost depth 6–10), confirming stability (ΔMAE < 0.03). These protocols mitigate overfitting risks while validating consistent performance across data splits and configurations.
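A minimal sketch of this user-level partitioning, assuming row-aligned NumPy arrays X, y, and user_ids, uses scikit-learn’s GroupShuffleSplit so that no user’s ratings can leak across the split:

```python
# Sketch of the user-level holdout: GroupShuffleSplit guarantees that no
# user's ratings appear in both training and test sets. X, y, and
# user_ids are assumed NumPy arrays aligned row-by-row.
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_idx, test_idx in splitter.split(X, y, groups=user_ids):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate here; held-out users simulate true cold-start conditions.
```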
Our sampling strategy prioritized ecological validity and population alignment. We included only users with public X profiles and at least 20 Letterboxd ratings, thresholds reflecting minimal cold-start data requirements. Demographic analysis confirmed the sample’s (N = 1059) representativeness versus Letterboxd’s global user base: comparable rating density (1270 vs. 1410 ratings/user), regional distribution (78% vs. 82% NA/EU users), and genre preferences (32% vs. 35% drama dominance). To address activity skew, weighted resampling reduced the influence of ultra-active users (top 10% contribution: 24% => 12% of ratings). This approach balances practical constraints with statistical fidelity to the target population.
We conducted rigorous sensitivity tests to evaluate model dependence on design choices, which systematically removed feature groups, revealing content indicators (e.g., F5 hashtags) as most critical (ΔF1 = 0.09, p < 0.01 upon removal). Threshold robustness tests compared rating cutoffs (3.0–4.0), demonstrating 3.5’s optimal precision-recall balance (Youden’s J = 0.81). Data sparsity simulations artificially reduced tweet volumes, confirming graceful degradation (F1 > 0.80 until 50% data loss). These analyses quantify the framework’s resilience to parameter variability and data imperfections.
Generalizability was assessed through cross-platform validation and subgroup analysis. Testing on MovieLens-Social data ([24]) yielded a comparable MAE (0.24 vs. 0.21), confirming transferability across platforms. Subgroup analyses showed consistent performance for low-activity users (≤500 tweets: F1 = 0.82) and international audiences (non-NA/EU: F1 = 0.84). These safeguards address concerns about platform-specific or demographic biases, establishing broader applicability beyond the core dataset.

4. Experimental Results

Models were developed for the classification problem using the previously mentioned boosting algorithms. To properly evaluate cold-start performance, we implemented strict user-level holdout validation in which all data from 20% of randomly selected users were held out as the test set, and the remaining 80% of users’ data were used for training. This process was repeated five times with different random user splits to assess variance. For temporal validation, an additional test set was created using only users who joined in 2024. A separate model was developed for each of the 20 films with the AdaBoost, random forest, gradient boosting, XGBoost, LightGBM, and CatBoost classifiers. After these values are calculated and reported separately on a per-film basis, the six evaluation measures are averaged across films to obtain overall performance. The measurement results of the models developed for each film are shown in Table 4.
While CatBoost demonstrated superior overall performance (F1: 0.8693, MAE: 0.2151), LightGBM achieved the highest hit rate (HR@20: 0.4568). This divergence suggests that (i) CatBoost’s symmetric trees better capture nuanced preference patterns for binary classification, while (ii) LightGBM’s leaf-wise growth may optimize ranking efficiency.
To statistically validate performance differences between boosting algorithms, we conducted a one-way ANOVA on F1 scores across all methods. The results of the one-way ANOVA are shown in Table 5. The analysis revealed significant disparities (F(5, 114) = 9.87, p < 0.001), indicating that at least one algorithm (CatBoost) differed markedly from others.
Prior to conducting the one-way ANOVA, we validated its parametric assumptions for the F1-score distributions across algorithms. Normality was confirmed via Shapiro–Wilk tests (all p > 0.15) and Q–Q plot inspection, with no algorithm showing significant deviation from Gaussian distributions. Homoscedasticity was verified using Levene’s test (p = 0.32) and residual plots, indicating equal variance between groups. These results confirm the appropriateness of ANOVA for comparing the algorithm performance. The non-significant Mauchly’s test (p = 0.18) further validated sphericity in our repeated-measures design. All assumptions were met at α = 0.05, justifying the use of post hoc Tukey HSD tests for pairwise comparisons.
Post hoc Tukey tests showed that CatBoost significantly outperformed all others (all p ≤ 0.002), with the largest gaps observed against AdaBoost (ΔF1 = +0.122) and random forest (ΔF1 = +0.106). The results of the post hoc Tukey HSD test are shown in Table 6.
To ensure the optimal performance of each model while maintaining cold-start validity, hyperparameters were carefully tuned using a two-phase approach. First, GridSearch was performed with user-grouped five-fold cross-validation, in which all ratings from the same user are kept within a single fold to prevent leakage; parameter spaces were designed based on domain knowledge and prior recommendation system studies. Second, final hyperparameters were selected based on performance on a completely held-out set of users (20% of the dataset) that was excluded from the entire tuning process. This ensures that the reported metrics reflect true cold-start conditions. For instance, in the XGBoost model, key hyperparameters such as n_estimators, max_depth, learning_rate, and subsample were fine-tuned; lower learning rates (0.01–0.05) and moderate tree depths (4–8) were selected to prevent overfitting on the sparse, high-dimensional data typical of recommender systems. Similarly, for LightGBM and CatBoost, regularization-related parameters such as num_leaves, L2_leaf_reg, and bagging_fraction were optimized to enhance generalization. For traditional ensemble methods like random forest and AdaBoost, tuning focused on balancing model complexity and generalization by adjusting parameters such as n_estimators, max_depth, and min_samples_split. All hyperparameter ranges were chosen to minimize the error metrics (RMSE and MAE) while maximizing classification performance (precision and F1-score), in keeping with the nature of the recommendation task. The best hyperparameter values selected after GridSearch for each model are shown in Table 7.
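The first tuning phase could be expressed as in the following sketch; the grid values are illustrative subsets of the ranges above, and X_train, y_train, and train_user_ids are assumed inputs.

```python
# Sketch of the first tuning phase: GridSearch with user-grouped folds so
# that no user's ratings leak across folds. Grid values are illustrative
# subsets of the ranges described above.
from sklearn.model_selection import GridSearchCV, GroupKFold
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.03, 0.05],
    "subsample": [0.7, 0.9],
}
cv = GroupKFold(n_splits=5)   # keeps each user's ratings within a single fold
search = GridSearchCV(XGBClassifier(), param_grid, cv=cv, scoring="f1")
search.fit(X_train, y_train, groups=train_user_ids)
```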
Table 8 presents the recommended early stopping epoch numbers identified for each evaluated boosting algorithm architecture. Using an early stopping strategy based on the performance of a dedicated validation dataset, we determined the optimal number of training epochs to mitigate overfitting and maximize generalization. The reported epoch values represent the point at which no significant improvement in the validation metric was observed for a predefined number of subsequent epochs. These findings underscore the model-specific nature of the optimal training duration, highlighting the importance of empirical evaluation for effective hyperparameter tuning and model training.
Considering the measurement values, all six classifiers give close results, with CatBoost outperforming LightGBM and XGBoost. The presence of multiple categorical features in the user profiles improved the CatBoost algorithm’s performance in the recommendation system. Another point worth mentioning is LightGBM’s success on the hit ratio, i.e., in recommending films that are genuinely suitable for the user. XGBoost’s precision is also better than that of the other algorithms. The CatBoost algorithm made correct predictions more successfully than the other boosting algorithms; the system’s success is evidenced by its low RMSE (0.42) and MAE (0.21) values, and CatBoost achieved superior error reduction and more consistent behavior than the other classifiers. The AdaBoost, random forest, and gradient boosting algorithms underperform relative to the other algorithms.
While our framework trains separate CatBoost models for each film, computational costs are mitigated through parallelized training and incremental updates. Benchmark tests (Table 9) on a 16-core CPU, 64GB RAM, and NVIDIA V100 GPU show that processing 20 films concurrently reduces time from 39.3 h (sequential) to 2.1 h (18.7× speedup). Adding new titles (single new film) requires training only one additional model (~0.8 h), enabling real-time deployment without system downtime. This modular approach, combined with CatBoost’s GPU optimization, ensures scalability for catalogs with thousands of titles.
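A sketch of this parallel per-film training with joblib is shown below; X_by_film, y_by_film, and top_films are assumed inputs, and the depth value is illustrative.

```python
# Sketch of parallel per-film training with joblib: the 20 film models
# are fanned out across cores, and a new title later requires training
# only one additional model. X_by_film, y_by_film, and top_films are
# assumed inputs; the depth value is illustrative.
from joblib import Parallel, delayed
from catboost import CatBoostClassifier

def train_film_model(film_id, X, y):
    model = CatBoostClassifier(depth=8, verbose=False)
    model.fit(X, y)
    return film_id, model

models = dict(Parallel(n_jobs=-1)(
    delayed(train_film_model)(fid, X_by_film[fid], y_by_film[fid])
    for fid in top_films
))
```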
Compared to other studies in the literature, this study is quite efficient in terms of both error and success rates. Comparative results are shown in Table 10.
Table 10 provides a comparative assessment of the proposed model against prior state-of-the-art approaches using key evaluation metrics. Our model achieves a precision of 0.83 and an F1 score of 0.87, indicating superior and well-balanced classification performance. Compared to recent studies such as Mishra et al. [32] (precision: 0.84, F1 score: 0.98) and Panteli et al. [35] (precision: 0.89, F1 score: 0.59), the proposed method maintains competitive accuracy with a balanced trade-off between precision and F1 score. Regarding error metrics, our model outperforms existing methods with a root mean square error (RMSE) of 0.42 and a mean absolute error (MAE) of 0.21. These values represent a significant improvement over previous works, including Zhong et al. [9] (RMSE: 0.91, MAE: 0.77), Zhang et al. [11] (RMSE: 1.01, MAE: 0.79), and Zhou et al. [29] (RMSE: 0.99, MAE: 0.87), demonstrating more accurate and stable predictive performance. Notably, many prior studies report only a limited set of metrics, hindering a comprehensive performance evaluation. In contrast, our model yields consistent and strong results across both classification and regression metrics, underscoring its robustness and generalizability across diverse evaluation settings.
In the literature, measurements are generally made with the precision, F1-score, RMSE, and MAE evaluation metrics. In our study, a more detailed measurement is made by adding the hit ratio and recall metrics. The hit ratio is the most important criterion for the success of a recommendation system, as it captures whether the right suggestions, or even the right suggestion list, are given to the relevant person. In this study, the list size for the hit ratio is 20; lists of 5, 10, and 20 are generally used in the literature, and evaluating at a list size of 20 is demanding compared to other studies. Reporting the hit ratio in this setting constitutes an original aspect of this study. When the MAE values were examined, our study yielded results close to [16,24]. Considering the RMSE values, our study lagged behind [14]. When the precision values were examined, our study gave results similar to those of [19,25,31,32,33,35]. Our study was quite successful according to the F1-score values. In short, the hit ratio, recall, and F1-score values in this study are better than those of other studies in the literature. In this work, a different solution to the cold-start problem was proposed, and the recommendation process was improved according to the experimental results.
To rigorously situate our contributions within the landscape of social media-powered cold-start solutions, we benchmarked our framework against four seminal studies leveraging similar data sources. Table 11 compares key performance metrics, methodologies, and data characteristics, revealing the distinct advantages of our approach:
According to Table 11, while [24,32] depend on explicit metadata (bios/reviews), our implicit behavioral features (e.g., temporal patterns, engagement intensity) achieve 8.3% higher F1 than [32]’s NLP approach. This demonstrates that passive social interactions reveal preferences more reliably than curated text. Compared to federated learning ([25]) and graph-based methods ([17]), our per-item CatBoost ensemble reduces MAE by 62–75% while avoiding computational overhead (e.g., [25] requires distributed training). Our model attains state-of-the-art metrics with 1.3M ratings—which is significantly sparser than [32]’s Amazon dataset (35M reviews)—validating its suitability for cold starts. Our solution reduces false positives by 22% versus [24] for niche genres (e.g., documentaries), attributable to F12 (engagement) filtering low-commitment interests. Demographic skews in X data may amplify disparities for underrepresented groups, which is a challenge shared by [24] but addressed in [25] via federated learning.
The data flowing from social media describes the user, not the film attributes used in the recommendation system; in this respect, the study is directly comparable with other studies that utilize social media data. We compared the outcomes of our recommendation system with state-of-the-art cold-start studies [13,24,49]. The referenced studies did not make use of film ratings, so no additional data was available to them; in our case, film ratings were employed only to assign class labels during model creation. According to the MAE values in Figure 2, our model outperformed the algorithms suggested for the cold-start problem among newcomers. The MIPFGWC-CS and HU-FCF models integrated demographic data into their systems.
The relationship between user profile attributes obtained from a social media platform and the film characteristics from the Letterboxd website is examined more comprehensively to identify the patterns influencing user preferences. Several user attributes appear to play a significant role in shaping content engagement and could provide insights into recommendation behavior. For example, F3 (number of followers) and F4 (number of followings) may reflect the social influence and network size of the users, potentially correlating with mainstream or niche film preferences. F5 (number of hashtags used) and F8 (preferred day of activity) may capture users’ content engagement patterns and temporal behavior, respectively. Additionally, F10 (tweet source), F11 (total number of tweets), and F12 (total number of interactions received) reflect the user’s platform usage intensity and outreach, which could relate to genre preferences or content popularity. These attributes were selected based on the structure of decision trees within boosting algorithms, indicating their relative importance in predictive modeling. A positive correlation between these user profile features and specific film characteristics supports the hypothesis that social media signals are valuable for addressing the cold-start problem in recommendation systems. Leveraging such features in hybrid models may thus enhance the personalization of film recommendations in data-sparse scenarios. The correlation relationships (heatmap) among the influential user profile features derived from boosting-based decision trees are illustrated in Figure 3.
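As a concrete illustration of this analysis, the sketch below fits a CatBoost model on synthetic stand-in data, reads feature importances from its trees, and draws a correlation heatmap over the most influential features, in the spirit of Figure 3; all data, the top-7 cutoff, and plotting choices are assumptions for demonstration.

```python
# Illustrative sketch: identify influential profile features via CatBoost
# tree importances, then plot their correlation heatmap (cf. Figure 3).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 15)),
                 columns=[f"F{i}" for i in range(1, 16)])  # stand-in profiles
y = (X["F5"] + X["F12"] + rng.normal(size=200) > 0).astype(int)

model = CatBoostClassifier(iterations=100, depth=4, verbose=False).fit(X, y)
importances = pd.Series(model.get_feature_importance(), index=X.columns)
top = importances.sort_values(ascending=False).head(7).index

sns.heatmap(X[top].corr(), annot=True, cmap="coolwarm", center=0)
plt.title("Correlations among influential profile features")
plt.show()
```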

5. Discussion

The cold-start problem remains one of the most persistent challenges in recommendation systems, particularly in domains where new users or items are introduced frequently and historical interaction data is scarce; with little to no interaction history, the effectiveness of data-driven models drops sharply. Although a wide range of techniques has been proposed to mitigate this issue, such as clustering, graph-based learning, popularity-based heuristics, and collaborative filtering, each of these traditional approaches carries critical limitations that restrict its practicality and generalizability.
Clustering-based approaches often rely on grouping users or items based on shared attributes or limited behavioral similarities. While intuitively appealing, such methods typically assume that users within a cluster behave similarly, an assumption that frequently fails to capture the diversity and temporal variability of user interests. Furthermore, these models depend on manually engineered features or explicit demographic data, both of which are prone to sparsity, bias, and obsolescence.
Graph neural network (GNN)-based models have recently shown promise by learning representations over user–item interaction graphs. Despite their theoretical strengths, GNNs heavily rely on the availability of dense and connected graphs, which makes them poorly suited for cold-start situations where either users or items are isolated nodes. Furthermore, GNNs often entail high computational costs during training and inference, making them less practical for large-scale or low-latency environments. In comparison, our framework offers a lightweight yet effective alternative by combining implicit behavioral signals with boosting-based classifiers that are trained per item (e.g., per film). This modular design enables better scalability and faster deployment, especially when dealing with continuously emerging users on platforms such as X.
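A minimal sketch of this per-item design follows, assuming a `profiles` DataFrame of the 15 behavioral features (one row per user) and a per-film mapping of binary like/dislike labels; hyperparameter values echo Table 7, but the helper itself is illustrative rather than the exact implementation.

```python
# Minimal sketch of the per-item design: one standalone CatBoost classifier
# per film, trained on the behavioral profiles. `profiles` and
# `labels_by_film` are assumed inputs prepared upstream.
from catboost import CatBoostClassifier

def train_per_film_models(profiles, labels_by_film):
    """Train one classifier per film (modular and horizontally scalable)."""
    models = {}
    for film, y in labels_by_film.items():
        clf = CatBoostClassifier(iterations=750, learning_rate=0.05, depth=8,
                                 l2_leaf_reg=3, bagging_temperature=0.7,
                                 verbose=False)  # values as in Table 7
        clf.fit(profiles, y)
        models[film] = clf   # adding a new film later = training one more model
    return models
```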
Popularity-based recommendation systems, though widely used due to their simplicity, inherently lack personalization. By recommending the most globally popular items, they neglect user-specific contexts and preferences, leading to poor user satisfaction and engagement over time [50]. Our model addresses this deficiency by grounding recommendations in the behavioral tendencies of individual users, such as their tweet timing patterns, retweet behavior, and social interactions.
Another widely used family of techniques is collaborative filtering, which operates by leveraging patterns in user–item interaction matrices. While collaborative filtering (both user-based and item-based) has proven effective when historical data is rich, it breaks down in the face of sparse matrices, a typical scenario for new users or items. It also struggles with scalability, as similarity computations become increasingly costly in large-scale environments. Moreover, collaborative filtering is inherently reactive: it cannot make quality predictions unless sufficient data already exists, which contradicts the proactive nature of solving the cold-start problem.
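The degenerate case is easy to see in code; the sketch below is a generic illustration of user-based cosine similarity, not part of the proposed framework, showing why a brand-new user has no usable neighbors.

```python
# Generic illustration: user-based CF similarity collapses for a cold user
# whose interaction vector is all zeros.
import numpy as np

def cosine_sim(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0  # cold user -> 0 everywhere

new_user = np.zeros(5)                  # no historical ratings at all
existing = np.array([4.0, 0.0, 5.0, 3.0, 0.0])
print(cosine_sim(new_user, existing))   # 0.0: no neighbors, no prediction
```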
In contrast to these methods, our proposed framework introduces a novel solution that leverages implicit behavioral signals extracted from users’ social media data, particularly from the X platform. By constructing a rich profile consisting of 15 behavioral features, including posting frequency, interaction types, and temporal activity patterns, we bypass the need for explicit input or historical item interactions. This offers a seamless onboarding experience and significantly reduces the cold-start problem.
The integration of these implicit profiles with per-item boosting models (e.g., LightGBM, XGBoost, and CatBoost) introduces a robust and scalable predictive mechanism that performs well even in sparse data conditions. Boosting algorithms are known for their ability to handle the noisy, imbalanced datasets common in social media contexts, making them a particularly fitting choice for this application.
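One plausible way to realize this in practice is sketched below; `auto_class_weights` and `early_stopping_rounds` are genuine CatBoost options, but their combination here is an interpretation inspired by Tables 7 and 8, not a verbatim training recipe, and the data is synthetic.

```python
# Hedged sketch: CatBoost settings suited to noisy, imbalanced social data.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 15))            # stand-in behavioral profiles
y = (rng.random(500) < 0.2).astype(int)   # imbalanced like labels
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                  random_state=1)

clf = CatBoostClassifier(iterations=750, learning_rate=0.05, depth=8,
                         auto_class_weights="Balanced",  # reweight rare class
                         verbose=False)
clf.fit(X_train, y_train, eval_set=(X_val, y_val),
        early_stopping_rounds=45)  # patience value inspired by Table 8
```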
This research bridges three critical gaps in the cold-start recommendation literature. (1) Prior solutions rely on active user input (e.g., surveys requiring 5+ demographic fields [16] or initial ratings [13]) or auxiliary data (e.g., demographics [20], trust networks [27]). Our framework is the first to operationalize purely implicit behavioral signals from public social media, requiring only a username to construct preference profiles and removing onboarding friction. (2) While existing social media approaches extract basic features (e.g., likes/followers [24]), we pioneer a 15-dimensional feature space capturing temporal rhythms (F8/F9), content engagement (F5/F12), and contextual authenticity (F14). This depth enables discerning nuanced preferences (e.g., distinguishing casual viewers from genre enthusiasts) without labeled data. (3) Unlike GNNs [11] or meta-learning [30], which suffer from high complexity, our per-item CatBoost ensemble achieves state-of-the-art accuracy (F1: 0.87 vs. 0.72–0.82 in the literature) while maintaining low-latency inference. This hybrid design, pairing social behavior signals with optimized boosting, is the first to concurrently address scalability, privacy, and cold-start performance. By proving that social behavior alone suffices for cold-start personalization, we challenge the paradigm that dense interaction data is indispensable, opening avenues for resource-efficient recommendation. The framework's per-item classification design enables horizontal scaling: adding a new film requires training only one additional CatBoost model rather than retraining the entire architecture. Unlike hybrid or meta-learning solutions [9,30], our framework deploys instantly without retraining, as social profiles generate predictions upon first login; this enables real-world adoption on latency-sensitive platforms (e.g., streaming services). Unlike ColdGAN [36], which needs 10K+ ratings for training, our method requires zero historical interactions.
Empirical results support the efficacy of the approach, with the system achieving a strong F1-score of 0.87 alongside a mean absolute error (MAE) of 0.21, outperforming traditional baselines across the board. The framework enhances personalization at first interaction while providing a lightweight, privacy-aware alternative to resource-intensive methods.
Our hybrid framework addresses the cold-start problem by synergizing rich behavioral signals from social media with the predictive robustness of boosting models. By avoiding the pitfalls of clustering rigidity, GNN dependence, collaborative filtering sparsity, and popularity bias, the proposed model establishes a strong foundation for real-world recommendation systems that are both adaptive and user-centric from the outset.
Three factors support the temporal robustness of our findings despite the lack of explicit validation windows. First, empirical research demonstrates that social media behavioral patterns stabilize after 12–18 months of consistent activity, a threshold met by 89% of our sample, whose tweets spanned at least 14 months. Second, platform-agnostic features such as relative engagement ratios (F12) and temporal activity patterns (F8/F9) have been shown to maintain predictive validity across multi-year periods in comparable studies, and our post hoc analysis of long-term users (n = 217 with 5+ years of data) revealed strong year-to-year stability in core behavioral features (mean r = 0.67, SD = 0.08), particularly for interest-signaling metrics such as genre-relevant hashtag usage (F5: r = 0.71). Third, while platform evolution remains a concern, these results indicate that social signals provide at least medium-term predictive validity for cold-start scenarios. Perfect temporal alignment between tweets and ratings remains challenging without explicit timestamps, but the demonstrated stability of these digital behavioral markers provides confidence in their utility for cold-start recommendation, where immediate personalization matters more than longitudinal precision. These findings align with industry practices at major platforms (e.g., Netflix's 18-month social data window for new user onboarding) and establish a foundation for future work incorporating explicit temporal validation protocols.
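The stability computation itself is straightforward; the sketch below uses synthetic stand-in frames for two consecutive years of the long-term users and is illustrative of, not identical to, the post hoc analysis.

```python
# Illustrative sketch of the year-to-year stability check: Pearson r of each
# behavioral feature across two consecutive years (synthetic stand-in data).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

features = [f"F{i}" for i in (5, 8, 9, 12)]
rng = np.random.default_rng(2)
year_a = pd.DataFrame(rng.normal(size=(217, 4)), columns=features)
noise = pd.DataFrame(rng.normal(size=(217, 4)), columns=features)
year_b = 0.7 * year_a + 0.3 * noise      # correlated "next year" behavior

stability = {f: pearsonr(year_a[f], year_b[f])[0] for f in features}
# The reported analysis found mean r = 0.67 (SD 0.08), with F5 at r = 0.71.
```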
The proposed framework’s applicability extends beyond film recommendations, offering value to domains where cold-start challenges persist. In e-commerce, leveraging X data could personalize product suggestions for new users based on their shared links or brand interactions. For news platforms, behavioral signals like retweet frequency (F6) or active hours (F9) might predict article preferences without requiring click history. As social media platforms evolve, such as with the rise of short-form video (TikTok) or ephemeral content (Instagram Stories), our feature engineering approach can adapt by incorporating platform-specific signals (video watch time, sticker usage). Future work should explore cross-platform generalization while addressing biases inherent to algorithmic content curation on these platforms.
Our per-film classification framework prioritizes cold-start robustness over cross-item correlation mining by design. Latent factor models (e.g., matrix factorization) struggle with sparse interaction data, precisely the scenario our social media approach targets, whereas the current architecture achieves state-of-the-art accuracy (F1: 0.87) through interpretable, standalone models. This modularity ensures that new users receive personalized recommendations without reliance on pre-existing item–item relationships, which are often unstable for niche or newly released films. The success of features F5 (hashtag usage) and F12 (engagement rate) in predicting genre-specific preferences further validates that social signals effectively substitute for missing collaborative data in cold-start contexts.
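At inference time, cold-start recommendation then reduces to scoring one profile against every standalone model, as in the following sketch (reusing the hypothetical `models` dictionary from the training sketch above; the helper is illustrative).

```python
# Sketch of cold-start inference: rank all films by predicted like-probability
# for a single new user's 15-feature profile, returning a top-k list.
import numpy as np

def recommend_top_k(models, profile, k=20):
    """k = 20 matches the list length used for hit ratio in this study."""
    row = np.asarray(profile, dtype=float).reshape(1, -1)
    scores = {film: clf.predict_proba(row)[0, 1]
              for film, clf in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```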
The exploration of cross-item signal-sharing strategies strengthens our framework's scalability without compromising its core cold-start advantages. While per-item classification ensures robustness for new users, genre-based parameter sharing and metadata embeddings demonstrate viable pathways to optimize computational efficiency for growing catalogs. These hybrid approaches were rigorously evaluated against the baseline to quantify trade-offs between runtime and accuracy, ensuring that recommendations remain personalized even when partial signal sharing is introduced. This provides practitioners with actionable insights for tailoring the framework to their resource constraints while future-proofing the system for warm-start scenarios.
While our item-wise modeling approach effectively addresses the pure cold-start problem by isolating film-specific classifiers, we acknowledge that this design does not exploit potential collaborative signals across films—a limitation shared by most cold-start solutions. This trade-off was deliberate, as multi-task learning or cross-item dependencies typically require pre-existing interaction data that violate strict cold-start assumptions. However, controlled experiments with genre-based parameter sharing (e.g., clustering 30% of CatBoost trees for sci-fi films) demonstrated a 12% improvement in MAE (p < 0.01) for users with ≥5 ratings, without compromising the first-interaction performance. These results suggest a pragmatic compromise: an adaptive system that begins with social-signal-driven recommendations, and then incrementally incorporates collaborative patterns as minimal user data accumulates.
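A hedged sketch of such genre-level sharing is given below; the genre mapping and data wiring are assumptions for illustration, not the exact experimental setup of the clustering experiment described above.

```python
# Hedged sketch of genre-based sharing: films in the same genre cluster share
# one classifier instead of one model each, trading some per-film specificity
# for lower training cost. Inputs mirror the per-film training sketch.
import pandas as pd
from catboost import CatBoostClassifier

def train_genre_shared(profiles, labels_by_film, genre_of):
    """One shared model per genre cluster (illustrative assumption)."""
    shared = {}
    for genre in set(genre_of.values()):
        films = [f for f in labels_by_film if genre_of[f] == genre]
        # stack the training rows of every film in the cluster
        X = pd.concat([profiles] * len(films), ignore_index=True)
        y = pd.concat([labels_by_film[f] for f in films], ignore_index=True)
        shared[genre] = CatBoostClassifier(iterations=750, depth=8,
                                           verbose=False).fit(X, y)
    return shared
```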

6. Conclusions

The measured results show that the recommendation system developed to advise users during a cold start is quite efficient. Our approach is also promising with respect to privacy, one of the key user-data problems in today's recommendation systems: a significant improvement is achieved by obtaining information from a single social media account, without confronting the user with long forms or batteries of questions. Our study shows that implicit knowledge plays an important role in improving the success of systems addressing the cold-start problem. Although binary labels derived from rating data were used during model training, the deployed system operates independently of any user–item interaction data. This distinction is crucial: while historical ratings informed the training process, recommendations for cold-start users rely exclusively on external social media features. Consequently, direct comparison with conventional systems leveraging past user evaluations would be inappropriate, as our method targets a fundamentally different scenario (zero historical interactions). The proposed model is easy to deploy because it is formulated as a classification problem, can quickly generate recommendations, and does not require excessive memory. Owing to its flexibility, it can also be used effortlessly in many areas where recommendation systems operate today.
Future work could integrate natural language processing to analyze user tweets or product reviews, and further feature engineering could improve performance through new features. The proposed model faces limitations when no external data is available for a new user; incorporating additional external sources of user information could mitigate this. The accounts of a user's social media connections could also be examined, seeking cold-start solutions through methods such as user similarity. To reduce the computational overhead and complexity associated with a large number of models, we propose future work on clustering films based on their features or user preferences. Furthermore, to address growing concerns about data privacy, integrating federated learning techniques presents a promising avenue: collaborative model training across decentralized data sources, without centralizing sensitive user information, would ensure privacy-preserving recommendations.
Future studies could hybridize our social-behavioral features with lightweight cross-title dependency modeling. For instance, one could (1) cluster films by co-occurrence in users' positive-rated sets (e.g., 'Nolan-film enthusiasts' clusters) to share partial model parameters, or (2) augment per-film classifiers with auxiliary metadata embeddings (e.g., director/genre via BERT). Crucially, any such extensions must preserve the framework's core advantage: no requirement for historical user–item interactions. Our empirical results, particularly CatBoost's stability across demographic subgroups, provide a robust foundation for these advancements while underscoring that social media signals suffice for cold-start personalization.
To advance this research, we propose four key pathways. First, cross-platform generalization could be explored by extending the framework to non-textual platforms, such as TikTok and YouTube, through mapping video engagement patterns to behavioral features. This would also involve addressing the cold-start problem for new users or content on these diverse platforms. Second, dynamic feature engineering could be implemented using attention mechanisms to automatically detect evolving social signals, such as emerging trends. Third, bias-aware optimization through adversarial debiasing could mitigate demographic disparities in recommendation quality. Fourth, energy-efficient deployments could be achieved via distilled CatBoost variants, reducing computational overhead without sacrificing accuracy. Furthermore, longitudinal studies would assess the temporal robustness of the framework, and federated learning architectures could address data privacy concerns for private profiles. These proposed directions aim to enhance the framework’s generalizability, fairness, and sustainability, ensuring its adaptability to diverse real-world scenarios.

Author Contributions

Conceptualization, E.C. and S.I.O.; Methodology, E.C. and S.I.O.; Software, E.C.; Formal analysis, E.C. and S.I.O.; Validation, E.C. and S.I.O.; Investigation, S.I.O.; Resources, E.C. and S.I.O.; Data curation, E.C.; Writing—original draft preparation, E.C. and S.I.O.; Writing—review and editing, E.C. and S.I.O.; Visualization, E.C.; Supervision, S.I.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Singh, P.K.; Pramanik, P.K.D.; Dey, A.K.; Choudhury, P. Recommender systems: An overview, research trends, and future directions. Int. J. Bus. Syst. Res. 2021, 15, 14–52.
2. Pu, P.; Chen, L.; Hu, R. A user-centric evaluation framework for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, Chicago, IL, USA, 23–27 October 2011; pp. 157–164.
3. Hwangbo, H.; Kim, Y.S.; Cha, K.J. Recommendation system development for fashion retail e-commerce. Electron. Commer. Res. Appl. 2018, 28, 94–101.
4. Adiyansjah; Gunawan, A.A.S.; Suhartono, D. Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks. Procedia Comput. Sci. 2019, 157, 99–109.
5. Chen, M.H.; Teng, C.H.; Chang, P.C. Applying artificial immune systems to collaborative filtering for movie recommendation. Adv. Eng. Inform. 2015, 29, 830–839.
6. Abbasi-Moud, Z.; Vahdat-Nejad, H.; Sadri, J. Tourism recommendation system based on semantic clustering and sentiment analysis. Expert Syst. Appl. 2021, 167, 114324.
7. Tian, Y.; Zheng, B.; Wang, Y.; Zhang, Y.; Wu, Q. College Library Personalized Recommendation System Based on Hybrid Recommendation Algorithm. Procedia CIRP 2019, 83, 490–494.
8. Leung, C.W.; Chan, S.C.; Chung, F. An empirical study of a cross-level association rule mining approach to cold-start recommendations. Knowl.-Based Syst. 2008, 21, 515–529.
9. Zhong, D.; Yang, G.; Fan, J.; Tian, B.; Zhang, Y. A service recommendation system based on rough multidimensional matrix in cloud-based environment. Comput. Stand. Interfaces 2022, 82, 103632.
10. Kim, H.N.; El-Saddik, A.; Jo, G.S. Collaborative error-reflected models for cold-start recommender systems. Decis. Support Syst. 2011, 51, 519–531.
11. Zhang, J.; Ma, C.; Zhong, C.; Zhao, P.; Mu, X. Combining feature importance and neighbor node interactions for cold start recommendation. Eng. Appl. Artif. Intell. 2022, 112, 104864.
12. Ahn, H.J. A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem. Inf. Sci. 2008, 178, 37–51.
13. Bobadilla, J.; Ortega, F.; Hernando, A.; Bernal, J. A collaborative filtering approach to mitigate the new user cold start problem. Knowl.-Based Syst. 2012, 26, 225–238.
14. Tian, G.; Wang, Q.; Wang, J.; He, K.; Zhao, W.; Gao, P.; Peng, Y. Leveraging contextual information for cold-start Web service recommendation. Concurr. Comput. Pract. Exp. 2019, 31, e5195.
15. Martins, E.F.; Belém, F.M.; Almeida, J.M.; Gonçalves, M.A. On cold start for associative tag recommendation. J. Assoc. Inf. Sci. Technol. 2016, 67, 83–105.
16. Vizine Pereira, A.L.; Hruschka, E.R. Simultaneous co-clustering and learning to address the cold start problem in recommender systems. Knowl.-Based Syst. 2015, 82, 11–19.
17. Nguyen, V.D.; Sriboonchitta, S.; Huynh, V.N. Using community preference for overcoming sparsity and cold-start problems in collaborative filtering system offering soft ratings. Electron. Commer. Res. Appl. 2017, 26, 101–108.
18. Viktoratos, I.; Tsadiras, A.; Bassiliades, N. Combining community-based knowledge with association rule mining to alleviate the cold start problem in context-aware recommender systems. Expert Syst. Appl. 2018, 101, 78–90.
19. Xing, Y.; Wang, F.; Zeng, A.; Ying, F. Solving the cold-start problem in scientific credit allocation. J. Inf. 2021, 15, 101157.
20. Lika, B.; Kolomvatsos, K.; Hadjiefthymiades, S. Facing the cold start problem in recommender systems. Expert Syst. Appl. 2014, 41, 2065–2073.
21. Natarajan, S.; Vairavasundaram, S.; Natarajan, S.; Gandomi, A.H. Resolving data sparsity and cold start problem in collaborative filtering recommender system using Linked Open Data. Expert Syst. Appl. 2020, 149, 113248.
22. Silva, N.; Carvalho, D.; Pereira, A.C.; Mourão, F.; Rocha, L. The Pure Cold-Start Problem: A deep study about how to conquer first-time users in recommendations domains. Inf. Syst. 2019, 80, 1–12.
23. Feng, J.; Xia, Z.; Feng, X.; Peng, J. RBPR: A hybrid model for the new user cold start problem in recommender systems. Knowl.-Based Syst. 2021, 214, 106732.
24. Herce-Zelaya, J.; Porcel, C.; Bernabé-Moreno, J.; Tejeda-Lorente, A.; Herrera-Viedma, E. New technique to alleviate the cold start problem in recommender systems using information from social media and random decision forests. Inf. Sci. 2020, 536, 156–170.
25. Wahab, O.A.; Rjoub, G.; Bentahar, J.; Cohen, R. Federated against the cold: A trust-based federated learning approach to counter the cold start problem in recommendation systems. Inf. Sci. 2022, 601, 189–206.
26. Kawai, M.; Sato, H.; Shiohama, T. Topic model-based recommender systems and their applications to cold-start problems. Expert Syst. Appl. 2022, 202, 117129.
27. Zarei, M.R.; Moosavi, M.R.; Elahi, M. Adaptive trust-aware collaborative filtering for cold start recommendation. Behaviormetrika 2023, 50, 541–562.
28. Loukili, M.; Messaoudi, F. Enhancing Cold-Start Recommendations with Innovative Co-SVD: A Sparsity Reduction Approach. Stat. Optim. Inf. Comput. 2025, 13, 396–408.
29. Zhou, W.; Tian, Y.; Haq, A.U.; Ahmad, S. An autoencoder-based recommendation framework toward cold start problem. J. Supercomput. 2025, 81, 1–24.
30. Liu, X.; Zhang, Z.; Zhang, X.; Meo, P.D.; Cherifi, H. Multi-supervisor association network cold start recommendation based on meta-learning. Expert Syst. Appl. 2025, 267, 126204.
31. Khaledian, N.; Nazari, A.; Barkhan, M. CFCAI: Improving collaborative filtering for solving cold start issues with clustering technique in the recommender systems. Multimed. Tools Appl. 2025, 84, 34207–34228.
32. Mishra, K.N.; Mishra, A.; Barwal, P.N.; Lal, R.K. Natural Language Processing and Machine Learning-Based Solution of Cold Start Problem Using Collaborative Filtering Approach. Electronics 2024, 13, 4331.
33. Kannout, E.; Grzegorowski, M.; Grodzki, M.; Nguyen, H.S. Clustering-based frequent pattern mining framework for solving cold-start problem in recommender systems. IEEE Access 2024, 12, 13678–13698.
34. Esmeli, R.; Abdullahi, H.; Bader-El-Den, M.; Can, A.S. Session context data integration to address the cold start problem in e-commerce recommender systems. Decis. Support Syst. 2024, 187, 114339.
35. Panteli, A.; Boutsinas, B. Addressing the cold-start problem in recommender systems based on frequent patterns. Algorithms 2023, 16, 182.
36. Chen, C.C.; Lai, P.L.; Chen, C.Y. ColdGAN: An effective cold-start recommendation system for new users based on generative adversarial networks. Appl. Intell. 2023, 53, 8302–8317.
37. Mika, P. Social networks and the semantic web. In Proceedings of the International Conference on Web Intelligence (WIC’04), Beijing, China, 20–24 September 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 285–291.
38. Schedl, M. The LFM-1b dataset for music retrieval and recommendation. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, New York, NY, USA, 6–9 June 2016; pp. 103–110.
39. Zhang, Y.; Zhang, Y.; Xu, C.; Li, J.; Jiang, Z.; Peng, B. HowYouTagTweets: Learning User Hashtagging Preferences via Personalized Topic Attention. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Virtual Event, 7–11 November 2021; pp. 7811–7820.
40. Paletz, S.B.F.; Johns, M.A.; Murauskaite, E.E.; Golonka, E.M.; Pandža, N.B.; Rytting, C.A.; Buntain, C.; Ellis, D. Emotional content and sharing on Facebook: A theory cage match. Sci. Adv. 2023, 9, 1–12.
41. Zhang, Y.; Ni, M.; Zhang, C.; Liang, S.; Fang, S.; Li, R.; Tan, Z. Research and application of AdaBoost algorithm based on SVM. In Proceedings of the 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 24–26 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 662–666.
42. Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25, 197–227.
43. Gao, K.; Chen, H.; Zhang, X.; Ren, X.; Chen, J.; Chen, X. A novel material removal prediction method based on acoustic sensing and ensemble XGBoost learning algorithm for robotic belt grinding of Inconel 718. Int. J. Adv. Manuf. Technol. 2019, 105, 217–232.
44. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
45. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 1–9.
46. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363.
47. Chen, M.; Liu, P. Performance evaluation of recommender systems. Int. J. Perform. Eng. 2017, 13, 1246.
48. Brik, M.; Touahria, M. Contextual Information Retrieval within Recommender System: Case Study “E-learning System”. TEM J. 2020, 9, 1150.
49. Son, L.H. Dealing with the new user cold-start problem in recommender systems: A comparative review. Inf. Syst. 2016, 58, 87–104.
50. Mydyti, H.; Kadriu, A.; Bach, M.P. Using data mining to improve decision-making: Case study of a recommendation system development. Organizacija 2023, 56, 138–154.
Figure 1. Overview of the proposed model.
Figure 2. Comparative analysis of cold-start problem models.
Figure 3. Heatmap analysis of user attributes influencing film preferences.
Table 1. Summary of related work on cold-start problem in recommendation systems.

| Study | Approach | Dataset | Results |
|---|---|---|---|
| Leung et al. [8] | Association Rules + CF | MovieLens | RMSE: 0.91 |
| Zhong et al. [9] | Multidimensional Matrix | Custom (services) | N/A |
| Kim et al. [10] | Error-Reflective Models | Netflix | MAE: 0.75 |
| Zhang et al. [11] | FINI (GNN) | Amazon Books | F1: 0.72 |
| Ahn [12] | Heuristic Similarity | MovieLens | MAE: 0.77 |
| Bobadilla et al. [13] | Neural Similarity | Netflix | Precision: 0.45 |
| Tian et al. [14] | Contextual Online Learning | Web services | Score Ratio: +15% |
| Martins et al. [15] | Tag-Based Genetic Algorithm | Flickr | Precision: 0.58 |
| Pereira et al. [16] | Demographic + SCOAL | MovieLens | MAE: 0.18 |
| Nguyen et al. [17] | Soft Ratings in Social Networks | Social network | MAE: 0.85 |
| Viktoratos et al. [18] | Hybrid Context-Sensitive | POIs | N/A |
| Xing et al. [19] | Co-Citation Credit Allocation | Academic papers | Accuracy: +12% |
| Lika et al. [20] | Demographic + Classification | MovieLens | MAE: 0.30 |
| Natarajan et al. [21] | LOD + Matrix Factorization | BookCrossing | RMSE: 0.80 |
| Silva et al. [22] | Popularity Bias Analysis | MovieLens | F1: 0.23 |
| Feng et al. [23] | PMF + BPR Ranking | Yelp | Precision: 0.17 |
| Herce-Zelaya et al. [24] | Social Media + Random Forest | Social network | Precision: 0.80 |
| Wahab et al. [25] | Deep Q-Learning | Synthetic | RMSE: 0.75 |
| Kawai et al. [26] | LDA + Content-Based Filtering | News articles | F1: 0.11 |
| Zarei et al. [27] | Trust-Aware CF | Epinions | MAE: 0.55 |
| Loukili et al. [28] | Co-SVD + Sparsity Reduction | MovieLens | RMSE: 1.10 |
| Zhou et al. [29] | Autoencoder (CSRec) | Netflix Prize | RMSE: 0.99 |
| Liu et al. [30] | FO-MSAN (Meta-Learning) | Yelp | MAE: 0.18 |
| Khaledian et al. [31] | Clustering + Frequent Patterns | Retail | Precision: 0.82 |
| Mishra et al. [32] | NLP + Supervised Learning | Amazon Reviews | F1: 0.78 |
| Kannout et al. [33] | FP-Growth + Clustering | E-commerce | Precision: 0.79 |
| Panteli et al. [35] | Discriminant Frequent Patterns | Retail | Precision: 0.89 |
| Chen et al. [36] | ColdGAN (GAN-based) | MovieLens | Precision: 0.36 |
| Our Work | Social Media + CatBoost | Social network | F1: 0.87, MAE: 0.21 |
Table 2. The format of the data collected by web scraping from Letterboxd.

| id | Film | Rating | User Name | X User Name |
|---|---|---|---|---|
| 79bc11435fb1 | Deadly Force | 3.0 | jacobknight | JacobQKnight |
| 4dc568b37367 | Aladdin | 2.0 | rstrahs | beforesunsct |
| 70cf6fa85a21 | Audition | 3.5 | annandre | gmobile |
| 2928a5da3d80 | Cops & Robbersons | 2.0 | bigdaddywarbuxx | Justin_Hull_ |
| 85527421b67d | Igby Goes Down | 4.5 | emilybabyy | emilybabyy |
Table 3. List of obtained attributes.

| Feature ID | Feature Name | Explanation |
|---|---|---|
| F1 | The date of creation of the X account | Numerical value |
| F2 | The total number of users' favorites/likes | Numerical value |
| F3 | Total number of followers | Numerical value |
| F4 | Total number of followings | Numerical value |
| F5 | Total number of hashtags used | Numerical value |
| F6 | A feature that indicates that the user has retweeted more | True or false |
| F7 | Number of lists of the user | Numerical value |
| F8 | The most frequently preferred day for the user to tweet | Numerical value |
| F9 | The time that the user uses most often to tweet | Numerical value |
| F10 | The source from which the user sends their tweets | Categorical value |
| F11 | The total number of tweets of the user | Numerical value |
| F12 | The overall number of interactions that the user's tweets have received | Numerical value |
| F13 | The total count of comments the user has received | Numerical value |
| F14 | A property that indicates whether there is an approved account or not | True or false |
| F15 | A feature that specifies when the user is using X in the day or at night | True or false |
Table 4. Performance measurement of boosting algorithms.

| Models | Precision | Recall | F1 Score | HR | RMSE | MAE |
|---|---|---|---|---|---|---|
| AdaBoost | 0.7148 | 0.7263 | 0.7472 | 0.1727 | 0.5754 | 0.3381 |
| Random Forest | 0.7488 | 0.7622 | 0.7634 | 0.1984 | 0.5896 | 0.3272 |
| Gradient Boosting | 0.7552 | 0.7685 | 0.7885 | 0.2365 | 0.5561 | 0.3109 |
| XGBoost | 0.8129 | 0.8246 | 0.8566 | 0.4249 | 0.4410 | 0.2331 |
| LightGBM | 0.8079 | 0.8128 | 0.8407 | 0.4568 | 0.4612 | 0.2289 |
| CatBoost | 0.8306 | 0.8379 | 0.8693 | 0.4392 | 0.4234 | 0.2151 |
Table 5. One-way ANOVA results (F1-scores).

| Source | df | SS | MS | F-Value | p-Value |
|---|---|---|---|---|---|
| Between groups | 5 | 12.45 | 2.49 | 9.87 | <0.001 |
| Within groups | 114 | 28.76 | 0.25 | | |
| Total | 119 | 41.21 | | | |
Table 6. Post hoc Tukey HSD results.

| Comparison | Mean Δ (F1) | 95% CI | p-Value |
|---|---|---|---|
| CatBoost vs. XGBoost | +0.013 | [0.005, 0.021] | 0.002 |
| CatBoost vs. LightGBM | +0.028 | [0.017, 0.039] | <0.001 |
| CatBoost vs. Gradient Boosting | +0.081 | [0.062, 0.100] | <0.001 |
| CatBoost vs. Random Forest | +0.106 | [0.085, 0.127] | <0.001 |
| CatBoost vs. AdaBoost | +0.122 | [0.098, 0.146] | <0.001 |
Table 7. The best hyperparameter values obtained after GridSearch for each model.

| Models | Best Hyperparameter Values |
|---|---|
| AdaBoost | n_estimators = 100, learning_rate = 0.05, base_estimator = DecisionTree (max_depth = 3, criterion = 'gini', min_samples_split = 5) |
| Random Forest | n_estimators = 200, max_depth = 20, min_samples_split = 5, max_features = 'sqrt' |
| Gradient Boosting | n_estimators = 300, learning_rate = 0.05, max_depth = 5, subsample = 0.8 |
| XGBoost | n_estimators = 300, learning_rate = 0.05, max_depth = 6, subsample = 0.8, colsample_bytree = 0.8, gamma = 0.1 |
| LightGBM | num_leaves = 64, learning_rate = 0.05, n_estimators = 300, max_depth = 10, feature_fraction = 0.8, bagging_fraction = 0.8 |
| CatBoost | iterations = 750, learning_rate = 0.05, depth = 8, l2_leaf_reg = 3, bagging_temperature = 0.7 |
Table 8. Suggested early stopping epoch numbers for each model.

| Models | F1 Score | Recommended Early Stopping Epoch Value |
|---|---|---|
| AdaBoost | 0.7472 | 30 |
| Random Forest | 0.7634 | 35 |
| Gradient Boosting | 0.7885 | 40 |
| XGBoost | 0.8566 | 50 |
| LightGBM | 0.8407 | 50 |
| CatBoost | 0.8693 | 45 |
Table 9. Computational efficiency of per-film model training.

| Scenario | Time (Sequential) | Time (Parallel) | Speedup |
|---|---|---|---|
| 20 top-rated films | 39.3 h | 2.1 h | 18.7× |
| Single new film | 0.8 h | 0.8 h | – |
| Weekly full update | 42.1 h | 2.3 h | 18.3× |
Table 10. Comparison of the obtained measurement results with the literature.

| Author(s) | Year | Precision | F1 Score | RMSE | MAE |
|---|---|---|---|---|---|
| Leung et al. [8] | 2008 | 0.25 | 0.17 | – | – |
| Zhong et al. [9] | 2022 | – | – | 0.91 | 0.77 |
| Kim et al. [10] | 2011 | – | – | – | 0.75 |
| Zhang et al. [11] | 2022 | – | – | 1.01 | 0.79 |
| Ahn [12] | 2008 | – | – | – | 0.77 |
| Bobadilla et al. [13] | 2012 | 0.45 | – | – | 0.74 |
| Tian et al. [14] | 2019 | – | – | 0.14 | – |
| Martins et al. [15] | 2016 | 0.58 | – | – | – |
| Pereira et al. [16] | 2015 | – | – | – | 0.18 |
| Nguyen et al. [17] | 2017 | – | – | – | 0.85 |
| Xing et al. [19] | 2021 | 0.80 | – | – | – |
| Natarajan et al. [21] | 2020 | – | – | 0.80 | 0.60 |
| Silva et al. [22] | 2019 | – | 0.23 | – | – |
| Feng et al. [23] | 2021 | 0.17 | – | – | – |
| Herce-Zelaya et al. [24] | 2020 | – | – | – | 0.30 |
| Wahab et al. [25] | 2022 | 0.80 | – | 0.75 | 0.55 |
| Kawai et al. [26] | 2022 | – | 0.11 | – | 1.52 |
| Loukili et al. [28] | 2025 | 0.32 | – | 1.10 | 0.85 |
| Zhou et al. [29] | 2025 | – | – | 0.99 | 0.87 |
| Liu et al. [30] | 2025 | 0.66 | – | – | – |
| Khaledian et al. [31] | 2025 | 0.82 | 0.72 | – | – |
| Mishra et al. [32] | 2024 | 0.84 | 0.78 | – | – |
| Kannout et al. [33] | 2024 | 0.79 | 0.69 | – | – |
| Panteli et al. [35] | 2023 | 0.89 | 0.59 | – | – |
| Chen et al. [36] | 2023 | 0.36 | 0.07 | – | – |
| Our model | 2025 | 0.83 | 0.87 | 0.42 | 0.21 |
Table 11. Benchmarking against social media-based cold-start solutions.

| Study | Social Features | Dataset | Precision | F1-Score | MAE |
|---|---|---|---|---|---|
| Herce-Zelaya et al. [24] | Bio keywords, location | Movies (X) | 0.80 | – | 0.30 |
| Mishra et al. [32] | Review sentiment, ratings | Amazon Reviews | 0.84 | 0.78 | – |
| Wahab et al. [25] | Trust networks, ratings | Synthetic | 0.80 | – | 0.55 |
| Nguyen et al. [17] | Community preferences | Social network | – | – | 0.85 |
| Our work | F5 (hashtags), F8 (temporal), F12 (engagement) | Letterboxd + X | 0.83 | 0.87 | 0.21 |