1. Introduction
The rapid growth of e-commerce has significantly altered consumer behavior by providing the convenience of a wide selection of products and a personalized shopping experience. A key technology supporting this transformation is the recommender system, which helps users to discover relevant products by analyzing their preferences, behavior, and contextual data. Product reviews and ratings are valuable sources of user information because they contribute to product credibility by providing subjective feedback. Lackermair et al. [
1] demonstrated that online reviews and ratings directly impact purchase intentions, establishing them as a primary source of consumer decision-making. However, due to the volume and unstructured nature of this feedback, users often struggle to process it efficiently. Changchit et al. [
2] further emphasized that integrating customer reviews with product specifications enhances customer satisfaction and influences purchasing decisions.
Although review data is valuable, it is insufficient to solve all recommendation challenges, especially those involving new users with limited historical interactions. To address this cold-start problem, we propose a recommendation approach that integrates three market-oriented product attributes: sales volume, sales period, and user satisfaction. The user satisfaction score is calculated using sentiment analysis of product reviews and rating distributions. These attributes are combined using a weighted ranking model with threshold-based filtering, enabling the system to recommend products that are both popular and positively reviewed.
Many recommender systems, particularly those on e-commerce platforms, are based on collaborative filtering (CF). However, the effectiveness of CF is often influenced by user search and browsing behavior. Illm et al. [
3] pointed out that CF performance degrades when users explore broad product categories or lack specific interests. To address data sparsity and improve accuracy, researchers have extensively studied hybrid approaches that combine CF with content-based filtering. Çano et al. [
4] demonstrated that these combinations effectively mitigate the cold-start problem and enhance personalization. Additionally, Sarwar et al. [
5] showed that item-based CF methods outperform user-based methods in terms of scalability and robustness.
Recent advancements in deep learning have further expanded the capabilities of recommender systems. Deep neural networks (DNNs), which are commonly used for image recognition and natural language processing, can now model user–item interactions. These models can capture complex nonlinear relationships and learn from implicit feedback data. According to Ko et al. [
6], the use of deep learning has improved recommender performance by addressing issues such as data sparsity and hidden user preferences. However, challenges remain, particularly regarding model scalability and interpretability. As consumer behavior becomes more dynamic and context-dependent, recommendation algorithms must incorporate both implicit behavioral signals and explicit market indicators.
In this study, we propose a dual-framework recommender system that addresses both cold-start and data-rich scenarios. For new users without purchase histories, we propose a ranking-based approach for recommending products that considers product attributes, such as sales volume, sales period, and user satisfaction. For existing users with historical interactions, we develop a bias neural collaborative filtering (BNCF) model that incorporates item-level popularity factors into the recommendation process. The detailed workflow and model architecture are described in
Section 3.
In summary, our contributions are twofold:
We propose a ranking-based recommendation approach for new users that integrates sentiment analysis with sales-oriented features. This approach helps to overcome the cold-start problem without relying on user history.
We introduce a bias-aware neural collaborative filtering model that incorporates product-level popularity factors to improve recommendation performance for users with prior interactions.
The remainder of this paper is structured as follows:
Section 2 reviews the related work on traditional and deep learning–based recommender systems, emphasizing sentiment analysis and hybrid models.
Section 3 describes the proposed recommendation approaches for both new and existing users, including feature extraction and model design.
Section 4 presents the experimental results, including performance evaluations, ablation studies, and hyperparameter analyses.
Section 5 discusses the findings and compares the proposed methods. Finally,
Section 6 concludes the paper and outlines directions for future research.
3. Dual-Approach Recommender System Design
This section outlines our recommendation strategies for two types of users: new users without purchase histories and returning users with purchase records. Different methods are necessary to address the diversity of user behavior on e-commerce platforms and to improve the accuracy and relevance of product recommendations.
Poriya et al. [
24] distinguish between non-personalized and user-based collaborative filtering recommender systems. Non-personalized systems use aggregate metrics, such as product popularity or average ratings, to generate recommendations that are the same for all users. In contrast, user-based collaborative filtering uses algorithms, such as the Pearson correlation coefficient, to analyze similarities between users and recommend products based on the preferences of similar users.
In this study, we propose a dual recommendation strategy. For new users without purchase histories, we utilize a content- and behavior-based approach that integrates product popularity and user feedback. For users with existing purchase data, we apply a neural collaborative filtering model that incorporates user and product bias factors. Our goal is to use natural language processing and deep learning techniques to provide personalized product recommendations to both user types, thereby enhancing the overall shopping experience.
The following subsections describe our approach in detail.
Section 3.1 presents the recommendation method for new users. It explains the data selection and preprocessing steps, defines the evaluation metrics of product performance, and introduces the composite recommendation formula.
Section 3.2 covers the method for returning users, including data acquisition, constructing the BNCF model, and extracting user- and product-specific features.
3.1. Product Recommendation Method for New Users Without Purchase Histories
This section focuses on recommendation methods for new users who have recently registered on the platform but have not yet made any purchases. In such cases, the absence of purchase behavior and user-specific attributes makes it hard to infer preferences through conventional personalized approaches. Therefore, we propose a method that generates recommendations using objective product performance indicators, such as sales data, user reviews, and market longevity.
Figure 1 illustrates the overall workflow of our product ranking system for cold-start users.
In general, products with high sales figures tend to have widespread market acceptance, reflecting attributes such as high quality, good value, and satisfactory performance. Similarly, products that maintain high sales over extended periods demonstrate sustained customer satisfaction and reliability. To capture these aspects, we introduce a recommendation strategy that evaluates products based on three dimensions: sales volume, sales period, and user satisfaction.
We constructed a multidimensional hybrid recommendation model that integrates content-based sentiment analysis with sales and temporal data. This model represents each product in a three-dimensional space, where the axes correspond to the normalized values of sales volume, sales period, and user satisfaction. Products are plotted in this space according to their performance across these attributes. A product that scores highly on all three dimensions occupies a position in the ideal region of this space, visualized in
Figure 2 as a red cube. Our recommendation algorithm prioritizes these products, aiming to present users with options that are both popular and well-reviewed over time.
The methodology begins with the collection and preparation of product metadata. When direct sales data is unavailable, we define proxy metrics for sales volume and sales period using the number and timing of user reviews. We also apply sentiment analysis to review texts to quantify user satisfaction. These metrics are then integrated into a unified recommendation score that ranks products according to their overall performance across all three dimensions.
3.1.1. Sales Volume Scoring
Due to the unavailability of direct sales data on the Rakuten website, we used the number of user reviews as a proxy for sales volume. Previous studies have demonstrated a strong correlation between the number of reviews and the number of purchases [
25,
26]. Based on this relationship, we calculated a sales volume score (Vol(a)) for a target product a using the total number of reviews.
However, we acknowledge the limitations of using review count as a practical proxy for sales volume. Depending on factors such as product type, price, longevity on the platform, and customer engagement patterns, users may be more or less likely to leave reviews, which can introduce bias into sales estimates. For example, products that elicit strong positive or negative opinions, or that are part of incentivized review campaigns, may receive a disproportionate number of reviews. Although we acknowledge the possibility of bias, using review count is commonly accepted and necessary when direct sales data is unavailable. Future work could evaluate methods that incorporate review conversion rates or supplement this proxy with additional behavioral signals to create a more accurate representation of sales velocity.
To classify products, we used the average number of reviews across all products as a threshold value. We chose this data-driven approach because it provides a simple, replicable, and objective baseline for distinguishing between popular and less popular items in our dataset. This is especially important because there are no established industry benchmarks for this product category. Products that met or exceeded this threshold value were labeled as having “high sales volume,” while those that fell below were labeled as “low sales volume.” Using this criterion, all products were classified based on their sales volume. While a formal sensitivity analysis to test different threshold values was not conducted as part of this study, we acknowledge that exploring the impact of this threshold on model performance is a valuable direction for future work.
3.1.2. Sales Period Scoring
Similar to sales volume, the Rakuten website did not provide direct data on product availability periods. Therefore, we estimated the sales period of each product by calculating the time interval between its first and last review. We used this interval as the estimated sales duration of each product.
We then assigned a sales period score (Per(a)) to the target product a. Products were classified as having either a “long” or “short” sales period, using the average sales period across all products as the benchmark.
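To make the two proxies of Sections 3.1.1 and 3.1.2 concrete, the following minimal Python sketch derives both scores from review metadata and binarizes each against its dataset-wide mean, as described above. The table layout (a hypothetical pandas DataFrame `reviews` with `product_id` and `timestamp` columns) and the sample values are our assumptions.

```python
import pandas as pd

# Hypothetical review table: one row per review, with a product ID and timestamp.
reviews = pd.DataFrame({
    "product_id": ["a1", "a1", "a2", "a2", "a2"],
    "timestamp": pd.to_datetime([
        "2019-01-05", "2019-07-20", "2019-02-11", "2019-03-01", "2019-11-30",
    ]),
})

per_product = reviews.groupby("product_id")["timestamp"].agg(["count", "min", "max"])
per_product["vol"] = per_product["count"]  # review count as the sales-volume proxy Vol(a)
per_product["per"] = (per_product["max"] - per_product["min"]).dt.days  # first-to-last review interval, Per(a)

# Binarize each proxy against its dataset-wide mean, as in Sections 3.1.1 and 3.1.2.
per_product["vol_label"] = (per_product["vol"] >= per_product["vol"].mean()).map(
    {True: "high sales volume", False: "low sales volume"})
per_product["per_label"] = (per_product["per"] >= per_product["per"].mean()).map(
    {True: "long", False: "short"})
print(per_product[["vol", "per", "vol_label", "per_label"]])
```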
3.1.3. User Satisfaction Scoring
We assessed user satisfaction by combining review sentiment and user ratings. We performed a sentiment analysis on the collected reviews using the Google Cloud Natural Language API (
https://cloud.google.com/natural-language) (accessed on 22 June 2023), which assigns sentiment scores ranging from −1.0 (most negative) to +1.0 (most positive). To ensure reproducibility,
Table 1 lists the key configuration settings used for the sentiment analysis. Unless otherwise specified, we used the default settings provided by the Google Cloud Natural Language API. The study was conducted in English in document mode using the analyzeSentiment method.
Based on standard practices in sentiment analysis and the API’s documentation, we categorized reviews into three groups, negative, neutral, and positive, by partitioning the sentiment score range with fixed thresholds. Such threshold-based partitioning is widely used to map sentiment scores onto distinct emotional classes. Each review was classified accordingly, and its sentiment score was combined with the user rating to derive a satisfaction score.
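For illustration, a review could be scored with the google-cloud-language client roughly as follows. This is a minimal sketch assuming credentials are already configured; the document-mode, English-language settings follow Table 1, while the `classify` cutoff is a placeholder of ours, not necessarily the thresholds used in the study.

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()  # assumes GOOGLE_APPLICATION_CREDENTIALS is set

def review_sentiment(text: str) -> float:
    """Return the document-level sentiment score in [-1.0, 1.0]."""
    document = language_v1.Document(
        content=text,
        type_=language_v1.Document.Type.PLAIN_TEXT,
        language="en",  # document mode, English, per Table 1
    )
    response = client.analyze_sentiment(request={"document": document})
    return response.document_sentiment.score

# Illustrative three-way bucketing; the cutoff value is a placeholder.
def classify(score: float, cutoff: float = 0.25) -> str:
    if score <= -cutoff:
        return "negative"
    if score >= cutoff:
        return "positive"
    return "neutral"
```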
We tested four methods of calculating a satisfaction score for each product by combining sentiment scores and user ratings, as determined by a preliminary experiment [
27]. The most effective method is to multiply the sentiment score by the user rating for each review and then aggregate these values. The final satisfaction score for the target product a is calculated using sentiment scores and user ratings, as defined in Equation (1):

$$\mathit{Sat}(a) = \sum_{i=1}^{n} \left( P_i - \lvert N_i \rvert + \lvert U_i \rvert \right) \cdot R_i \tag{1}$$

Here, $P_i$, $N_i$, and $U_i$ represent the positive, negative, and neutral sentiment scores, respectively, for the $i$th review (with absolute values taken for $N_i$ and $U_i$), and $R_i$ is the corresponding user rating. The sum is computed over all $n$ reviews of product a. Similar to our sales volume scoring, we used the average satisfaction score across all products as a data-driven threshold to categorize products. Products with satisfaction scores above the average were classified as “high satisfaction” and others as “low satisfaction.” Using this criterion, all target products were categorized accordingly. Future work could include a sensitivity analysis to evaluate how different satisfaction thresholds might influence the final recommendation rankings.
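The per-review aggregation of Equation (1) can be sketched as follows. The function assumes each review has already been labeled and scored as described above; the data layout and variable names are ours rather than the paper’s, and the sample values are illustrative.

```python
def satisfaction_score(reviews):
    """Sum of per-review (sentiment x rating) terms for one product, Eq. (1).

    `reviews` is a list of (label, sentiment_score, rating) tuples, where
    label is "positive", "negative", or "neutral" and sentiment_score is
    the API score for that review.
    """
    total = 0.0
    for label, score, rating in reviews:
        if label == "positive":
            total += score * rating          # P_i * R_i
        elif label == "negative":
            total -= abs(score) * rating     # -|N_i| * R_i
        else:
            total += abs(score) * rating     # |U_i| * R_i (near zero in practice)
    return total

# Threshold against the dataset mean, as in the sales volume scoring.
product_reviews = {"a1": [("positive", 0.8, 5), ("negative", -0.4, 2)],
                   "a2": [("neutral", 0.05, 3)]}
scores = {p: satisfaction_score(rs) for p, rs in product_reviews.items()}
mean_score = sum(scores.values()) / len(scores)
labels = {p: ("high satisfaction" if s >= mean_score else "low satisfaction")
          for p, s in scores.items()}
```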
3.1.4. Product Recommendation Ranking
Traditional recommendation models often rely solely on individual attributes, such as product category, brand, or user ratings. While these methods are beneficial, they may overlook key behavioral and temporal signals. To improve relevance and fairness, we integrated sales volume, sales period, and satisfaction into a unified ranking model.
Since these attributes have different scales, where satisfaction ranges from 0 to 1, sales volume spans a wide numerical range, and sales periods extend from 0 to 365 days, we normalized each dimension to ensure equal influence. The final recommendation score ($\mathit{Rec}(a)$) for the target product a is calculated as shown in Equation (2):

$$\mathit{Rec}(a) = \mathit{Vol}_{\mathrm{norm}}(a) \times \mathit{Per}_{\mathrm{norm}}(a) \times \mathit{Sat}_{\mathrm{norm}}(a) \tag{2}$$

where $\mathit{Vol}_{\mathrm{norm}}$, $\mathit{Per}_{\mathrm{norm}}$, and $\mathit{Sat}_{\mathrm{norm}}$ denote the normalized sales volume, sales period, and satisfaction scores.
Our choice of multiplicative aggregation for the composite score is deliberate. We hypothesize that the top recommended product for a new user should excel in all three dimensions: popularity (sales volume), longevity (sales period), and quality (user satisfaction). A multiplicative approach ensures that products with low scores in any of these areas are appropriately penalized in the final ranking. For example, a product with high sales volume but poor user satisfaction should not receive a high recommendation score. Multiplication naturally captures this ‘AND’ relationship between attributes, whereas an additive model allows a high score in one dimension to compensate for a low score in another.
This composite score reflects a product’s market performance, customer satisfaction, and longevity. Products are ranked based on this score to provide new users with balanced recommendations that consider both objective metrics and subjective feedback.
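A compact sketch of this scoring step is given below. Min–max normalization is our assumption for the normalization step (the paper does not fix the method), and the sample values are purely illustrative.

```python
import numpy as np

def recommendation_scores(vol, per, sat):
    """Multiplicative composite of Equation (2) over aligned product arrays.

    vol, per, sat are 1-D arrays of raw sales-volume, sales-period, and
    satisfaction scores for the same products. Each dimension is min-max
    normalized so no attribute dominates, then the three are multiplied so
    a weak score on any axis drags the product down ('AND' semantics).
    """
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.ones_like(x)
    return minmax(vol) * minmax(per) * minmax(sat)

# Ranking: highest composite score first.
scores = recommendation_scores([120, 40, 300], [200, 30, 350], [4.1, 2.0, 3.5])
ranking = np.argsort(-scores)
```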
While this multiplicative model provides a robust and interpretable baseline, we acknowledge that its heuristic nature is a limitation. We could explore alternative aggregation methods. For example, a weighted linear combination, $w_1 \mathit{Vol}_{\mathrm{norm}}(a) + w_2 \mathit{Per}_{\mathrm{norm}}(a) + w_3 \mathit{Sat}_{\mathrm{norm}}(a)$, would enable more precise control over the importance of each attribute. Furthermore, incorporating advanced learnable or adaptive weighting mechanisms, such as those based on linear regression or neural networks, could improve performance because the optimal weighting scheme may depend on the user or the context. Exploring such adaptive approaches is a promising direction for future research.
3.2. Product Recommendation Method for Existing Users with Purchase Histories
Collaborative filtering is a widely used strategy for users with an established purchase history. It uses historical data on user–product interactions to improve recommendation accuracy. To further enhance recommendation performance, we propose a bias neural collaborative filtering (BNCF) model. The BNCF model is based on the traditional neural network-based collaborative filtering (NCF) framework and incorporates three product-related bias factors: sales volume, sales period, and user satisfaction. The architecture of the BNCF model is shown in
Figure 3.
First, we calculate the bias scores for each product using the methodology described in
Section 3.1. Then, we incorporate these scores into the BNCF model alongside the neural feature extraction layers. The BNCF model consists of embedding, hidden, and output layers. These layers are designed to capture latent user–product interactions. Finally, we calculate the recommendation score by combining interaction features and bias-based scores. We train the model using a Bayesian personalized ranking (BPR) loss function, which models user preferences through pairwise ranking and enables personalized optimization. The remainder of this subsection provides a detailed explanation of the BNCF model design and training process.
3.2.1. Construction of BNCF Model
The neural collaborative filtering (NCF) model employs embedding layers to map users and products onto a shared low-dimensional latent space. It also uses a multilayer perceptron (MLP) to model complex nonlinear interactions. Our BNCF model builds on this by incorporating product and user bias information into the NCF framework, thereby improving the accuracy of personalized recommendations. The architecture is depicted in
Figure 4.
The BNCF model consists of two main components:
- 1.
Extraction of User and Product Latent Features
First, we initialize the user and product embedding vectors. Each vector is $n$-dimensional. Then, we pass these embeddings through two fully connected layers to increase feature expressiveness. Specifically, the embeddings are processed through two-layer neural networks with ReLU activation to produce refined feature vectors $h_u$ and $h_i$, which are defined as shown in Equations (3) and (4), respectively:

$$h_u = \mathrm{ReLU}\left(W_2 \, \mathrm{ReLU}\left(W_1 e_u + b_1\right) + b_2\right) \tag{3}$$

$$h_i = \mathrm{ReLU}\left(W'_2 \, \mathrm{ReLU}\left(W'_1 e_i + b'_1\right) + b'_2\right) \tag{4}$$

In our study, $h_u$ and $h_i$ represent the final hidden layer outputs for user u and item i, respectively. ReLU (rectified linear unit) is a commonly used activation function, defined as $\mathrm{ReLU}(x) = \max(0, x)$. $e_u$ and $e_i$ are the embedding vectors for user u and item i, respectively. $W_1$ and $W_2$ denote the weight matrices for the first and second layers of the user network, and $b_1$ and $b_2$ are the corresponding bias terms. $W'_1$ and $W'_2$ denote the weight matrices for the first and second layers of the item network, and $b'_1$ and $b'_2$ are the corresponding bias terms.
After obtaining the feature vectors $h_u$ and $h_i$ for the user and the item, respectively, the first interaction score ($s_1$) is obtained by computing their dot product, as shown in Equation (5):

$$s_1 = h_u^{\top} h_i \tag{5}$$
- 2.
Incorporation of User and Product Bias Features
Each user is assigned a three-dimensional bias embedding vector $b_u$, learned during training. Each product is similarly represented by a three-dimensional vector $b_i$ of normalized bias values corresponding to sales volume, sales period, and user satisfaction. Next, the second interaction score ($s_2$) is computed by taking the dot product of the user bias vector ($b_u$) and the product bias vector ($b_i$), as shown in Equation (6). It reflects the bias in the user–product relationship and, as an additional component of the final recommendation score, improves the accuracy and flexibility of the model:

$$s_2 = b_u^{\top} b_i \tag{6}$$

The final prediction score is the sum of the two components, as shown in Equation (7):

$$\hat{y}_{ui} = s_1 + s_2 \tag{7}$$
This formulation allows the model to generate more accurate context-aware recommendations by using learned latent features and explicit bias signals.
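The forward pass of Equations (3)–(7) can be sketched in PyTorch as follows. This is a schematic rendering under our assumptions, not the authors’ exact implementation: layer widths follow the 32–64–32 baseline used later, `item_bias` is a precomputed (n_items × 3) matrix of normalized sales volume, sales period, and satisfaction values, and the user bias vectors are learned embeddings.

```python
import torch
import torch.nn as nn

class BNCF(nn.Module):
    """Sketch of the bias neural collaborative filtering forward pass."""

    def __init__(self, n_users, n_items, item_bias, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        # Two fully connected layers per side (Equations (3) and (4)), 32-64-32.
        self.user_net = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim), nn.ReLU())
        self.item_net = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim), nn.ReLU())
        # Learnable 3-d user bias vectors; fixed 3-d item bias features
        # (normalized sales volume, sales period, satisfaction).
        self.user_bias = nn.Embedding(n_users, 3)
        self.register_buffer("item_bias", torch.as_tensor(item_bias, dtype=torch.float32))

    def forward(self, u, i):
        h_u = self.user_net(self.user_emb(u))                       # Eq. (3)
        h_i = self.item_net(self.item_emb(i))                       # Eq. (4)
        s1 = (h_u * h_i).sum(dim=-1)                                # Eq. (5): dot product
        s2 = (self.user_bias(u) * self.item_bias[i]).sum(dim=-1)    # Eq. (6): bias interaction
        return s1 + s2                                              # Eq. (7)
```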
3.2.2. Bayesian Personalized Ranking Loss
To optimize the BNCF model, we used the Bayesian personalized ranking (BPR) loss function [
28], which is suitable for tasks involving implicit feedback recommendations. BPR formulates preference learning as a ranking problem, ensuring that observed (positive) items are ranked higher than unobserved (negative) items for a given user.
The objective of training is to maximize the difference in preference between positive and negative item pairs using a sigmoid-based loss function. For a given user u, a positive item i, and a negative item j, the BPR loss is defined as in Equation (8):

$$L_{\mathrm{BPR}} = -\sum_{(u,i,j)} \ln \sigma\left(\hat{y}_{ui} - \hat{y}_{uj}\right) \tag{8}$$

where $\sigma(\cdot)$ is the sigmoid function and $\hat{y}_{ui}$ and $\hat{y}_{uj}$ are the predicted scores for the positive and negative items, respectively.
This formulation enables the model to optimize personalized rankings by updating latent factors that directly reflect user preferences.
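A minimal PyTorch rendering of Equation (8), assuming `model` returns the prediction score of Equation (7); the L2 regularization of the original BPR objective is handled separately here through the optimizer’s weight decay.

```python
import torch.nn.functional as F

def bpr_loss(model, u, i, j):
    """BPR loss: -log sigma(y_ui - y_uj), averaged over the (u, i, j) batch."""
    return -F.logsigmoid(model(u, i) - model(u, j)).mean()
```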
3.2.3. Model Training
We set the embedding dimensionality to 32 and incorporated normalized sales volume, sales period, and user satisfaction as bias inputs. During training, we generated user–item triplets (user u, positive item i, and negative item j) and trained the model to satisfy the inequality $\hat{y}_{ui} > \hat{y}_{uj}$.
The model was initialized with random parameters and optimized using the Adam optimizer [
29]. The hyperparameters were configured as follows:
Learning rate: 0.001;
Weight decay: 0.00001;
Mini-batch size: 256;
Epochs: 200.
The model parameters were updated using mini-batch stochastic gradient descent and backpropagation until convergence was reached.
Figure 5 illustrates the training flow.
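Putting the pieces together, a training loop consistent with the settings above might look as follows; `sample_triplets` is a placeholder of ours for a mini-batch triplet generator (one is sketched in Section 4.2.3), assumed to yield batches of 256 triplets.

```python
import torch
import torch.nn.functional as F

def train_bncf(model, sample_triplets, epochs=200, lr=0.001, weight_decay=1e-5):
    """Mini-batch BPR training with the hyperparameters listed above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        for u, i, j in sample_triplets():  # fresh (u, i, j) mini-batches each epoch
            optimizer.zero_grad()
            loss = -F.logsigmoid(model(u, i) - model(u, j)).mean()  # BPR loss, Eq. (8)
            loss.backward()    # backpropagation
            optimizer.step()   # Adam parameter update
    return model
```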
To evaluate the model’s performance, we conducted experiments under two settings: one that incorporated user bias and one that did not. For each positive sample in the training data (i.e., a purchased product), we generated a negative sample (i.e., a randomly selected product the user had not purchased). Training continued until the loss stabilized.
We conducted the evaluation using two widely adopted metrics:
NDCG@k (Normalized Discounted Cumulative Gain) evaluates both the relevance and ranking position of recommended items. We report NDCG@5 and NDCG@10 to measure the quality of the top five and top ten recommendations, respectively.
HR@k (Hit Rate) is the proportion of recommended items with which the user engaged among the top k items. We used HR@5 and HR@10 to evaluate the model’s ability to recall recommendations for lists of different sizes.
Combining NDCG and HR provides a more comprehensive evaluation of the recommender system. NDCG focuses on ranking quality, while HR evaluates coverage and alignment with user interests. Together, these metrics offer valuable insights into the system’s accuracy, relevance, and user experience.
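Under the leave-one-out protocol used here (one held-out item per test user), both metrics reduce to simple functions of the held-out item’s rank in the recommendation list. A sketch, with hypothetical rank values:

```python
import numpy as np

def hr_ndcg_at_k(rank, k):
    """HR@k and NDCG@k for a single held-out positive item.

    `rank` is the 0-based position of the user's test item in the ranked
    recommendation list. With one relevant item, NDCG@k reduces to
    1 / log2(rank + 2) when the item appears in the top k, else 0.
    """
    if rank < k:
        return 1.0, 1.0 / np.log2(rank + 2)
    return 0.0, 0.0

# Average over users to obtain test-set HR@k and NDCG@k.
ranks = [0, 3, 12, 7]  # hypothetical test-item positions for four users
hits, ndcgs = zip(*(hr_ndcg_at_k(r, k=10) for r in ranks))
print(f"HR@10={np.mean(hits):.3f}, NDCG@10={np.mean(ndcgs):.3f}")
```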
4. Model Architecture Analysis and Evaluation
In this section, we present a systematic evaluation of the proposed model’s architecture. The evaluation proceeded in two phases. In the first phase, we adjusted the model’s structure and parameters and trained it on a designated training dataset. In the second phase, we evaluated the trained models on a test dataset and conducted a comparative analysis to identify the most effective architecture. We began by implementing a baseline model without item bias and comparing it to an extended model that incorporated item bias factors. Based on the results, we then experimented with varying the number of negative samples, the number of hidden layers, and the dimensionality of the embedding vectors, applying the same training and testing procedure to select the optimal model configuration.
4.1. Dataset and Preprocessing
The dataset used in this study was sourced from the Rakuten Ichiba data available in the “Rakuten Dataset,” provided by Rakuten Group, Inc. through the IDR Dataset Service of the National Institute of Informatics (NII) [
30]. The dataset spans the entirety of 2019 and contains monthly records across five product categories: bags, home appliances, cosmetics, women’s fashion, and shoes.
For each product, we extracted relevant information, including product name, product ID, genre ID, user rating, review content, and review timestamp. The dataset also includes user ID and product-level metadata, such as the number of reviews. We applied data cleaning procedures to remove duplicates, null values, and improperly formatted entries.
We constructed two distinct subsets of the dataset to evaluate the two proposed recommendation approaches:
Cold-start recommendation (for users without purchase histories): We selected a sample of 300 products and approximately 18,000 reviews. This subset was used to calculate product-level recommendation scores based on sentiment, sales volume, and sales period, as described in
Section 3.1.
Personalized recommendation (for users with prior interactions): We filtered the dataset to include only users who had interacted with at least two products and products purchased by at least two users. The resulting interaction matrix included 4570 users, 4774 products, and 16,241 user–item interactions. To evaluate the BNCF model, we split the dataset by user into training and test sets. One interaction was randomly selected per user for the test set (706 users), while the remaining were used for training (3723 users). The result is 11,057 training interactions and 706 test interactions.
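A pandas sketch of this filtering and leave-one-out split is shown below. The exact order of the filtering steps, the random seed, and the `interactions.csv` file name are our assumptions.

```python
import pandas as pd

# Hypothetical purchase log: one row per (user_id, item_id) interaction.
interactions = pd.read_csv("interactions.csv")

# Keep users with at least two items and items bought by at least two users.
interactions = interactions[interactions.groupby("user_id")["item_id"].transform("count") >= 2]
interactions = interactions[interactions.groupby("item_id")["user_id"].transform("count") >= 2]

# Sample test users, hold out one random interaction for each, train on the rest.
test_users = interactions["user_id"].drop_duplicates().sample(n=706, random_state=42)
test = (interactions[interactions["user_id"].isin(test_users)]
        .groupby("user_id").sample(n=1, random_state=42))
train = interactions.drop(test.index)
```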
4.2. Experimental Settings
4.2.1. Experimental Environment
All experiments were conducted on a consistent hardware and software platform to ensure the reproducibility of our results. The specific settings for our experimental environment are detailed in
Table 2.
4.2.2. Hyperparameter Tuning
The BNCF model was initialized with random parameters and optimized using the Adam optimizer [
29]. To determine the optimal configuration, we conducted hyperparameter tuning through a grid search strategy. The validation set was created by holding out the last interaction of each user from the training data. We monitored the training and validation loss over 200 epochs of model training.
Table 3 presents the search ranges for each hyperparameter along with the optimal values selected. The final configuration includes an embedding size of 32, 4 negative samples per positive, 2 MLP layers, a learning rate of 0.001, a mini-batch size of 256, and a weight decay of 0.00001.
4.2.3. Loss Function and Training Sample Generation
For model training, we adopted the BPR loss, a pairwise objective function well-suited for implicit feedback datasets. We employed the standard BPR triplet sampling strategy for training both the NCF and BNCF models.
For each observed user–item interaction $(u, i)$, which is treated as a positive sample, we randomly selected one or more items j from the set of items that user u has not interacted with. Each training instance thus forms a triplet $(u, i, j)$, and the model is optimized to maximize the probability that user u’s predicted preference score for positive item i exceeds that for negative item j. This negative sampling was performed uniformly and without replacement from the user’s set of non-interacted items. The sampling process was repeated dynamically in each training epoch to provide the model with diverse negative examples over time. The number of negative samples generated per positive instance was controlled as a tunable hyperparameter, with the optimal value detailed in our hyperparameter settings (see
Table 3).
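The sampling strategy described above can be sketched as follows; `user_pos` (a mapping from user ID to the set of interacted item IDs) and the integer ID scheme are our assumptions.

```python
import random
import torch

def sample_triplets(user_pos, n_items, n_neg=4, batch_size=256):
    """Yield (u, i, j) ID batches, resampling negatives every call (epoch).

    For each observed (u, i) pair, n_neg negatives j are drawn uniformly,
    without replacement, from the items user u has never interacted with.
    """
    triplets = []
    for u, pos_items in user_pos.items():
        for i in pos_items:
            negs = set()
            while len(negs) < n_neg:
                j = random.randrange(n_items)
                if j not in pos_items and j not in negs:  # without replacement
                    negs.add(j)
            triplets.extend((u, i, j) for j in negs)
    random.shuffle(triplets)
    for start in range(0, len(triplets), batch_size):
        batch = triplets[start:start + batch_size]
        u, i, j = (torch.tensor(col) for col in zip(*batch))
        yield u, i, j
```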
4.3. Accuracy Verification of Product Recommendation Rankings for New Users Without Purchase Histories
- 1.
Data Processing
We used the same dataset described in
Section 4.1, which includes review data for the entire year of 2019. We selected five main product categories, each comprising three subcategories. Within each subcategory, we classified products based on the number of reviews and selected the top 20 products from each subcategory, resulting in a total of 300 products. Next, we performed sentiment analysis on the review texts and calculated three key scores for each product: sales volume score, sales period score, and user satisfaction score. These scores were then binarized (“high” or “low”) using threshold-based analysis and normalized. An overall score was subsequently computed to rank products within each subcategory, facilitating the classification and selection of products for further evaluation.
- 2.
Experimental Method
We randomly selected one subcategory from each of the five product categories. Then, we randomly selected 20 products from each subcategory, for a total of 100 products. To evaluate the accuracy of the top ten recommended product rankings, we surveyed twelve university students using a questionnaire. Each participant received comprehensive product information, including the product name, ID, genre ID, user rating, review title and text, sales volume, and sales period. After reviewing this information and the associated reviews, the participants selected their ten preferred products from each subcategory. This questionnaire did not require ethics committee approval, as it did not involve animal or human clinical trials and did not raise significant ethical concerns. The study adhered to the ethical principles outlined in the Declaration of Helsinki. All participants were informed of the questionnaire’s purpose and provided informed consent before participating; they were guaranteed anonymity and confidentiality, and participation was entirely voluntary. Each product received at least six evaluations. To measure ranking accuracy, we employed three widely used ranking metrics: mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain at rank 10 (NDCG@10), computed using the participants’ rankings as the ground truth (a sketch of these computations follows the results below).
- 3.
Experimental Results
As shown in
Table 4, the average MRR across all subcategories is 0.323. The highest MRR is observed in the “Dryer” category (0.450), while the lowest is in “Sweater” (0.142), suggesting suboptimal accuracy for specific product types. The average MAP score is 0.618, with “Suitcase” achieving the highest (0.676) and “Essential Oil” the lowest (0.572), indicating moderate precision in ranking. The average NDCG@10 score is 0.567. The highest score is in the “Suitcase” category (0.623), and the lowest scores are in the “Essential Oil” and “Leather Shoes” categories (0.516 and 0.517, respectively), showing relatively acceptable top-10 ranking performance.
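For reference, the three metrics can be computed per subcategory as in the sketch below, treating the participants’ top-ten choices as the relevant set; the function names and the binary-relevance treatment are our assumptions.

```python
import numpy as np

def mrr_map_ndcg10(system_rank, relevant):
    """MRR, MAP, and NDCG@10 for one subcategory.

    system_rank: item IDs in system order (best first).
    relevant: set of item IDs the participants chose as their top ten.
    """
    hits, precisions, first, dcg = 0, [], None, 0.0
    for idx, item in enumerate(system_rank):
        if item in relevant:
            if first is None:
                first = idx                      # first relevant position (0-based)
            hits += 1
            precisions.append(hits / (idx + 1))  # precision at each relevant hit
            if idx < 10:
                dcg += 1.0 / np.log2(idx + 2)    # binary-relevance DCG@10
    mrr = 0.0 if first is None else 1.0 / (first + 1)
    ap = float(np.mean(precisions)) if precisions else 0.0
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(10, len(relevant))))
    ndcg10 = dcg / idcg if idcg > 0 else 0.0
    return mrr, ap, ndcg10
```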
4.4. Comparison of NCF Model with and Without Bias
To evaluate the effectiveness of incorporating item bias into the proposed BNCF model, we compared it to a baseline NCF model that lacked bias factors. All experiments used 32-dimensional embedding vectors, a 64-dimensional hidden layer, and a 32-dimensional output layer. To ensure a fair comparison, we held the BPR loss function and other hyperparameters constant across models. Each model was trained for 200 epochs. We performed evaluations using NDCG and HR at k = 5 and k = 10.
Table 5 summarizes the results.
As shown in
Figure 6, increasing the number of negative samples improves both NDCG and HR metrics for both models. This improvement stems from the model’s enhanced ability to learn from more negative examples, which strengthens its generalization capability. However, using too many negative samples can significantly increase training time and the risk of overfitting, which highlights the importance of balancing performance gains against computational cost.
The BNCF model outperforms the NCF model when negative samples are scarce, as evidenced by its higher NDCG@5 and NDCG@10 scores. This suggests that the bias factors provide contextual information that helps the model capture latent user–item interactions more accurately. As the number of negative samples increases, however, the performance gap between the two models narrows, although the BNCF model still performs slightly better.
On the other hand, the unbiased NCF model improves more consistently as negative samples are added, suggesting that it relies more heavily on interaction data. Larger sample sizes allow the model to identify user preferences more accurately, thereby improving hit rates and ranking quality. However, in this scenario, the relative contribution of item bias decreases while model complexity increases, which could lead to overfitting.
Therefore, we recommend using the BNCF model when the number of negative samples is limited, as the bias factors significantly improve performance. For large-scale negative sampling with sufficient computational resources, the simpler NCF model without bias can maintain competitive performance while reducing the risk of overfitting and training overhead.
4.5. Comparison with Baseline Models
To validate the effectiveness of our proposed BNCF model, we conducted a comparative evaluation against three baseline models: the standard NCF, a classic model with explicit bias terms (BiasedMF) [
16], and a representative state-of-the-art model based on graph neural networks (LightGCN) [
31].
NCF: This model serves as our foundational baseline. Since it does not incorporate any form of bias modeling, it is used to isolate the performance gains achieved by our proposed bias integration.
BiasedMF: It is a classical matrix factorization model that incorporates scalar user and item bias terms. This model provides a basis for evaluating the effectiveness of explicit bias modeling in traditional recommender systems.
LightGCN: This model represents a state-of-the-art approach based on graph convolutional networks. It captures high-order connectivity within the user–item interaction graph and is widely recognized for its strong performance in recommendation tasks.
All baseline models were implemented using publicly available libraries. Their hyperparameters were optimized using our validation set to ensure a fair and meaningful comparison.
As shown in
Table 6, the standard NCF model achieves an NDCG@10 of 0.6209 and an HR@10 of 8.07%. Its lack of bias modeling limits its ability to personalize recommendations based on item-specific factors. While BiasedMF introduces explicit bias factors, it relies on linear interactions and cannot capture complex user–item dynamics; consequently, its performance is substantially lower, with an NDCG@10 of 0.4779 and an HR@10 of only 1.98%.
In contrast, LightGCN uses the user–item graph structure and propagates latent preferences through multiple graph convolution layers. This model demonstrates strong overall performance, achieving an NDCG@10 of 0.9721 and an HR@10 of 19.73%.
Our proposed BNCF model combines the expressiveness of deep neural networks with domain-aware bias integration. By incorporating normalized sales volume, sales period, and user satisfaction scores into the embedding space, the BNCF model effectively captures both business-driven signals and nonlinear user–item interactions. The model achieves the highest NDCG@10 of 0.9848, indicating superior ranking precision. Its HR@10 reaches 19.69%, which is comparable to the performance of LightGCN.
These results confirm the advantage of combining multidimensional bias information with deep collaborative filtering frameworks. The BNCF model provides accurate, personalized recommendations, outperforming traditional matrix factorization methods and advanced graph-based models in terms of ranking accuracy.
4.6. Ablation Study of Bias Components
To assess the individual contribution of each proposed bias component in the BNCF model, we conducted an ablation study. Specifically, we created three variants of the model by selectively removing one bias factor at a time. These factors include sales volume (Vol), sales period (Per), and user satisfaction (Sat). Each variant retains the remaining two components. For comparative purposes, we also included the standard NCF model, which excludes all bias factors entirely.
Each model variant was trained under the identical configuration detailed in
Section 4.2, with a mini-batch size of 256 and 4 negative samples per positive instance. The performance results, evaluated using NDCG@10 and HR@10, are presented in
Table 7.
The results in
Table 7 provide valuable insights into the role of each bias component. All model variants that incorporate bias factors outperform the standard NCF model. These results confirm the effectiveness of incorporating product-level attributes into the recommendation process.
Among the three components, sales volume (Vol) has the strongest influence on hit rate: removing this factor causes HR@10 to decline from 19.69% to 19.41%, indicating that product popularity plays a critical role in enhancing user engagement. In contrast, user satisfaction (Sat) has the greatest effect on ranking precision. Its removal reduces NDCG@10 from 0.9848 to 0.9773 while slightly increasing HR@10, which suggests a shift toward recommending more popular but potentially less relevant items.
The sales period (Per) component exhibits the least impact on both metrics. Its removal leads to a slight increase in NDCG@10, suggesting that this feature may provide limited additional value when sales volume (Vol) and user satisfaction (Sat) are already present.
In summary, this ablation study shows that the full BNCF model provides the most balanced and effective performance. Sales volume (Vol) contributes to exposure and engagement, while user satisfaction (Sat) enhances personalized ranking. The combination of these diverse factors enables the model to provide recommendations that are both accurate and user-centered.
4.7. Analysis of Hidden Layer Structures in BNCF Model
To enhance the model’s capacity for capturing intricate user–item interactions, we investigated the impact of various hidden layer architectures on the performance of the BNCF model. Specifically, we modified the original NCF model by introducing additional hidden layers to evaluate their effect. We defined the baseline architecture (Model A) as 32–64–32. We created three enhanced variants: Model B had one additional hidden layer, and Model C had two additional hidden layers. The bias component structure remained unchanged across all models.
These modifications were made based on the idea that deeper architectures can better capture complex nonlinear interactions, which could lead to richer feature representations and improved generalization. To analyze the interaction between architecture depth and negative sampling, we evaluated each model using 1, 4, and 8 negative samples.
As shown in
Table 8, increasing the model depth results in clear performance gains. For example, Model B achieved significant improvements over Model A when using one negative sample: both NDCG@5 and NDCG@10 increased markedly, and HR@5 rose approximately eightfold, with HR@10 also improving by a large multiple. These results suggest that the added hidden layer in Model B significantly enhances the model’s expressiveness.
As shown in
Figure 7, increasing the number of negative samples from 1 to 4 substantially improves performance. However, the marginal benefit diminishes when the number of negative samples increases from 4 to 8. Furthermore, models with excessive depth may encounter training instability. For example, Model C exhibited numerical instability during loss computation, particularly at higher negative sampling rates. This problem is common in deep networks, as vanishing or exploding gradients can compromise the training process. As suggested by Kloberdanz et al. [32], introducing a small epsilon into the logarithmic function can mitigate such numerical issues, particularly by avoiding undefined operations like $\log(0)$.
Among the three variants, Model B consistently delivers the best overall performance, especially when using four or more negative samples. Model C, while deeper, does not provide significant performance gains over Model B and introduces greater computational and numerical complexity. Model A, as the baseline, shows the weakest performance.
These results indicate that Model B (32–64–64–32) offers an effective trade-off between expressiveness and computational efficiency. Its moderate complexity allows it to model nonlinear user–item interactions effectively without incurring the training difficulties observed in deeper models. Moreover, using four negative samples provides a practical balance between recommendation performance and computational cost. Based on these findings, we conclude that Model B with four negative samples is the optimal configuration for our BNCF model.
4.8. Analysis of Embedding Dimensions in BNCF Model
We investigated how the size of the embedding dimension affects the performance of the BNCF model. The original model, BNCF-32, used embeddings with dimensions of 32, 64, and 32. Next, we constructed a larger variant, BNCF-64, with dimensions of 64, 128, and 64. Both models included item bias terms of equal size. We conducted experiments using 1, 4, and 8 negative samples.
As illustrated in
Table 9 and
Figure 8, increasing the embedding dimension did not lead to performance improvement. BNCF-32 outperformed BNCF-64 across all evaluation metrics. For example, when using one negative sample, BNCF-32 achieved significantly higher NDCG@5 (0.7293 vs. 0.5725) and HR@5 (1.98% vs. 1.27%). Even when using eight negative samples, the improvement in BNCF-64 remained marginal and did not surpass BNCF-32.
Several factors may explain this degradation. First, increasing embedding dimensions substantially enlarges the parameter space, which raises the risk of overfitting, particularly in scenarios with limited training data. Second, complex models are more sensitive to noise and require more robust regularization. Third, deep models with large embeddings are susceptible to vanishing or exploding gradients, which complicates the training process and can lead to suboptimal convergence.
In contrast, BNCF-32 provides a well-balanced configuration: its complexity is sufficient to capture meaningful user–item interaction patterns without introducing the instability and overfitting risks associated with larger embeddings. The results suggest that further enlarging embedding dimensions beyond a certain threshold does not guarantee improved performance and may hinder generalization.
In conclusion, the original BNCF-32 architecture demonstrates superior robustness and efficiency compared to its higher-dimensional counterpart. It effectively balances representational power, training stability, and computational cost, making it the preferred choice for practical applications.
5. Discussion
In
Section 3.1 and
Section 3.2, we introduced two distinct recommendation methods. Despite the differences in dataset composition and experimental design, a comparison of the NDCG metrics allows us to assess and contrast their effectiveness. We refer to the recommendation approach for users with no purchase histories as Method 1 and the approach tailored for users with prior purchase behavior as Method 2.
5.1. Method 1: Rule-Based Recommendation for Cold-Start Users
In Method 1, we focused on five product categories: suitcases, sweaters, dryers, essential oils, and leather shoes, selecting 20 items per category for a total of 100 products. Recommendations were generated by ranking the products based on a composite score derived from sales volume, sales duration, and user satisfaction.
The experimental results showed an average NDCG@10 of 0.567, with the suitcase category achieving the highest score (0.623). Categories such as essential oils (0.516) and leather shoes (0.517) exhibited lower performance. Upon analysis, we attribute this disparity to the uniform weighting of the factors in our scoring function. Although the method assumes that users value sales volume, sales period, and satisfaction equally, user preferences vary significantly. Some users prioritize longevity in sales (sales period), while others focus more on recent popularity (sales volume) or product quality (user satisfaction). This variability leads to a divergence between system-generated rankings and user-perceived relevance.
For example, a product with a high sales volume, a short sales duration, and high customer satisfaction is likely to be popular and widely accepted, and these attributes attract users. However, a model that applies a static weighting formula may misrepresent individual user intent, resulting in less precise recommendations. Evaluation metrics such as MAP and NDCG@10 reflect this issue: they typically fall between 0.5 and 0.6 in our experiments, suggesting that nearly half of the recommendations deviate from user preferences.
5.2. Method 2: Deep Learning-Based BNCF for Personalized Recommendation
In Method 2, we employed a larger and more complex dataset consisting of 4570 users, 4774 items, and 16,241 records of user–item interactions. We proposed the BNCF model, which integrates three item-level bias features—sales volume, sales period, and user satisfaction—into the standard NCF architecture. After testing, we randomly selected the bias embedding vector parameters of 20 users from the model that showed the best performance and mapped them to the 0–1 interval using the sigmoid function.
Figure 9 shows the relationship between the three bias parameters of the users.
Under different configurations of hidden layers and negative sampling rates (1, 4, and 8), the model demonstrated strong performance. For instance, with one negative sample per user, NDCG@10 reached 0.8802; with four negative samples, it increased further to 0.9848. These results are markedly superior to those obtained in Method 1.
Several factors contributed to this performance improvement:
- 1.
Rich and Diverse Dataset:
Method 2 benefits from a significantly larger dataset, allowing the model to learn more nuanced patterns of user preferences and item interactions, which improves the generalization and accuracy of recommendations.
- 2.
Advanced Model Architecture:
Method 2 uses a neural collaborative filtering framework to capture the complex nonlinear interactions between users and items. This is a significant advantage over Method 1’s rule-based approach, which cannot adapt to intricate user behavior.
- 3.
Integration of Bias Features:
The inclusion of item-specific bias factors enhanced the model’s contextual understanding of products. By embedding attributes such as popularity and user satisfaction directly into the model, the BNCF approach provided a richer and more personalized recommendation output.
Collectively, these factors illustrate the importance of employing sophisticated models and multidimensional item data to improve the quality of recommendations.
5.3. Comparative Model Analysis Within Method 2
To further refine Method 2, we conducted comparative experiments by varying the architecture of the BNCF model. Specifically, we tested different hidden-layer configurations and embedding dimensions. The goal was to identify an optimal balance between model complexity, expressive capacity, and generalization performance.
Among the configurations tested, the 32–64–64–32 architecture consistently outperformed the others, as indicated in
Figure 10. This model achieved the best overall performance across both the NDCG and HR metrics. Compared to the simpler 32–64–32 structure or the larger 64–128–64 configuration, this architecture provided enhanced feature representation without incurring significant computational overhead or overfitting.
Our experiments also underscored the importance of the following:
Managing Numerical Stability: During training, we encountered instability issues in deeper models (e.g., gradient explosion/vanishing). Mitigating these through regularization techniques and loss function smoothing (e.g., adding small constants to avoid $\log(0)$) proved essential for stable optimization.
Selecting Optimal Negative Sampling Ratios: We found that increasing the number of negative samples to four substantially improved performance. However, further increases yielded diminishing returns and increased computational cost. Thus, using four negative samples offered a practical trade-off between performance and efficiency.
In summary, Method 2 demonstrates that thoughtful architectural design, feature integration, and training strategies can dramatically enhance recommendation performance. In real-world applications, the combination of deep learning techniques, multidimensional item features, and user interaction data offers a robust solution to the challenges faced by modern recommender systems.
6. Conclusions
In this study, we proposed two recommendation strategies tailored to different user scenarios: one for users without purchase histories and another for users with prior interactions. For users without purchase histories, we developed a rule-based recommendation method grounded in three key product attributes: sales volume, sales period, and user satisfaction. We quantified user satisfaction through sentiment analysis combined with rating scores. Then, we derived a composite score through threshold analysis to rank products. We evaluated the system’s effectiveness using product review data from Rakuten Ichiba. The results showed that the method could identify and recommend products with high user satisfaction, strong sales, and extended sales cycles. However, it failed to align with individual user preferences due to the lack of personalized behavioral data.
To address this limitation, we proposed a BNCF model for users with purchase histories. The BNCF model extends the standard neural collaborative filtering framework by incorporating item-level bias factors, specifically sales volume, sales period, and user satisfaction, into the recommendation process. We conducted extensive experiments to examine the impact of various conditions, including the inclusion of bias factors, different numbers of negative samples, variations in model depth, and embedding dimensionality. These experiments led to the identification of an optimal BNCF configuration, which demonstrated superior performance in both ranking quality and accuracy compared to the baseline models.
The comparative analysis of the two methods provided several key insights. Users without purchase histories present a significant challenge: the absence of interaction data restricts the model to general recommendations based solely on product-level attributes, which considerably limits the capacity to deliver personalized and precise recommendations. In contrast, users with purchase histories provide valuable behavioral data that can be used to model intricate user–item interactions. The BNCF model applies deep learning techniques to identify these latent relationships, producing more precise and customized recommendations. The experimental results confirmed that this approach significantly improves recommendation accuracy and user satisfaction.
Future research will aim to expand the dataset and improve product classification granularity for users without purchase histories. We plan to represent product features within a three-dimensional space and introduce a dynamic weighting mechanism to account for variations in user preference priorities.
For users with prior interactions, future efforts will focus on enhancing the BNCF framework through the integration of the following techniques:
Graph neural networks (GNNs) can model complex relationships among users and items across different types of interactions.
Cross-modal attention mechanisms effectively fuse multiple data sources, including text, images, and numbers.
Residual connections can stabilize training and enhance the performance of deeper network architectures.
These advancements are expected to improve the accuracy, robustness, and adaptability of recommender systems in real-world applications.