Article

Enhanced Recommender System with Sentiment Analysis of Review Text and SBERT Embeddings of Item Descriptions

1
Department of Computer Science and Engineering, Chung-Ang University Seoul, Dongjak-gu, Heuksuk-ro 84, Seoul 06974, Republic of Korea
2
Department of Artificial Intelligence and Software, Kangwon National University, Chungang-ro 346, Samchuk 25913, Republic of Korea
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 184; https://doi.org/10.3390/math14010184
Submission received: 26 September 2025 / Revised: 14 December 2025 / Accepted: 30 December 2025 / Published: 3 January 2026

Abstract

As shopping shifts from offline to online in many societies, many studies have sought to match products to user preferences. However, existing collaborative filtering techniques suffer from data sparsity and cold start problems because few user–item interactions are available. This study proposes a recommendation system that combines customer preference for an item with quantitative indicators. To this end, the Amazon dataset is used to quantify each item’s attribute information through Sentence-BERT, and sentiment analysis is performed on the review data. The proposed model uses the attribute information and review data of an item simultaneously and achieves higher performance than using review text alone. We verified that our approach outperforms traditional baseline models in rating prediction and improves top K recommendation indicators. In addition, ablation studies found that integrating item attributes and review sentiment performs better than using either individually. This suggests that the complementary combination of objective item semantics and subjective user sentiment can model user preferences more accurately, enabling personalized recommendations.

1. Introduction

Due to the rapid development of information and communication technology, the shopping habits of consumers have changed significantly over time. In the past, most purchases were made in stores, but product purchases made through home shopping and Internet shopping have become common recently. These changes have caused an information overload of products and an increase in the diversity of items, which have caused difficulties in selecting the products that consumers want. In addition, for sellers, since recommending appropriate items in consideration of individual consumer preferences and leading them to purchases is directly linked to profit creation, the importance of item recommendation is increasingly emphasized. Accordingly, the importance of recommending customized products reflecting consumers’ individual preferences is becoming more prominent. In this context, it can be seen that the role of the recommendation system is becoming more important [1,2,3]. The recommendation system has been developed from collaborative filtering [4], which is a method of recommending products preferred by other users with similar tastes by utilizing information such as purchase records and ratings of users. The most widely used representative model among collaborative filtering models is Matrix Factorization (MF) [5]. Matrix Factorization (MF) is a model that learns latent factors based on interaction data between users and items and predicts user–item interactions by mapping user preferences and item characteristics to a low-dimensional vector space. This method can be effectively used for dimensional reduction in high-dimensional data and extraction of latent factors, but a limitation is that the performance of the model can be significantly degraded due to the cold start problem, wherein recommendations are not made properly in situations with scarce data [6]. 
Moreover, this model is limited in identifying the reason that a user rating is given, and as a result, it is difficult to accurately grasp the target user’s preference [7]. To overcome this limitation, various studies have been conducted that utilize additional information such as review text and item information beyond rating information [8,9,10,11]. The review text contains subjective information such as the characteristics of the product that are experienced by the user and their feelings and preferences for it [12]. Item information provides a detailed description of the product, helping with the selection of similar products. As such, data utilization of review data and item information contributes to mitigating data sparsity and cold start problems [8]. To extract high-quality, dense semantic features from item descriptions—a critical but often shallowly utilized data source—we specifically employ Sentence-BERT (SBERT) [13]. SBERT allows us to quantify rich item attributes more effectively than previous methods that relied on simpler embeddings like bag-of-words or classical Topic Models.
This study proposes a recommendation system that quantifies item attribute information through Sentence-BERT (SBERT) [13] using an Amazon [14] dataset and combines customer preferences for items with quantitative indicators (rating data) by performing sentiment analysis of review data. The system includes a rating prediction model, which learns the individual characteristics of users and items in depth by jointly utilizing item information and review data, and a model that recommends the top K items that users are likely to prefer. Recommendation models using review data have been implemented previously, but few have used item attribute information and review data at the same time. Unlike existing recommendation systems that use only review data, the model proposed in this study utilizes item attribute information and review data simultaneously to provide better performance. In addition, experiments on two benchmark datasets, Amazon’s “Grocery and Gourmet Food” and “Video Games”, show that the proposed model can be applied across different domains and validate the performance of both the rating prediction model and the top K recommendation model. The experimental results show that the proposed model improves on the existing baseline models. These results suggest that combining item attribute information and review text can reflect user preferences more accurately than using rating data only. Therefore, through item attributes and review text, the proposed model can produce recommendations that take into account user preferences that are difficult to grasp from rating data alone, and it is expected to improve the performance of personalized recommendations.

2. Related Work

2.1. Traditional and Advanced Collaborative Filtering

Matrix Factorization (MF) [5] is widely used as one of the core models in collaborative filtering. However, the sparsity of the rating matrix degrades the performance of the MF model [6]. More recently, advanced deep learning architectures, such as Graph Neural Networks (GNNs) like LightGCN [15] and Transformer-based models like SASRec [16], have been introduced to capture highly complex and sequential interaction patterns. However, these approaches often prioritize complex pattern modeling over explicit side information fusion. Various studies have been conducted to address the sparsity problem, and increasing interest has focused on the Factorization Machine (FM) [17], which can effectively model all interactions between variables in a multidimensional sparse dataset using factorization techniques. FM was proposed to overcome the limitation of matrix factorization, which mainly considers linear interactions. In addition, the Neural Collaborative Filtering (NCF) [18] model uses a neural network to model the complex nonlinear interactions between users and items. NCF improves on the performance of existing MF-based models and aims to offer a new solution to the data sparsity problem. Nevertheless, these models still have limitations when processing sparse datasets [6]. To overcome these limitations, this study aims to mitigate the data sparsity and cold start problems by utilizing item attributes and review texts on top of MF and NCF.

2.2. Recommendation Systems Based on User Reviews

Methods of utilizing review data have been studied to solve the data sparsity problem of collaborative filtering. Review data has been applied to recommendation systems by extracting information such as topics [19,20,21,22,23] or sentiments [24,25,26,27]. Terzi et al. [19] proposed a user-KNN algorithm that calculates the similarity between users based on the similarity of review data instead of rating data. In their study, the similarity between two users was calculated by measuring the similarity of words in the users’ review texts, and this similarity score was used as a weight in the rating prediction process. Kim et al. [23] proposed a Convolutional Matrix Factorization (CMF) model that integrates a CNN (Convolutional Neural Network) and Probabilistic Matrix Factorization (PMF). This model uses a CNN to capture latent vectors that reflect the context of the review text. The rating is predicted by integrating the extracted features into the PMF model. Shen et al. [24] proposed the Sentimental-based Matrix Factorization (SBMF) model, which adds sentiment information to the existing Matrix Factorization (MF) model. The sentiment score of a review is calculated by summing the sentiment scores of its keywords, which are taken from a newly constructed sentiment dictionary. The sentiment score calculated in this way is added to the PMF model and used for the final rating prediction. Poirier et al. [25] trained a naive Bayesian model for negative and positive classes to infer ratings from review data. The inferred ratings were then used with a collaborative filtering technique. In these studies, the performance of the recommendation system was improved by acquiring additional information through topic extraction or sentiment analysis of review data and integrating it into an existing recommendation model.
Wang et al. [28] analyzed reviews by using a graph to model the correlation between ratings and reviews, which can benefit the learning of semantic content. Vy et al. [29] used the BERT model to derive user vectors from observed text reviews and to compute the relationship between the text and the attributes of the reviews and, based on this, represented user preferences in more detail and with greater accuracy. However, both studies have the disadvantage of not addressing sparse data.

2.3. Recommendation Systems Based on Attributes of Items

Research has been conducted to overcome the limitations of collaborative filtering by utilizing not only review data but also item attribute information [10,25,30,31]. Poirier et al. [25] proposed a content-boosted collaborative filtering model, which is a hybrid method that combines conventional collaborative filtering and content-based methods. After the content information collected from IMDb (film title, director, genre, etc.) was expressed using the bag-of-words method, it was additionally used to predict user ratings. This method showed better performance than the existing collaborative filtering and content-based models. Kang et al. [10] additionally utilized movies’ rating data and metadata (titles, genres, etc.) and integrated SBERT embeddings of this information into the existing MF model to alleviate the data sparsity problem. Javaji et al. [32] proposed a new approach that creates document embeddings by combining two natural language processing models, SBERT and RoBERTa, and used it to build a model for recommending books. Jeong et al. [33] modified the SBERT model and used it to alleviate the cold start problem in movie recommendation.
Research using review data or item attribute information is being conducted continuously in the field of recommendation systems, but most research studies tend to focus on only one factor. In this study, by utilizing both review data and item attribute information, we intend to build an improved recommendation system that can reflect both the preferences of users and detailed information on items that are difficult to grasp simply based on ratings. In particular, research on rating prediction models is ongoing, but there is insufficient research on models that recommend the top K items to reflect user preferences in actual services. In fields that directly provide services to users, such as e-commerce, it is very important to build a model that effectively recommends the top K items to users [3]. Against this backdrop, the model proposed in this paper is intended to improve the performance of not only the rating prediction model but also that of the model that recommends the top K items.

3. Proposed Method

Figure 1 shows the structure of the overall model proposed in this study. The user set $U$ and the item set $I$ are defined as $U = \{u_k\}_{k=1}^{M}$ and $I = \{i_j\}_{j=1}^{N}$, respectively, where $M$ is the total number of users and $N$ is the total number of items. Additionally, the item attribute information obtained by embedding with Sentence-BERT (SBERT) [13] is defined as $D = \{d_j\}_{j=1}^{N}$, and the sentiment analysis result extracted from the review data is defined as $S = \{s_t\}_{t=1}^{L}$. Here, $L$ denotes the total number of observed user–item interactions, and each index $t$ corresponds to a unique pair $(u, i)$ in the dataset. The goal of this paper is to build an extended collaborative filtering model by utilizing item attribute information and review data.

3.1. Attributes of Items and Context of Review

The items included in the Amazon dataset [14] carry information in various forms, such as categories, descriptions, and images. Among them, the description text, which best represents the attributes of the item, was used. The description text was converted into a 768-dimensional vector by applying SBERT [13], which is defined as $D = \{d_j\}_{j=1}^{N}$. This high-dimensional vector, $d_j$, is treated as a fixed (non-learnable) feature vector. The converted vector is integrated into the existing baseline models of the recommendation system, MF [5] and NCF [18]. For the review text, VADER (Valence Aware Dictionary and sEntiment Reasoner) [11], as implemented in the Natural Language Toolkit (NLTK), was used to identify positive or negative feelings about the item. If the sentiment score was 0.2 or higher, the review was classified as positive (0), and if it was less than 0.2, it was classified as negative (1); these values were defined as $S = \{s_t\}_{t=1}^{L}$. This method can capture the user’s preference (like or dislike) for the item in more detail than simple rating analysis.
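The thresholding rule above can be sketched as follows. This is a minimal illustration, assuming the sentiment score is a single value in [-1, 1] such as VADER's compound score (the text does not state which VADER output is used); the function name `sentiment_label` and the example scores are hypothetical.

```python
POSITIVE_THRESHOLD = 0.2  # threshold stated in the text


def sentiment_label(score: float, threshold: float = POSITIVE_THRESHOLD) -> int:
    """Map a sentiment score to the binary label s_t.

    Follows the encoding given in the text: 0 = positive, 1 = negative.
    """
    return 0 if score >= threshold else 1


# Hypothetical scores for three reviews (not taken from the dataset).
scores = [0.85, 0.20, -0.40]
labels = [sentiment_label(s) for s in scores]
print(labels)  # [0, 0, 1]
```

In practice the score could come from NLTK's `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`, which requires the `vader_lexicon` resource to be downloaded first.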

3.2. Feature Vectors of Users and Items

Our model learns latent representations for users and items. Let $k$ be the dimension of the latent space (i.e., num_factors, set to 32 as defined in Section 4.4). User Feature Vector ($p_u$): As shown in Figure 2a, a user $u_k$ is fed into a learnable user embedding layer (e.g., torch.nn.Embedding) of size $M \times k$. This process maps the user’s one-hot index to their specific latent vector $p_u$, which is a $k$-dimensional vector. This vector, $p_u$, is a learnable parameter, initialized and updated during training.
Item Feature Vector ($\tilde{q}_i$): As shown in Figure 2b, the extended item feature vector $\tilde{q}_i$ is constructed from two distinct sources. The first is the Item Latent Vector ($q_i$), which, similar to the user vector, is the $k$-dimensional output of a separate, learnable item embedding layer of size $N \times k$. The second source is the Item Representation Vector ($d_j$), which is the 768-dimensional fixed SBERT embedding for the item, as defined in Section 3.1. These two vectors are then integrated to form the final item representation used by the prediction layer.
As shown in Equation (1), the item latent vector $q_i$ and the item attribute vector $d_j$ are concatenated and passed through a fully connected (FC) layer to create the final extended item feature vector $\tilde{q}_i$.
$$\tilde{q}_i = W \cdot [q_i, d_j] + b \tag{1}$$
Here, $[q_i, d_j]$ is the concatenated vector of dimension $k + 768$ (i.e., 32 + 768 = 800). To project this back to the latent dimension $k$, $W$ is a learnable weight matrix of size $k \times (k + 768)$ (i.e., 32 × 800), and $b$ is a learnable bias vector of dimension $k$ (i.e., 32). This process allows the model to learn how to best integrate the rich semantic information from SBERT into the collaborative filtering latent space.
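The projection in Equation (1) can be illustrated with a small pure-Python sketch. It is not the authors' PyTorch implementation: the helper names `linear` and `extend_item_vector` are hypothetical, the values are random, and toy sizes (k = 2, embedding size 3) stand in for the 32 and 768 used in the paper.

```python
import random


def linear(W, x, b):
    """Return W x + b for a list-of-rows matrix W and vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def extend_item_vector(q_i, d_j, W, b):
    """Equation (1): project the concatenation [q_i, d_j] back to k dimensions."""
    return linear(W, q_i + d_j, b)  # list "+" concatenates the two vectors


k, sbert_dim = 2, 3  # toy sizes; the paper uses k = 32 and 768
rng = random.Random(0)
q_i = [rng.uniform(-1, 1) for _ in range(k)]          # learnable item latent vector
d_j = [rng.uniform(-1, 1) for _ in range(sbert_dim)]  # fixed SBERT embedding
W = [[rng.uniform(-1, 1) for _ in range(k + sbert_dim)] for _ in range(k)]
b = [0.0] * k

q_tilde = extend_item_vector(q_i, d_j, W, b)
print(len(q_tilde))  # 2
```

Only `q_i`, `W`, and `b` would receive gradient updates during training; `d_j` stays fixed, as stated in Section 3.1.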

3.3. Prediction Layer

Our system applies the newly constructed user feature vector $p_u$ and the extended item feature vector $\tilde{q}_i$ to their respective baselines, MF and NCF, to predict ratings and recommend top K items.

3.3.1. Matrix Factorization (MF)

In the MF model (Figure 3), the base interaction score is calculated by the dot product of the user feature vector $p_u$ and the extended item vector $\tilde{q}_i$. To this, we add the standard user bias $b_u$ and item bias $b_i$. Finally, the sentiment score $s_t$ is added, modulated by a weight $\beta$, to produce the final predicted rating $\hat{r}_{ui}$ (Equation (2)).
$$\hat{r}_{ui} = p_u^{T} \tilde{q}_i + b_u + b_i + s_t \cdot \beta \tag{2}$$
Here, $t$ represents the specific interaction index corresponding to user $u$ and item $i$, and $b_u$, $b_i$, and $\beta$ are all learnable scalar parameters. $b_u$ and $b_i$ capture user- and item-specific rating offsets, while $\beta$ is a scalar weight that allows the model to learn the magnitude and direction of the sentiment’s influence on the final rating prediction.
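As a worked instance of Equation (2), the sketch below evaluates the prediction for one user–item pair. The function name `predict_mf` and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def predict_mf(p_u, q_tilde_i, b_u, b_i, s_t, beta):
    """Equation (2): dot product plus user bias, item bias, and weighted sentiment."""
    dot = sum(p * q for p, q in zip(p_u, q_tilde_i))
    return dot + b_u + b_i + s_t * beta


r_hat = predict_mf(
    p_u=[0.5, 1.0],        # user latent vector
    q_tilde_i=[2.0, 1.0],  # extended item vector from Equation (1)
    b_u=0.25,
    b_i=0.5,
    s_t=1,                 # 1 = negative review under the encoding in Section 3.1
    beta=-0.5,             # illustrative value; learned in the actual model
)
print(r_hat)  # 2.25
```

Because $s_t$ is 0 for positive reviews and 1 for negative ones, the term $s_t \cdot \beta$ only shifts predictions for negatively labeled interactions; a negative $\beta$, as in this example, lowers them.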

3.3.2. Neural Collaborative Filtering (NCF)

The NCF model (Figure 4) consists of a Generalized Matrix Factorization (GMF) part and a Multi-Layer Perceptron (MLP) part, which are combined to capture both linear and non-linear interactions. The GMF path uses the user vector $p_u$ and the extended item vector $\tilde{q}_i$ (defined in Section 3.2). As shown in Figure 4, the GMF layer produces a vector that is later concatenated. Therefore, we use an element-wise product (Hadamard product), denoted by $\odot$, to produce the GMF output vector $y_{gmf}$. This results in a $k$-dimensional vector that captures linear, factor-wise interactions.
$$y_{gmf} = p_u \odot \tilde{q}_i \tag{3}$$
The MLP path uses a separate, independent set of embedding vectors to model non-linear interactions. A user $u_k$ is fed into a learnable MLP-specific embedding layer of size $M \times k_{mlp}$ to create the user vector $u_u$. Similarly, an item $i_j$ is fed into a learnable MLP-specific embedding layer of size $N \times k_{mlp}$ to create the item vector $v_i$. We set the MLP latent dimension $k_{mlp}$ to be the same as $k$ (i.e., 32). Similar to the GMF path, this item vector $v_i$ is extended using the SBERT vector $d_j$, as shown in Equation (4).
$$\tilde{v}_i = W_{mlp} \cdot [v_i, d_j] + b_{mlp} \tag{4}$$
Here, $W_{mlp}$ and $b_{mlp}$ are new, learnable parameters (a $k_{mlp} \times (k_{mlp} + 768)$ matrix and a $k_{mlp}$-dimensional bias) that are independent of $W$ and $b$ from Equation (1). This extended item vector $\tilde{v}_i$ is then concatenated with the user vector $u_u$ to form the input to the first MLP layer, $z^{(1)}$.
$$z^{(1)} = [u_u, \tilde{v}_i] \tag{5}$$
This input vector $z^{(1)}$ (of dimension $k_{mlp} + k_{mlp}$) is then passed through a standard tower of $X$ MLP layers. The operation of the $L$-th layer is shown in Equation (6), where $z^{(L)}$ is the output of the $L$-th layer, and $W^{(L)}$, $b^{(L)}$, and $\alpha^{(L)}$ are the learnable weight matrix, bias vector, and ReLU activation function for that layer, respectively. The final output of the MLP stack is the vector $y_{mlp} = z^{(X)}$.
$$\phi^{(L)}(z^{(L-1)}) = \alpha^{(L)}(W^{(L)} z^{(L-1)} + b^{(L)}) \tag{6}$$
$$y_{mlp} = \alpha(h^{T} \phi^{(L)}(z^{(L-1)})) \tag{7}$$
Finally, as shown in Figure 4, the outputs from the GMF path ($y_{gmf}$), the MLP path ($y_{mlp}$), and the sentiment score ($s_t$) are concatenated. This combined vector is passed through one final fully connected “NeuMF_Layer” to produce the predicted rating $\hat{r}_{ui}$.
$$\hat{r}_{ui} = \alpha(h^{T} [y_{gmf}, y_{mlp}, s_t]) \tag{8}$$
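The data flow described in this subsection (Hadamard product in the GMF path, an MLP tower over the concatenated MLP vectors, then a final projection over the fused vector) can be traced with the following pure-Python sketch. It is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the final activation is assumed to be ReLU, toy dimensions are used, and the extended item vectors are passed in directly instead of being computed from SBERT embeddings.

```python
def linear(W, x, b):
    """Return W x + b for a list-of-rows matrix W and vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def ncf_forward(p_u, q_tilde_i, u_u, v_tilde_i, mlp_layers, h, s_t):
    # GMF path: element-wise (Hadamard) product of user and extended item vectors.
    y_gmf = [p * q for p, q in zip(p_u, q_tilde_i)]
    # MLP path: concatenate the MLP-specific vectors, then apply the layer tower.
    z = u_u + v_tilde_i
    for W, b in mlp_layers:
        z = relu(linear(W, z, b))
    y_mlp = z
    # Fusion: concatenate both paths with the sentiment label and project with h.
    fused = y_gmf + y_mlp + [float(s_t)]
    score = sum(w * x for w, x in zip(h, fused))
    return max(0.0, score)  # assumption: ReLU as the final activation


r_hat = ncf_forward(
    p_u=[1.0, 2.0], q_tilde_i=[0.5, 0.5],                # GMF vectors
    u_u=[1.0], v_tilde_i=[-1.0],                         # MLP vectors (toy size)
    mlp_layers=[([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])], # one identity layer
    h=[1.0, 1.0, 1.0, 1.0, -0.5],                        # final projection weights
    s_t=1,
)
print(r_hat)  # 2.0
```

Here the GMF output is [0.5, 1.0], the MLP output is [1.0, 0.0] after ReLU, and the fused vector [0.5, 1.0, 1.0, 0.0, 1.0] projects to 2.0.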

4. Experimental Results

4.1. Dataset Used

In this study, a dataset consisting of a large corpus of product reviews, collected by Amazon.com, was used [14]. The experiment was conducted by selecting the widely used Grocery and Gourmet Food and Video Games categories in different domains. These two categories were specifically chosen to test the model’s generalizability and robustness, as they represent contrasting domains of consumable goods (“Grocery”) and experience goods (“Video Games”), which differ significantly in their item attribute types (fact-based vs. abstract) and user review patterns (simple vs. complex sentiment).
In the data preprocessing stage, we utilized the following core data fields: reviewerID (User ID), asin (Item ID), overall (Rating), description, reviewText, and unixReviewTime (Timestamp). To ensure data integrity, interactions containing null or empty values in either the description or reviewText columns were discarded. Subsequently, so that the data could be split effectively into training, validation, and test sets, users with three or fewer interaction records were excluded from the dataset. Table 1 and Table 2 contain the dataset’s statistics before and after preprocessing, respectively. The total numbers of users, items, and interactions decreased after preprocessing, resulting in a decrease in the number of items per user and in density.
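The two filtering steps described above can be sketched as follows. This is a minimal sketch, assuming each interaction is a dictionary keyed by the field names listed in the text and that users are counted after incomplete rows are removed (the text implies but does not state this order); `preprocess` and `make_row` are hypothetical names.

```python
from collections import Counter


def preprocess(interactions, min_interactions=4):
    """Drop incomplete rows, then drop users with too few remaining rows."""
    # Step 1: discard rows whose description or reviewText is null or empty.
    complete = [
        row for row in interactions
        if row.get("description") and row.get("reviewText")
    ]
    # Step 2: keep users with at least `min_interactions` rows, i.e. exclude
    # users with three or fewer interaction records.
    counts = Counter(row["reviewerID"] for row in complete)
    return [row for row in complete if counts[row["reviewerID"]] >= min_interactions]


def make_row(user, item, description="desc", review="text"):
    """Build a toy interaction record (illustrative values only)."""
    return {"reviewerID": user, "asin": item,
            "description": description, "reviewText": review}


rows = [make_row("A", f"a{n}") for n in range(4)]
rows += [make_row("B", f"b{n}") for n in range(3)]
rows.append(make_row("B", "b3", review=""))  # dropped: empty reviewText
kept = preprocess(rows)
print(sorted({row["reviewerID"] for row in kept}))  # ['A']
```

User B starts with four rows but keeps only three after the empty review is removed, so B falls below the threshold and is excluded.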
To divide the dataset, we adopted a strict time-based, leave-one-out evaluation strategy to prevent temporal data leakage. For each user, we sorted all their interactions by timestamp. The most recent item was designated as the test set, and the second most recent item was designated as the validation set. All remaining (older) items for that user were used in the training set. This methodology ensures that the model is always trained on past data to predict future interactions, simulating a real-world scenario. For objective evaluation of model performance, for each user, 100 items that the user did not interact with were randomly selected as negative samples and used for performance evaluation along with the positive test item.
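The split and negative sampling described above can be sketched in a few lines. This sketch assumes interactions are `(user, item, timestamp)` tuples and that every user has at least three interactions, which the preprocessing step guarantees; the function names and the fixed seed are illustrative choices, not details from the paper.

```python
import random
from collections import defaultdict


def leave_one_out_split(interactions):
    """Per user: newest item -> test, second newest -> validation, rest -> train."""
    by_user = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))
    train, valid, test = {}, {}, {}
    for user, events in by_user.items():
        events.sort()  # oldest first
        items = [item for _, item in events]
        train[user], valid[user], test[user] = items[:-2], items[-2], items[-1]
    return train, valid, test


def sample_negatives(all_items, seen_items, n=100, seed=0):
    """Randomly pick n items the user never interacted with."""
    candidates = sorted(set(all_items) - set(seen_items))
    return random.Random(seed).sample(candidates, n)


interactions = [("u1", "a", 10), ("u1", "b", 30), ("u1", "c", 20), ("u1", "d", 40)]
train, valid, test = leave_one_out_split(interactions)
print(train["u1"], valid["u1"], test["u1"])  # ['a', 'c'] b d

negatives = sample_negatives("abcdefg", seen_items="abcd", n=2)
print(len(negatives))  # 2
```

`random.Random.sample` raises `ValueError` if fewer than `n` unseen items exist, so a real pipeline would need a catalogue of more than 100 items beyond each user's history.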

4.2. Evaluation Indicators

In this study, the following indicators were used to evaluate the performance of the recommendation model: RMSE (Root Mean Squared Error), Recall@K, NDCG@K (Normalized Discounted Cumulative Gain), and HIT@K [34,35,36]. RMSE is used to evaluate the performance of a rating prediction model by quantifying the difference between the rating $\hat{y}_i$ predicted by the recommendation model and the actual user rating $y_i$; the lower the RMSE value, the higher the model’s prediction accuracy.
The performance of the model that recommends the top K items was evaluated using the following indicators: Recall@K evaluates the inclusiveness of the recommendation system by measuring the ratio of the items included in the recommended top K items among the items that the user actually interacted with. NDCG@K is an index based on the importance and ranking of the recommended top K items that simultaneously considers the accuracy and ranking appropriateness of the recommendations. Finally, HIT@K evaluates how accurately the recommendation system captures user preferences by checking the case in which at least one of the top K recommendations corresponds to the user’s actual interaction item. Through these indicators, the performance of the recommendation system proposed in this paper was comprehensively evaluated.
$$\mathrm{Recall@}K = \frac{\text{Number of relevant items in } K}{\text{Total number of relevant items}}$$
$$\mathrm{NDCG@}K = \begin{cases} \dfrac{\log 2}{\log (i + 1)} & \text{if positive interaction in position } i \text{ of top } K \text{ interactions} \\ 0 & \text{if positive interaction not in top } K \text{ interactions} \end{cases}$$
$$\mathrm{HIT@}K = \begin{cases} 1 & \text{if positive interaction in top } K \text{ interactions} \\ 0 & \text{if positive interaction not in top } K \text{ interactions} \end{cases}$$
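The four indicators can be computed per user as in the sketch below, which follows the definitions above for the leave-one-out setting with a single positive test item. The function names and the toy ranking are hypothetical; per-user values would be averaged over all users to obtain the reported scores.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def recall_at_k(ranked_items, relevant, k):
    """Share of relevant items that appear in the top K."""
    return len(set(ranked_items[:k]) & set(relevant)) / len(relevant)


def hit_at_k(ranked_items, positive, k):
    """1 if the positive item appears in the top K, else 0."""
    return 1.0 if positive in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, positive, k):
    """log 2 / log(i + 1) for a positive item at 1-based position i, else 0."""
    top_k = ranked_items[:k]
    if positive not in top_k:
        return 0.0
    position = top_k.index(positive) + 1
    return math.log(2) / math.log(position + 1)


ranked = ["x", "p", "y"]  # candidate items sorted by predicted score
print(hit_at_k(ranked, "p", k=2))             # 1.0
print(recall_at_k(ranked, {"p"}, k=2))        # 1.0
print(round(ndcg_at_k(ranked, "p", k=2), 4))  # 0.6309
print(rmse([3.0, 5.0], [4.0, 4.0]))           # 1.0
```

With exactly one relevant item per user, Recall@K and HIT@K coincide, while NDCG@K additionally rewards placing that item closer to the top of the list.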

4.3. Baseline Model

To evaluate the performance of the proposed methodology, a comparison with two basic baseline models was performed: Matrix Factorization (MF) and Neural Collaborative Filtering (NCF). These existing baseline models utilize only interaction data (User ID, Item ID, Rating). The difference in performance was analyzed by additionally integrating item attribute information and review data information with the baseline models.
  • Matrix Factorization: This is a model that predicts interactions by learning latent factors from interaction data between users and items and expressing user preferences and item characteristics in a low-dimensional vector space.
  • Neural Collaborative Filtering: This is a model that combines the linearity of MF and the nonlinearity of MLP by combining Generalized Matrix Factorization and Multi-layer Perceptron.

4.4. Experimental Setup

To ensure the reproducibility of our experiments, this section details the experimental environment and hyperparameters. Our proposed models were implemented using Python 3.8 with the PyTorch 2.1 library. For embedding the item descriptions, we utilized the “bert-base-nli-mean-tokens” model from the Sentence-BERT (SBERT) library. The sentiment analysis of review text was performed using the SentimentIntensityAnalyzer (VADER) from the NLTK (Natural Language Toolkit) library (NLTK Project, Philadelphia, PA, USA), utilizing the “vader_lexicon”. All experiments were conducted on a workstation equipped with an Intel(R) Core(TM) i5-6600K CPU @ 3.50 GHz, 32 GB of RAM, and running PyTorch with CUDA 12.5 on an NVIDIA GeForce RTX 3060 (12 GB) GPU. For the hyperparameters of our models, the embedding dimension (num_factors) was set to 32, the learning_rate was 0.002, and the batch_size was 512. We used the Adam optimizer (PyTorch; Meta AI, Menlo Park, CA, USA) with a weight_decay of $1 \times 10^{-5}$ and trained each model for 10 epochs, employing an early stopping strategy based on the validation set’s NDCG@10 score.
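The early stopping strategy mentioned above could look like the following sketch. The paper specifies only the monitored metric (validation NDCG@10) and the 10-epoch budget; the `patience` value, the function name, and the callback interface are assumptions made for illustration.

```python
def train_with_early_stopping(run_epoch, evaluate, max_epochs=10, patience=2):
    """Train for up to max_epochs, stopping once validation NDCG@10 stalls.

    run_epoch(epoch) performs one training pass; evaluate(epoch) returns the
    validation NDCG@10 after that pass. `patience` is an assumed setting.
    """
    best_score, best_epoch, stale_epochs = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        run_epoch(epoch)
        score = evaluate(epoch)
        if score > best_score:
            best_score, best_epoch, stale_epochs = score, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score


# Hypothetical validation scores standing in for a real training run.
fake_scores = [0.10, 0.15, 0.14, 0.13, 0.20]
result = train_with_early_stopping(
    run_epoch=lambda epoch: None,
    evaluate=lambda epoch: fake_scores[epoch],
)
print(result)  # (1, 0.15)
```

In this toy run training stops after the fourth epoch, so the later score of 0.20 is never reached; a real implementation would also save the model weights at the best epoch.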

4.5. Results

The experimental results are shown in Table 3. Under each baseline MF and NCF, “baseline (MF, NCF) + side information” refers to a baseline model that additionally utilizes item attribute information and review text, which are the methodologies proposed in this study. As a result of the experiment, it can be seen that our proposed model has improved performance compared to the baseline model in most metrics. Through the RMSE index, it was confirmed that the performance of the rating prediction model for the Grocery and Gourmet Food and Video Games datasets was improved. In addition, as a result of analyzing Recall@K, NDCG@K, and HIT@K indicators, it was observed that the proposed model effectively recommends items that meet the user’s preferences. However, the performance improvement for the “Grocery and Gourmet Food” dataset on the NCF model (RMSE) and the MF model (Recall@K, NDCG@K, HIT@K) was insignificant.

4.5.1. Performance Comparison by Dataset

The proposed model performed comparatively better on the Grocery and Gourmet Food dataset than on the Video Games dataset. The average number of users per item, which is an important indicator of user interest and degree of interaction for each item, can be checked through the dataset statistics in Table 1. This average was higher for the Grocery and Gourmet Food dataset than for the Video Games dataset, at 3.36 users per item. A higher number of users per item means that more information is available about each item, which has an important impact on the ability of the recommendation model to learn user preferences precisely. This likely explains why the recommendation model can output more accurate predictions and achieve better overall performance on the Grocery and Gourmet Food dataset than on the Video Games dataset.

4.5.2. Ablation Study with Side Information

To analyze the independent contribution of each component, we conducted an ablation study. Table 4 shows the experimental results, measuring the performance change as we incrementally add our proposed components (SBERT item descriptions and review text sentiment) to each baseline model (MF and NCF).
The results of this ablation study demonstrate the influence of each information source. First, as a result of adding only item attribute information (e.g., “MF + item description”) to the baseline models, it was confirmed that the performance was improved compared to the baselines using only the rating. Second, as a result of the experiment using only review text (e.g., “MF + review text”), the performance was also improved compared to the baseline in most cases. However, an interesting phenomenon was observed in the Grocery and Gourmet Food dataset: the HIT@K, Recall@K, and RMSE indicators for the NCF model decreased unexpectedly, while the NDCG@K indicators rose slightly.

4.5.3. Recommendation Results

Table 5 showcases a successful recommendation case of the proposed model in the “Grocery and Gourmet Food” dataset. The user primarily purchased sweet and engaging snacks, such as marshmallow cones and Popin’ Cookin’ candy. The recommended items consist of nuts like “Planters” peanuts and healthy snacks like “KIND Bars.” This suggests that the model accurately captured the core category of the user’s preference (“Snack”) and successfully utilized the semantic attributes of the items through SBERT embedding and sentiment analysis to expand the recommendation to adjacent categories (e.g., energy bars) related to health and convenience. This case proves that the proposed model possesses high efficacy in attribute-based recommendation.
Table 6 reveals a structural limitation of the recommendations in the “Video Games” dataset. The user purchased a Nintendo 3DS game (“Mario Kart 7”) and a multi-platform headset, demonstrating a preference for the Nintendo brand and the party genre. Among the top 5 recommended items, three (Wii, New Super Mario Bros. Wii, Mario Kart 8) align with the Nintendo brand and preferred genre, showing the model’s ability to capture IP (Intellectual Property) and genre similarity. However, the remaining two items (Xbox 360 LIVE Points, Xbox 360 Wireless Controller) differ in nature from the previously purchased items. Our analysis suggests that the model over-interpreted the generic headset’s compatibility information (PS3, Xbox 360, PC) as a strong signal that the user uses a competitor platform (Xbox 360). This case shows that even if the proposed model successfully captures the semantic similarity of items, it cannot avoid domain-specific errors that compromise the practical utility of the recommendation when structural constraints such as platform compatibility, which is essential external metadata, are not integrated.
Table 5 shows an example of the recommendation result of a model that integrates item embedding information and sentiment analysis information into a Matrix Factorization model. For one user, the top 5 items recommended based on previously purchased items are presented. The results show that the user mainly purchased snacks and that the recommended items also consist mainly of snacks.

5. Discussion

This section interprets the experimental results presented in Section 4.5 within the context of the existing literature, compares our findings with previous studies, and discusses the implications and limitations of the study, followed by suggestions for future research.
Our primary finding is that the integration of “side information”—specifically, SBERT embeddings of item attributes and sentiment scores from review texts—into traditional collaborative filtering models (MF and NCF) yields a general improvement in performance (Table 3). This enhancement is evident in both rating prediction (RMSE) and top K item recommendation (Recall@K, NDCG@K, HIT@K). This approach extends previous research that has typically focused on utilizing either item attributes [25,30,31] or user reviews [19,24,26] individually. For instance, while Kang et al. [10] achieved success by integrating SBERT with MF and Shen et al. [24] improved PMF with sentiment scores, our study investigates the simultaneous integration of these two heterogeneous data sources.
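To make the four reported metrics concrete, the following is a minimal sketch of how RMSE, Recall@K, NDCG@K, and HIT@K are commonly computed for a single user under binary relevance. The function names and the toy item identifiers are illustrative assumptions; the exact evaluation protocol used in the experiments (for example, how held-out items are sampled) may differ from this sketch.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def hit_at_k(ranked, relevant, k=10):
    """1.0 if at least one relevant item appears in the top-k list, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical item ids): one of two relevant items is retrieved.
ranked_items = ["a", "b", "c", "d"]
relevant_items = {"b", "x"}
print(hit_at_k(ranked_items, relevant_items, k=3))
print(recall_at_k(ranked_items, relevant_items, k=3))
print(ndcg_at_k(ranked_items, relevant_items, k=3))
```

Per-user values are typically averaged over all test users to obtain dataset-level scores. Note that when exactly one item is held out per user, Recall@K and HIT@K coincide under these definitions.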
The results of our ablation study (Table 4) validate this combined approach. The models incorporating both information sources (e.g., “MF + side information” in Table 3) generally outperformed models using only item descriptions (“MF + item description”) or only review text (“MF + review text”). This suggests that the objective item attributes and the subjective user sentiments are not redundant but, rather, complementary. They work in concert to provide a more comprehensive model of user preferences than rating data alone can achieve.
The practical implication of this study is that leveraging both item content and user sentiment can help mitigate the inherent data sparsity and cold start problems of recommender systems. However, our analysis also reveals several nuances and limitations. First, the performance gains were not uniform. As shown in Table 3, the improvement in the top K metrics for the MF model and in RMSE for the NCF model was negligible on the “Grocery and Gourmet Food” dataset. As discussed in Section 4.5, we attribute this to the dataset’s extreme sparsity (0.0201% density). In such sparse environments, the underlying interaction patterns may be too weak for side information to provide a substantial boost.
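The density figure quoted above follows directly from the counts in Table 2. The short sketch below recomputes it; the function name is an illustrative assumption, and density is taken to mean the number of observed interactions divided by the number of possible user–item pairs.

```python
def interaction_density(n_users, n_items, n_interactions):
    """Share of the user-item matrix that contains an observed interaction."""
    return n_interactions / (n_users * n_items)


# Counts for "Grocery and Gourmet Food" after preprocessing (Table 2).
grocery_density = interaction_density(
    n_users=127_379, n_items=37_900, n_interactions=972_653
)
print(f"{grocery_density:.4%}")  # prints 0.0201%
```

In other words, roughly 2 out of every 10,000 possible user–item pairs are observed, which is the sense in which the dataset is extremely sparse.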
Furthermore, the ablation study (Table 4) revealed a dataset-specific anomaly. For the “Grocery and Gourmet Food” dataset, adding only review text to the NCF model (“NCF + review text”) unexpectedly decreased performance on the HIT@K, Recall@K, and RMSE metrics. As hypothesized in our analysis (Section 4.5.2), this may be due to the low Type–Token Ratio (TTR) of the review text in this category (0.0051), which implies low vocabulary diversity and thus limited utility of the sentiment data. This finding highlights that the quality and characteristics of side information are critical factors that can influence model performance in complex ways. We also note that the “Grocery” dataset’s overall superior performance compared to “Video Games” (discussed in Section 4.5.1) is likely linked to its higher average number of users per item (3.36), which provides a richer learning environment for the model. A final limitation is our use of general-purpose models, the “bert-base-nli-mean-tokens” SBERT model and the VADER sentiment lexicon, neither of which was fine-tuned on domain-specific data.
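For readers unfamiliar with the Type–Token Ratio, the sketch below shows one common way to compute it over a collection of reviews: the number of distinct word forms divided by the total number of words. The regular-expression tokenizer and the lowercasing step are assumptions made for illustration; the tokenization used to obtain the 0.0051 value is not restated here and may differ.

```python
import re


def type_token_ratio(texts):
    """Corpus-level TTR: distinct tokens divided by total tokens."""
    tokens = []
    for text in texts:
        # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
        tokens.extend(re.findall(r"[a-z0-9']+", text.lower()))
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical reviews with a highly repetitive vocabulary.
reviews = ["good good good snack", "good snack"]
print(round(type_token_ratio(reviews), 4))  # prints 0.3333
```

A TTR close to zero means that the same words recur very often. Because corpus-level TTR also tends to fall as the corpus grows, values are most comparable between corpora of similar size.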
Based on these findings and limitations, we propose several avenues for future research. First, the model’s generalizability should be tested on diverse domains, such as “Electronics” or “Fashion”, which have different attributes and review characteristics. Second, a promising direction is the integration of embeddings from modern Large Language Models (LLMs). This study utilized SBERT, which was a state-of-the-art approach when the research was initiated. However, the recent prevalence of powerful LLMs offers a new opportunity to capture a much broader and deeper “global knowledge.” Future research could explore replacing or augmenting the SBERT embeddings with representations from LLMs to more effectively model the complex semantics of item descriptions and user reviews, thereby enhancing the recommendation system’s contextual understanding and accuracy.

6. Conclusions

In this paper, we proposed and validated a methodology for enhancing collaborative filtering by integrating two heterogeneous data sources: SBERT embeddings derived from item descriptions and VADER sentiment scores extracted from user reviews. Our experiments, conducted on two distinct Amazon datasets (“Grocery and Gourmet Food” and “Video Games”), demonstrated that the proposed hybrid models (MF + Side Information, NCF + Side Information) improved rating prediction accuracy. The gain was most pronounced for the Matrix Factorization (MF) baseline, whose RMSE fell from 3.681 to 1.110 on “Grocery and Gourmet Food” and from 3.111 to 1.163 on “Video Games” (Table 3). The proposed models also generally outperformed the traditional baseline models across the top K item recommendation metrics (Recall, NDCG, HIT).
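The size of the RMSE change can be expressed as a relative reduction with respect to each baseline. The sketch below does this arithmetic for the RMSE values reported in Table 3; the helper name and the dictionary layout are illustrative assumptions.

```python
def relative_reduction(baseline, improved):
    """Relative decrease of an error metric with respect to the baseline."""
    return (baseline - improved) / baseline


# (baseline RMSE, RMSE with side information), copied from Table 3.
rmse_table3 = {
    ("MF", "Grocery and Gourmet Food"): (3.681, 1.110),
    ("MF", "Video Games"): (3.111, 1.163),
    ("NCF", "Grocery and Gourmet Food"): (1.081, 1.081),
    ("NCF", "Video Games"): (1.151, 1.130),
}

for (model, dataset), (baseline, improved) in rmse_table3.items():
    change = relative_reduction(baseline, improved)
    print(f"{model:>3} | {dataset:<24} | {change:.1%}")
```

Under these numbers, the MF baseline's RMSE drops by roughly 70% and 63% on the two datasets, whereas the NCF baseline's RMSE changes by less than 2%, which is consistent with the observation that the rating prediction gain is concentrated in the MF model.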
Furthermore, our ablation study confirmed that both item attributes and review sentiment individually contribute to performance, validating our hypothesis that they provide complementary and non-redundant signals. The significance of this research lies in its practical approach to mitigating the core challenges of data sparsity and the cold start problem in recommender systems. By successfully synthesizing objective item facts with subjective user experiences, this study shows that we can create a much richer and more comprehensive model of user preference than when relying on sparse user–item interaction data alone. This study’s primary contribution is the design and empirical validation of this specific multi-view hybrid approach.
While previous studies have focused on using either item attributes or review sentiment, our work provides a direct comparison and demonstrates the value of their synthesis. Additionally, our detailed analysis of dataset-specific anomalies (e.g., the low TTR in the “Grocery” dataset) contributes a nuanced understanding of when and why certain side information may fail or succeed, which is a valuable finding for researchers in this field.
Based on the limitations identified in our Discussion, we propose several avenues for future research. First, the model’s generalizability should be tested on diverse domains such as “Electronics” or “Fashion,” which possess different attributes and review characteristics. Second, a promising direction is the integration or augmentation of SBERT embeddings with representations from modern Large Language Models (LLMs). The recent prevalence of powerful LLMs offers a new opportunity to capture much broader and deeper “global knowledge” for item semantics.

Author Contributions

D.L. and T.L. contributed equally to this work. Their contributions to this paper are as follows: D.L. contributed to the methodology, software, formal analysis, investigation, resources, data curation, original draft preparation, and visualization. The contributions of T.L. were in reviewing and editing the writing, supervision, and project administration. Both authors also contributed to the conceptualization and validation of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a 2022 Research Grant from Kangwon National University and Technology Innovation Program (RS-2024-00507228, Development of process upgrade technology for AI self-manufacturing in the cement industry), funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

Data Availability Statement

The original data presented in the study are openly available in the Amazon review dataset at https://nijianmo.github.io/amazon/ (accessed on 30 June 2025) or via [14].

Acknowledgments

This study was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Das, M.; Morales, G.D.F.; Gionis, A.; Weber, I. Learning to question: Leveraging user preferences for shopping advice. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 203–211.
  2. Burke, R.; Felfernig, A.; Goker, M.H. Recommender systems: An overview. AI Mag. 2011, 32, 13–18.
  3. Schafer, J.B.; Konstan, J.; Riedl, J. Recommender systems in e-commerce. In Proceedings of the 1st ACM conference on Electronic Commerce, Denver, CO, USA, 3–5 November 1999; pp. 158–166.
  4. Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, 1–5 May 2001; pp. 285–295.
  5. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. IEEE Comput. 2009, 42, 30–37.
  6. Wei, J.; He, J.; Chen, K.; Zhou, Y.; Tang, Z. Collaborative filtering and deep learning based recommendation system for cold start items. Expert Syst. Appl. 2017, 69, 29–39.
  7. He, X.; Chen, T.; Kan, M.-Y.; Chen, X. TriRank: Review-aware Explainable Recommendation by Modeling Aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, 18–23 October 2015.
  8. Srifi, M.; Oussous, A.; Lahcen, A.A.; Mouline, S. Recommender systems based on collaborative filtering using review texts—A survey. Information 2020, 11, 317.
  9. Zhang, Z.; Zhang, D.; Lai, J. urCF: User Review Enhanced Collaborative Filtering; AMCIS: Bubendorf, Switzerland, 2014.
  10. Kang, W.S.; Lee, S.; Choi, S.M. A Matrix Factorization-based Recommendation Approach with SBERT Embeddings. J. Korean Inst. Inf. Technol. 2023, 21, 203–211.
  11. Burke, R. Hybrid recommender systems: Survey and experiments. User Model. User-Adapt. Interact. 2002, 12, 331–370.
  12. Chen, L.; Chen, G.; Wang, F. Recommender systems based on user reviews: The state of the art. User Model. User-Adapt. Interact. 2015, 25, 99–154.
  13. Reimers, N.; Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084.
  14. Ni, J.; Li, J.; McAuley, J. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, 3–7 November 2019; pp. 188–197.
  15. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 639–648.
  16. Kang, W.-C.; McAuley, J. Self-attentive sequential recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 197–206.
  17. Rendle, S. Factorization Machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, Sydney, Australia, 13–17 December 2010; pp. 995–1000.
  18. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182.
  19. Terzi, M.; Rowe, M.; Ferrario, M.A.; Whittle, J. Text-based user-knn: Measuring user similarity based on text reviews. In Proceedings of the User Modeling, Adaptation, and Personalization: 22nd International Conference, UMAP 2014, Aalborg, Denmark, 7–11 July 2014; pp. 195–206.
  20. Zoghbi, S.; Vulic, I.; Moens, M.F. Latent Dirichlet allocation for linking user-generated content and e-commerce data. Inf. Sci. 2016, 367, 573–599.
  21. Qiu, L.; Gao, S.; Cheng, W.; Guo, J. Aspect-based latent factor model by integrating ratings and reviews for recommender system. Knowl.-Based Syst. 2016, 110, 233–243.
  22. McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, 12–16 October 2013; pp. 165–172.
  23. Kim, D.; Park, C.; Oh, J.; Lee, S.; Yu, H. Convolutional matrix factorization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 233–240.
  24. Shen, R.P.; Zhang, H.R.; Yu, H.; Min, F. Sentiment based matrix factorization with reliability for recommendation. Expert Syst. Appl. 2019, 135, 249–258.
  25. Poirier, D.; Fessant, F.; Tellier, I. Reducing the cold-start problem in content recommendation through opinion classification. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Toronto, ON, Canada, 31 August–3 September 2010; Volume 1, pp. 204–207.
  26. Zhang, Y.; Lai, G.; Zhang, M.; Zhang, Y.; Liu, Y.; Ma, S. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, 6–11 July 2014; pp. 83–92.
  27. Diao, Q.; Qiu, M.; Wu, C.Y.; Smola, A.J.; Jiang, J.; Wang, C. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 193–202.
  28. Wang, K.; Zhu, Y.; Zang, T.; Wang, C.; Liu, K.; Ma, P. Multi-aspect graph contrastive learning for review-enhanced recommendation. ACM Trans. Inf. Syst. 2023, 42, 51.
  29. Vy, H.T.H.; Pham-Nguyen, C.; Nam, L.N.H. Integrating textual reviews into neighbor-based recommender systems. Expert Syst. Appl. 2024, 24, 123648.
  30. Behera, G.; Nain, N. Handling data sparsity via item metadata embedding into deep collaborative recommender system. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 9953–9963.
  31. Grivolla, J.; Campo, D.; Sonsona, M.; Pulido, J.M.; Badia, T. A hybrid recommender combining user, item and interaction data. In Proceedings of the 2014 International Conference on Computational Science and Computational Intelligence, Las Vegas, NV, USA, 10–13 March 2014; Volume 1, pp. 297–301.
  32. Javaji, S.R.; Sarode, K. Multi-BERT Embeddings for Recommendation System. arXiv 2023, arXiv:2308.13050.
  33. Jeung, H.; Jeon, J.; Lee, S. Movie Recommender System (BERT-More). J. Korean Inst. Inf. Technol. 2025, 23, 39–50.
  34. Herlocker, J.; Konstan, J.A.; Terveen, L.G.; Riedl, J.T. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 2004, 22, 5–53.
  35. Wang, Y.; Wang, L.; Li, Y.; He, D.; Liu, T.Y. A theoretical analysis of NDCG type ranking measures. In Proceedings of the Conference on Learning Theory, Princeton, NJ, USA, 12–14 June 2013; pp. 25–54.
  36. Wu, S.; Tang, Y.; Zhu, Y.; Wang, L.; Xie, X.; Tan, T. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
Figure 1. Structure of the proposed model. The item description text $d_j$ is embedded through SBERT, and the item attribute information $i_j$ is embedded to create the vector $q_i$; these two vectors are combined to create $\tilde{q}_j$. Analogously to the item feature vector $q_i$, the vector $p_u$, produced from user attribute information, is input to the baseline model, and the final item rating is predicted by combining the value $s_t$ obtained from the review data. Flame icons mark learnable models, and “Frozen” marks a fixed model.
Figure 2. Feature vectors that are defined in the proposed model: (a) user feature vector, (b) item feature vector.
Figure 3. How features are calculated in Matrix Factorization (MF).
Figure 4. How features are calculated in Neural Collaborative Filtering (NCF) (adapted from [18]).
Table 1. Dataset before preprocessing (items/user means items per user, # means an amount).
| Dataset | #User | #Item | #Interaction | #Items/User | Density |
|---|---|---|---|---|---|
| Grocery and Gourmet Food | 127,496 | 41,280 | 1,167,889 | 3.08 | 0.0221% |
| Video Games | 55,223 | 17,389 | 568,986 | 3.17 | 0.0592% |
Table 2. Dataset after preprocessing (items/user means items per user, # means an amount).
| Dataset | #User | #Item | #Interaction | #Items/User | Density |
|---|---|---|---|---|---|
| Grocery and Gourmet Food | 127,379 | 37,900 | 972,653 | 3.36 | 0.0201% |
| Video Games | 55,212 | 16,835 | 461,876 | 3.28 | 0.0496% |
Table 3. Experimental results of RMSE, Recall@K, NDCG@K, and HIT@K for each dataset (R@10 means Recall@K, N@10 means NDCG@K, and H@10 means HIT@K, with K = 10; “Grocery” denotes “Grocery and Gourmet Food”).
| Model | Grocery RMSE | Grocery R@10 | Grocery N@10 | Grocery H@10 | Video Games RMSE | Video Games R@10 | Video Games N@10 | Video Games H@10 |
|---|---|---|---|---|---|---|---|---|
| Matrix Factorization | 3.681 | 0.504 | 0.353 | 0.503 | 3.111 | 0.482 | 0.285 | 0.479 |
| Matrix Factorization with side information | 1.110 | 0.504 | 0.354 | 0.503 | 1.163 | 0.501 | 0.296 | 0.498 |
| Neural Collaborative Filtering | 1.081 | 0.480 | 0.295 | 0.479 | 1.151 | 0.448 | 0.254 | 0.446 |
| Neural Collaborative Filtering with side information | 1.081 | 0.495 | 0.349 | 0.494 | 1.130 | 0.479 | 0.300 | 0.477 |
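Table 4. Experimental results for each item attribute and review text (“Grocery” denotes “Grocery and Gourmet Food”).
| Model | Grocery RMSE | Grocery R@10 | Grocery N@10 | Grocery H@10 | Video Games RMSE | Video Games R@10 | Video Games N@10 | Video Games H@10 |
|---|---|---|---|---|---|---|---|---|
| Matrix Factorization | 3.681 | 0.504 | 0.353 | 0.503 | 3.111 | 0.482 | 0.285 | 0.479 |
| Matrix Factorization, only item description | 2.275 | 0.504 | 0.354 | 0.503 | 1.827 | 0.495 | 0.290 | 0.493 |
| Matrix Factorization, only review text | 1.122 | 0.504 | 0.354 | 0.503 | 1.195 | 0.488 | 0.291 | 0.486 |
| Neural Collaborative Filtering | 1.081 | 0.480 | 0.295 | 0.479 | 1.151 | 0.448 | 0.254 | 0.446 |
| Neural Collaborative Filtering, only item description | 1.077 | 0.481 | 0.303 | 0.480 | 1.133 | 0.471 | 0.302 | 0.469 |
| Neural Collaborative Filtering, only review text | 1.082 | 0.470 | 0.297 | 0.470 | 1.131 | 0.466 | 0.272 | 0.464 |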
Table 5. Example of recommendation results.
Previously Purchased Items:
  • Yum Yum Marshmallow Cones 30 ct
  • Popin’ Cookin’ Make Bento
  • Japanese candy assortment 30 pcs, full of dagashi. “TONOSAMA CANDY”
Recommended Items (Top 5):
  • Planters Dry-Roasted Peanuts, Dry-Roasted, Lightly Salted, 16 Ounce (Pack of 12)
  • Planters Peanuts, Honey-Roasted & Salted, 52 Ounce Canister (Pack of 2)
  • Kind Bars, Madagascar Vanilla Almond, Gluten-Free, Low Sugar, 1.4oz
  • KIND Bars, Dark Chocolate Nuts & Sea Salt, Gluten-Free, 1.4 Ounce Bars, 12 Count
  • KIND Bars, Dark Chocolate Chili Almond, Gluten-Free, 1.4 Ounce Bars, 12 Count
Table 6. Example of recommended results for “Video Games”.
Previously Purchased Items:
  • Mario Kart 7
  • Turtle Beach-Ear Force PX22 Universal Amplified Gaming Headset-PS3, Xbox 360, PC
  • Turtle Beach-Ear Force PX22-Universal Amplified Gaming Headset- PS3, Xbox 360, PC-FFP [Old Version]
Recommended Items (Top 5):
  • Wii
  • Xbox 360 LIVE 1600 Points
  • New Super Mario Bros. Wii
  • Xbox 360 Wireless Controller-Glossy Black
  • Mario Kart 8-Nintendo Wii U
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
