1. Introduction
Recent advancements in Natural Language Processing (NLP), exemplified by large language models such as GPT-5, have highlighted the growing importance of textual information across various domains. This development coincides with the rapid expansion of e-commerce platforms like Amazon, Taobao, and JD.com, which has driven the widespread adoption of recommendation systems in areas ranging from movie suggestions and talent acquisition to advertising. Among these, e-commerce stands out as a key domain due to the massive volume of user- and item-related textual data it generates—particularly user reviews. As a result, review-aware recommendation has become an increasingly important research area within the recommender systems community [
1].
Traditional recommendation systems primarily rely on collaborative filtering, which generates recommendations based on similarities among users or items. To address the data sparsity issue inherent in user-item rating matrices, matrix factorization techniques such as Singular Value Decomposition (SVD) have been widely used. These approaches decompose the rating matrix to extract latent features of users and items, thereby enabling the prediction of ratings for previously unrated items. For example, Sarwar et al. [
2] proposed an SVD-based algorithm that effectively predicted ratings for unrated items in movie recommendation systems. More recent work, such as that by Fan et al. [
3], has integrated self-attention networks with low-rank decomposition to construct context-aware representations from users’ historical interactions, achieving strong performance. Despite these advances, traditional collaborative filtering methods remain susceptible to persistent issues such as data sparsity and the cold-start problem.
With the rise of deep learning in recommendation systems, review-aware approaches have gained traction as an effective strategy to mitigate data sparsity. Early work by Jakob et al. [
4] showed that incorporating textual features such as price, service quality, and sentiment from user reviews could reduce prediction error. However, their method primarily focused on modeling one-to-one correlations between explicit features, overlooking potential latent features [
5]. Most existing review-aware recommendation methods rely on probabilistic topic models to uncover latent feature distributions of users and items from textual content. Nonetheless, these models adopt bag-of-words representations, which disregard word order and lack the essential local contextual information crucial for sentiment analysis [
6].
Furthermore, these methods primarily capture shallow, linear features and fail to fully exploit nonlinear latent features [
7]. To address this limitation, deep learning models have been effectively applied to capture word order information and incorporate various attention mechanisms, thereby enhancing the quality of text-based feature extraction [
8,
9,
10]. However, modeling user reviews remains a challenging task due to the inherent noise and sentiment embedded in the text. These factors hinder the extraction of key information from reviews and limit the ability to accurately associate it with user preferences.
To tackle this problem, attention-based models have been proposed. For example, NRM [
11] captures crucial review information using an attention mechanism, while D-ATT [
12] employs a dual-attention structure to model user context and interests. With the advent of the Transformer architecture [
12], BERT [
13] has emerged as a powerful tool for textual representation. DSMR [
14] adopts BERT to model user reviews for recommendation tasks and has achieved promising results.
Nevertheless, existing review-aware recommendation systems typically extract semantic information by computing the similarity between review embeddings, often overlooking the multi-dimensional nature of user reviews. For instance, some users may prefer brief reviews, which may still convey strong sentiment. We argue that incorporating these multi-dimensional aspects can lead to more accurate modeling of user preferences and more effective personalized recommendations.
In addition, user ratings are often inconsistent with their corresponding reviews—a phenomenon we refer to as polarity bias. For instance, a significant portion of 5-star ratings may be accompanied by lukewarm or even negative review texts, creating conflicting signals that degrade model performance. This bias is particularly evident when users tend to give extremely high or low ratings, regardless of the review content. While some recommendation models, such as CARP [
15] and U-BERT [
16], employ contrastive learning to model different polarities independently and obtain a more comprehensive understanding of user preferences [
17,
18,
19], they often fail to account for the imbalanced distribution of positive and negative reviews. This oversight may cause models to lean toward the dominant polarity, leading to biased recommendations.
Moreover, user preferences are inherently dynamic. Many existing models incorporate temporal information to track the evolution of user behavior and improve recommendation accuracy [20, 21]. However, external factors such as promotional events or holidays can also shape user behavior, often leading to concentrated purchasing activity over short periods. Polarity bias and temporal dynamics are also intertwined: a user's polarity tendency (e.g., being a 'critical' or 'generous' rater) may itself change over time, and a negative review from three years ago should carry less weight than a recent positive one. Existing models often treat the two as independent problems and fail to capture their interaction.
In light of these challenges, we propose RARPT, a review-aware recommendation model that incorporates both polarity and temporality. To capture user preferences, RARPT applies dot-product attention to fuse review vectors with their associated attributes. At the same time, to model temporal shifts in user preferences, we adopt a sequential model to learn features from review sequences. Specifically, to address polarity imbalance, we introduce a cross-attention module that generates supplementary collaborative vectors from reviews of the opposite polarity. As shown in
Figure 1, our dataset analysis confirms the presence of polarity skew. Experimental results demonstrate that the proposed module effectively mitigates this issue and improves recommendation performance.
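To make the cross-attention idea concrete, the following is a minimal sketch, not the RARPT implementation. It assumes that review vectors of the sparse polarity act as queries and that review vectors of the opposite polarity act as both keys and values, with no learned projection matrices. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections.

    queries: review vectors of the polarity to be supplemented.
    keys_values: review vectors of the opposite polarity (assumption:
        used as both keys and values).
    Returns one supplementary vector per query.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        mixed = [
            sum(w * v[i] for w, v in zip(weights, keys_values))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical toy vectors: two negative reviews attend over three positive ones.
negative_reviews = [[1.0, 0.0], [0.5, 0.5]]
positive_reviews = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
supplement = cross_attention(negative_reviews, positive_reviews)
```

Each output is a convex combination of the opposite-polarity vectors, so the supplementary vector stays within the range spanned by those reviews. A trained model would normally add learned query, key, and value projections.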
In summary, the primary contributions of this paper are as follows:
We propose RARPT, a new recommendation model that introduces a novel polarity balance mechanism to explicitly address data imbalance in review-aware recommendation. Unlike existing methods, RARPT models both polarity and temporality simultaneously.
Our primary technical contribution is the Polarity Balance Layer, which utilizes a cross-attention mechanism in a novel way to synthesize supplementary collaborative vectors from the dominant polarity class to augment the sparse class, effectively mitigating polarity bias.
We conduct a comprehensive set of experiments across six benchmark datasets (five Amazon sub-datasets and Yelp) to evaluate the effectiveness of our model. The results show that RARPT outperforms several classical and state-of-the-art baselines.
The remainder of this paper is organized as follows:
Section 2 reviews related work.
Section 3 presents the details of the proposed RARPT framework, including its key components.
Section 4 describes the experimental setup, and Section 5 reports the experimental results that validate the effectiveness of our approach. Finally, Section 6 concludes the paper and outlines directions for future work.
4. Experimental Setup
In this section, we first present the datasets used in our experiments along with their key characteristics. We then describe the experimental settings and hyperparameters. Finally, we introduce the baseline algorithms used for comparison with the proposed RARPT model.
4.1. Datasets
We conducted experiments on two publicly available datasets: Amazon and Yelp.
The Amazon dataset, derived from the Amazon e-commerce platform, is widely used in recommendation system research. In our study, we selected five of its sub-datasets: Toys and Games, Digital Music, Video Games, Office Products, and Tools & Home Improvement. The Yelp dataset was sourced from the 13th round of the official Yelp Challenge, which contains reviews of businesses such as restaurants and bars.
In addition to basic fields such as user ID, item ID, rating, and review text, we also leveraged metadata including the number of likes each review received. To enhance the reliability of training, we removed cold-start users and items, following the practice of [
38], ensuring that each user and item has at least five associated reviews. Reviews exceeding 512 tokens were truncated, and empty reviews were filtered out.
Ratings in both datasets range from 1 to 5 stars. We define reviews with ratings below 3 as negative, and those with ratings of 3 and above as positive. Notably, in real-world data, ratings are often skewed toward 4–5 stars, which can introduce bias and lead to overfitting. To address this, we balanced the dataset by randomly sampling reviews across all five rating levels in a 1:1:1:1:1 ratio, thereby ensuring equal representation and improving model robustness. We acknowledge that this balancing strategy is an abstraction and does not reflect the skewed, real-world distribution. However, this approach was intentionally chosen to create a controlled experimental environment. It specifically evaluates the model’s ability to handle polarity after removing the majority class advantage, thereby rigorously testing the effectiveness of our Polarity Balance Layer rather than allowing the model to simply overfit to the dominant 5-star ratings. A comparative study on the original, imbalanced data remains a key direction for future work.
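The labeling rule and the 1:1:1:1:1 balancing described above can be sketched as follows. This is an illustration under stated assumptions: the text does not say how many reviews are drawn per rating level, so the sketch downsamples every level to the size of the smallest one. The record layout, function names, and toy counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def polarity(rating):
    """Ratings below 3 are negative; ratings of 3 and above are positive."""
    return "negative" if rating < 3 else "positive"


def balance_by_rating(reviews, seed=0):
    """Downsample so that each of the five star levels has the same count.

    Assumption: every level is reduced to the size of the rarest level.
    """
    by_rating = defaultdict(list)
    for review in reviews:
        by_rating[review["rating"]].append(review)
    levels = [1, 2, 3, 4, 5]
    per_level = min(len(by_rating[r]) for r in levels)
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    balanced = []
    for r in levels:
        balanced.extend(rng.sample(by_rating[r], per_level))
    return balanced


# Hypothetical toy data, skewed toward high ratings.
toy_counts = {1: 2, 2: 3, 3: 4, 4: 6, 5: 10}
reviews = [
    {"rating": r, "text": f"review {i}"}
    for r, n in toy_counts.items()
    for i in range(n)
]
balanced = balance_by_rating(reviews)
label_counts = Counter(polarity(r["rating"]) for r in balanced)
```

Under this labeling rule, a rating-balanced sample is not polarity-balanced: two of the five rating levels are negative, so 40% of the balanced reviews are negative and 60% are positive.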
Following the approach in [
38], we divided both datasets into training (80%), validation (10%), and testing (10%) sets in a time-aware manner, maintaining the same split ratio within each user’s interactions to preserve temporal consistency.
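One possible reading of the per-user, time-aware 80/10/10 split is sketched below. The field names `user` and `timestamp` are hypothetical, and the rounding rule (integer floor on the cut points) is an assumption, since the text does not state how fractional counts are handled.

```python
from collections import defaultdict


def time_aware_split(interactions, train_pct=80, valid_pct=10):
    """Split each user's interactions chronologically into train/valid/test.

    Cut points use integer arithmetic to avoid floating-point rounding.
    """
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row["user"]].append(row)

    splits = {"train": [], "valid": [], "test": []}
    for rows in by_user.values():
        rows.sort(key=lambda r: r["timestamp"])  # oldest first
        n = len(rows)
        train_end = n * train_pct // 100
        valid_end = n * (train_pct + valid_pct) // 100
        splits["train"].extend(rows[:train_end])
        splits["valid"].extend(rows[train_end:valid_end])
        splits["test"].extend(rows[valid_end:])
    return splits


# Hypothetical toy history: one user with ten interactions in shuffled order.
rows = [{"user": "u1", "timestamp": t} for t in [5, 3, 9, 1, 7, 2, 8, 4, 10, 6]]
splits = time_aware_split(rows)
```

With floor rounding, a user with only five reviews contributes four training interactions, one test interaction, and no validation interaction, so short histories may need a different rounding rule.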
To evaluate the model’s performance, we compute the loss between the predicted scores and the ground truth ratings.
Table 2 presents detailed statistics for the evaluation datasets, including the distribution of positive and negative reviews based on our classification scheme.
4.2. Experimental Settings
To comprehensively evaluate the performance of our model and the baseline methods, we adopt Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the primary evaluation metrics for rating prediction tasks.
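For reference, the two metrics can be computed as in the short sketch below. The function names and the toy ratings are illustrative only and are not taken from the experiments.

```python
def mse(predicted, actual):
    """Mean Squared Error: average of squared prediction errors."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def mae(predicted, actual):
    """Mean Absolute Error: average of absolute prediction errors."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy ratings: the errors are -1, 0, 2, and 0.
predicted = [4.0, 2.0, 5.0, 3.0]
actual = [5.0, 2.0, 3.0, 3.0]
print(mse(predicted, actual))  # 1.25
print(mae(predicted, actual))  # 0.75
```

MSE penalizes the single two-star miss more heavily than MAE does, which is why the two metrics can rank models differently when errors are unevenly distributed.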
We implemented our model using the PyTorch 2.4 framework. The embedding of each review text is initialized as a 768-dimensional vector from pretrained BERT embeddings. We use 12 heads and 12 layers in the Transformer component. Other training settings, such as the dropout rate and weight decay rate, follow the original BERT configuration. The review attributes are incorporated during the data preprocessing stage, with the specific processing approach described in the review attribute attention layer.
We employ Xavier initialization [
38] for all trainable weights in the model. Grid search is applied to tune hyperparameters based on validation set performance. To prevent overfitting, we add dropout layers after all fully connected layers and major modules, with a default dropout rate of 0.2. We use Adam optimizer with a learning rate of
. We vary the number of reviews used per user and per item in the set {4, 6, 8, 10}, and experiment with learning rates in
. The training batch size is searched in {32, 64, 128, 256}.
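The grid search over the settings listed above could look like the following sketch. Only the two grids whose values appear in the text (reviews per user/item and batch size) are included. The `evaluate` callback and the toy objective are hypothetical stand-ins for training the model and measuring validation error.

```python
from itertools import product

REVIEW_COUNTS = [4, 6, 8, 10]
BATCH_SIZES = [32, 64, 128, 256]


def grid_search(evaluate):
    """Return the (config, score) pair with the lowest validation score."""
    best_config, best_score = None, float("inf")
    for n_reviews, batch_size in product(REVIEW_COUNTS, BATCH_SIZES):
        config = {"n_reviews": n_reviews, "batch_size": batch_size}
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_error(config):
    """Stand-in objective for demonstration only.

    A real run would train the model with `config` and return its
    validation MSE.
    """
    return abs(config["n_reviews"] - 8) + abs(config["batch_size"] - 64) / 64


best_config, best_score = grid_search(toy_validation_error)
```

Selecting on validation error rather than test error keeps the test set untouched until the final comparison.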
4.3. Compared Methods
We compare the performance of our proposed method against a series of state-of-the-art review-based recommendation models, described as follows:
DeepCoNN [
9]: A deep learning model that utilizes user and item reviews through two parallel convolutional neural networks, which are connected at the final layer. It independently learns user and item embeddings from reviews and then passes the concatenated embeddings into a Factorization Machine (FM) for rating prediction.
NARRE [
29]: This model builds two parallel networks for users and items, each containing a convolutional layer followed by an attention mechanism. It not only aims to predict accurate ratings but also captures the usefulness of individual reviews.
DAML [
39]: DAML incorporates a local attention layer to filter review information and a mutual attention layer to learn user–item interactions. It unifies both ratings and reviews in a single neural architecture, and further integrates a Neural Factorization Machine to model high-order nonlinear feature interactions.
DSMR [
14]: DSMR adopts BERT for encoding review texts and uses an LSTM to model users’ temporal preference dynamics across reviews.
CARP [
15]: This model extracts aspect–opinion pairs from user and item reviews and introduces an emotional capsule network based on a bidirectional routing mechanism, enhancing both interpretability and rating prediction performance.
U-BERT [
16]: U-BERT leverages a pre-training and fine-tuning strategy to bridge the gap in user content sparsity by transferring knowledge from domains with rich review data to those with limited content.
MPCN [
15]: Built on a co-attentive learning scheme, MPCN identifies key reviews from both users and items, and then performs fine-grained word-by-word matching. This approach enhances interpretability and enables deep-level semantic interaction between reviews.
RPRM [
36]: RPRM explicitly links reviews with their associated attributes and introduces two novel loss functions along with a negative sampling strategy to jointly model user preferences and review attribute relationships.
To ensure a fair comparison across all baseline models, we applied early stopping during training and reproduced the models using the hyperparameters specified in their original papers. For consistency, we employed BERT-based embeddings for all review inputs and adopted the same data preprocessing strategies across models. Specifically, we filtered reviews based on their confidence scores, length, ratings, and temporal span to retain high-quality, informative reviews. Invalid entries, such as empty or overly short reviews, were removed. We also applied the rating-balancing preprocessing described in Section 4.1 to ensure an equal distribution of ratings.
For DeepCoNN and NARRE, we followed the configurations in [
9,
29], setting the learning rate to values in {0.005, 0.01, 0.02, 0.05}, batch size in {50, 100, 150}, dropout rate in {0.1, 0.3, 0.5, 0.7, 0.9}, and the number of latent factors in {8, 16, 32, 64}.
For DAML, the dimensionality of user and item latent vectors was set to 8, the sliding window size was set to 3, the dropout rate to 0.2, and the learning rate to .
For DSMR, we initialized the learning rate at 0.01 and adopted the dynamic adjustment strategy via the NoamOpt optimizer, as detailed in the original paper. Dropout rates were tested within {0.05, 0.1, 0.3, 0.5}, batch sizes in {3, 5, 8, 16, 32}, and latent factor dimensions in {32, 64, 128, 256}.
For CARP, we followed the recommended setup in [
40], setting the number of capsules and predefined thresholds to 5 and 3, respectively.
For U-BERT, we utilized the five domains used by the original authors—Books, CDs & Vinyl, Cell Phones, Electronics, and Video Games—for pre-training. Fine-tuning and testing were conducted on our selected Amazon subsets.
For MPCN, the number of pointers was adjusted across {1, 3, 5, 8, 10}.
For RPRM, the learning rate was varied between and , following the configurations provided in the original paper.
5. Experimental Results
To validate the effectiveness of our proposed method, we conducted extensive quantitative and qualitative experiments designed to address the following research questions:
Q1: Can our proposed method outperform both state-of-the-art review-based and traditional recommendation baselines?
Q2: How do different modules (such as review attribute attention layer) contribute to the overall performance of our proposed model?
Q3: How do key hyperparameters (such as the weights of collaborative vectors δ) affect the performance of our model?
Q4: Can the recommendation results provide interpretability that platforms can use for other personalized services?
5.1. Q1: Performance Comparison
To address Q1, we first compare the performance of eight benchmark algorithms against our proposed RARPT model and its variants. To better illustrate the impact of individual review attributes on recommendation performance, we also conduct ablation experiments using only a single review attribute at a time. Additionally, we test the performance of RARPT under the full attribute attention setting, enabling us to assess which attributes have the greatest influence on the final recommendation results. The main findings are summarized below.
We observe that the basic MLP method shows a significant performance gap compared to other models across all datasets. This is primarily because the MLP lacks the capacity to model complex user–item interactions or leverage rich contextual information from review texts. By contrast, review-based models demonstrate substantial performance gains, confirming that user reviews serve as powerful auxiliary information to enhance recommendation effectiveness.
Among these, DeepCoNN performs consistently worse than NARRE and DAML, as the latter two integrate attention mechanisms to emphasize informative reviews, thereby learning more expressive user and item representations. Notably, MPCN achieves performance comparable to or even better than NARRE and DAML on certain datasets. We attribute this to MPCN’s ability to capture word-level interactions through its pointer mechanism, which enables fine-grained modeling of user preferences.
These results collectively underscore the importance of fine-grained review modeling, especially the ability to selectively attend to and interact with key textual elements in reviews, rather than treating them as flat sequences.
While RARPT demonstrates consistently lower MSE and MAE across all datasets, we did not perform formal statistical significance tests (e.g., paired t-tests) in this study. The consistency of the improvements across all six diverse datasets is encouraging, but it does not substitute for such tests, which we recommend incorporating in future validation work.
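As an illustration of the paired t-test mentioned above, the sketch below computes the test statistic from per-dataset errors of a baseline and a model. The error values are made-up placeholders chosen only to exercise the code. They are not results from this study, and the function name is hypothetical.

```python
import math


def paired_t_statistic(baseline_errors, model_errors):
    """Return (t, degrees_of_freedom) for a paired t-test.

    Positive t means the baseline error is larger on average.
    Requires at least two pairs with non-identical differences.
    """
    diffs = [b - m for b, m in zip(baseline_errors, model_errors)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1


# Placeholder MSE values for six datasets (NOT taken from the paper).
baseline_mse = [1.10, 1.20, 1.05, 1.30, 1.15, 1.25]
model_mse = [1.00, 1.12, 1.01, 1.18, 1.09, 1.20]
t_stat, dof = paired_t_statistic(baseline_mse, model_mse)
```

With six datasets there are five degrees of freedom, for which the two-sided critical value at the 0.05 level is about 2.571. In practice a library routine such as `scipy.stats.ttest_rel` would also return the p-value directly.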
Table 3 presents the comprehensive performance comparison of RARPT against all baseline methods on the six datasets, evaluated by MSE and MAE. The ‘Module’ column lists all baselines and the different configurations of our RARPT model. The rows ‘Timestamp’ through ‘Emotional analysis’ represent RARPT using only that single review attribute for the attention layer, demonstrating the individual contribution of each attribute. The final ‘RARPT’ row shows the performance of our full model integrating all 9 attributes. Lower values indicate better performance.
The CARP model introduces capsule networks with a bidirectional routing mechanism while explicitly modeling both positive and negative reviews. Its performance shows a slight improvement over MPCN, suggesting that incorporating review polarity contributes to better generalization and thus enhances overall recommendation effectiveness.
Substantial performance gains are observed with U-BERT, DSMR, and RPRM, all of which leverage the powerful BERT language model for review representation. Among them, U-BERT benefits from cross-domain review knowledge transfer, using user reviews from domains with rich content to enhance those with sparse information. DSMR captures temporal dynamics of user preferences by integrating BERT-based embeddings with LSTM architectures. RPRM, on the other hand, achieves fine-grained modeling by jointly attending to multiple key review attributes, allowing the model to extract more nuanced preference signals.
Regarding our proposed RARPT model, we further performed attribute-level ablation studies to evaluate the contribution of individual review attributes as well as their combined effect. The results show that each single-attribute variant of RARPT consistently outperforms all baseline methods, and that the integration of all nine attributes yields the best overall performance. Notably, the review attributes ‘Timestamp’, ‘Positive proportion’, and ‘Negative proportion’ stand out with the strongest individual impact.
Timestamp aligns with RARPT’s Temporal Processing Layer, allowing the model to leverage sequential patterns in user behavior.
Positive/Negative proportions reflect the polarity distribution of user reviews and interact effectively with the Polarity Balance Layer, contributing to more balanced and context-aware embeddings.
Interestingly, the contribution of each attribute is dataset-dependent. For instance, polarity-related attributes are more influential in datasets with a higher proportion of negative reviews (e.g., Yelp and Video Games). This suggests that contextual characteristics of datasets (e.g., sentiment skew, domain specificity) should inform attribute weighting strategies, reinforcing the need for dynamic attention mechanisms.
Finally, based on
Figure 1, we observe that RARPT delivers the most notable performance gains in datasets with higher negative review ratios, validating the effectiveness of the collaborative vector transfer mechanism employed by the Polarity Balance Layer. Even in datasets with relatively few negative reviews (e.g., Office Products), RARPT still achieves meaningful improvements, demonstrating its robustness and adaptability across domains.
From a practical standpoint, this consistent reduction in MSE/MAE is highly valuable. For a real-world e-commerce platform, this translates to predictions that are much closer to the user’s true satisfaction. This enhanced accuracy can lead to higher user trust in the recommendations, improved click-through rates, and ultimately, increased customer retention and sales.
In summary, these findings confirm that the RARPT model benefits from comprehensive review attribute modeling, temporal context, and polarity balancing, enabling it to address limitations of prior models and to improve recommendation performance consistently across diverse datasets.
5.2. Q2: Ablation Experiment
To answer Q2, we conducted a series of ablation experiments across multiple datasets to investigate the contribution of each module within the RARPT model. Specifically, we designed the following model variants:
RARPT (Full Model): The complete model configuration using optimal hyperparameters, consistent with the results reported in
Table 3.
w/o BERT: This variant replaces BERT-based review embeddings with word2vec embeddings, reducing the semantic richness of review representations.
w/o attribute: This variant removes the review attribute attention layer, such that only individual reviews are modeled, without incorporating review attribute-level attention.
w/o sequence: This variant excludes the temporal modeling layer, thereby removing sequential modeling after the attribute attention step and directly proceeding to polarity balancing.
w/o polarity: This variant removes the polarity balance layer, which prevents the model from leveraging sentiment-based collaborative signals from reviews with opposite polarities.
The ablation results are shown in Table 4. Removing any module degrades performance. Ranked by the resulting MSE and MAE (higher is worse), the variants order as follows: w/o polarity > w/o BERT > w/o attribute > w/o sequence. Because a larger error after removal implies a larger contribution, the polarity balance layer contributes the most to the overall performance of RARPT, followed by the BERT-based text embedding and the attribute attention layer, while the temporal modeling layer has a relatively smaller positive impact. Notably, even the weakest variant still outperforms traditional baselines, demonstrating the robustness of the overall architecture.
The performance drop in the w/o BERT variant highlights the critical role of rich semantic embeddings in enabling downstream modules to mine latent information from review texts. The contextualized embeddings from BERT provide greater representational capacity than static embeddings such as word2vec, making it easier to capture subtle signals such as tone, intent, or fine-grained sentiment, which are essential for personalized recommendations.
The polarity balance layer contributes most significantly to the recommendation performance on most datasets. As shown in
Figure 1, the performance gains are particularly notable on the datasets with a higher ratio of negative reviews (e.g., Yelp, Tools & Home Improvement). This demonstrates that distinguishing between positive and negative reviews and learning opposite-polarity collaborative signals can mitigate biases from imbalanced sentiment distributions and achieve better recommendation performance.
While the temporal modeling layer contributes the least among the four variants, it still improves performance compared to baseline models and single-attribute variants. Its effectiveness is more pronounced in domains where user behavior is temporally dynamic. On the other hand, the review attribute attention layer plays an essential role in enabling the model to assign different weights to various review attributes, which enriches representation learning beyond plain text.
Finally, our ablation study confirms the contribution of each component of RARPT to the overall performance. Among these components, the polarity balance layer and review embeddings with BERT are particularly critical. These findings further validate the effectiveness of our model and underscore the importance of multi-dimensional review modeling, textual richness, temporal dynamics, attribute-level attention, and sentiment-aware collaborative representation learning.
5.3. Q3: Experiments on Key Hyperparameters
To answer Q3, we conducted a set of controlled experiments to analyze how two key hyperparameters in the Polarity Balance Layer affect the overall performance of the RARPT model.
Our primary focus is on:
(1) The weight coefficient (δ) assigned to the balanced collaborative vector, and
(2) The number of reviews selected per user and item.
Figure 5 and
Figure 6, respectively, illustrate the impact of these two parameters on the MAE and MSE evaluation metrics.
To assess how the weight δ influences model performance, we tested values in the set {0, 0.3, 0.6, 0.9, 1.2}. As shown in
Figure 5, we observed the following trends:
For datasets with a lower proportion of negative reviews (e.g., Toys and Games, Office Products), higher δ values yielded better results. This indicates that incorporating collaborative vectors from opposite-polarity reviews helps supplement limited sentiment diversity.
For datasets with a more balanced sentiment distribution (e.g., Yelp), moderate δ values performed best. In these cases, excessively large δ values degraded performance—likely due to overemphasizing polarity signals at the expense of semantic coherence.
Across all datasets, δ values greater than 1 consistently failed to improve performance, suggesting that collaborative vectors should serve as auxiliary signals, not dominant features. Overweighting them disrupts the primary semantic flow of the original user-item representation.
In practice, we found the optimal range for δ to be between 0.2 and 0.6, striking a balance between incorporating polarity-driven signals and maintaining the integrity of original representations.
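The role of δ as an auxiliary weight can be illustrated with a small sketch. It assumes, purely for illustration, that the collaborative vector is added to the original representation after scaling by δ. The actual combination rule is defined in the Polarity Balance Layer and may differ. The function name and toy vectors are hypothetical.

```python
def add_collaborative(original, collaborative, delta):
    """Combine a representation with a delta-weighted collaborative vector.

    Assumption: simple additive combination, original + delta * collaborative.
    """
    return [o + delta * c for o, c in zip(original, collaborative)]


# Toy unit vectors on different axes, so each component can be read off directly.
original = [1.0, 0.0]
collaborative = [0.0, 1.0]
for delta in [0, 0.3, 0.6, 0.9, 1.2]:
    combined = add_collaborative(original, collaborative, delta)
    print(delta, combined)
```

In this toy setting the collaborative component becomes larger than the original component only at δ = 1.2, which matches the intuition that weights above 1 let the auxiliary signal dominate.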
(2) The Number of Reviews per User/Item
We also explored the effect of the number of user/item reviews incorporated into the model, testing values in the set {4, 6, 8, 10}. As depicted in
Figure 6, the following patterns emerged:
Increasing the number of reviews generally led to better performance, with improvements tapering off as the count increased.
This diminishing return suggests that while a minimum number of reviews (e.g., ≥6) is essential to effectively characterize user behavior and support the model’s temporal sequence modeling, additional reviews beyond a certain point add marginal benefit while significantly increasing computational cost.
Based on these findings, we recommend selecting 6 to 8 reviews per user/item to balance accuracy and efficiency.
Summary:
Our hyperparameter analysis demonstrates that both the collaborative vector weight (δ) and the number of selected reviews play critical roles in optimizing RARPT performance. Tuning these parameters appropriately based on dataset characteristics—especially sentiment distribution and review density—is essential for maximizing the effectiveness of the Polarity Balance Layer and the overall model.
5.4. Q4: Interpretability
To address Q4, we analyze the interpretability dimension of the RARPT model’s recommendation mechanism. Since RARPT extracts multiple review attributes and integrates them with review polarity and temporal sequence features, interpretability can be derived from the attention weights assigned to each attribute during inference. These weights provide insight into which review characteristics most influence the final recommendation for different users.
To demonstrate this, we selected three users with distinct behavioral characteristics from the dataset using Python 3.12 for exploratory data analysis. Their profiles are summarized as follows:
User A ranks in the top 10% of all users in terms of number of reviews and temporal span (i.e., the time interval between their earliest and most recent reviews). Other attributes, such as average review length, are close to the 50th percentile, approximating the dataset mean.
User B ranks in the top 10% for average review length, suggesting a tendency to write longer and more detailed reviews. Other attributes, such as the emotional word ratio, are again around the 50th percentile.
User C ranks in the top 10% for the proportion of positive and negative emotional words, indicating a preference for emotionally expressive reviews. However, review length and other metrics remain close to the dataset average.
We then extracted attribute-level attention weights from multiple representative reviews written by each user. By averaging these weights, we derived a personalized attribute weight profile for each user, which is visualized in
Figure 7 (pie chart of attribute weight ratios) and
Figure 8 (histogram and line chart comparison of review attribute weights across users). In
Figure 8, the chart on the left (histogram) allows for direct comparison of attribute weights across users, while the chart on the right (line plot) highlights the different ‘preference signatures’ of each user.
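The averaging step used to build each user's attribute weight profile can be sketched as follows. The attribute names come from the text, but the weight values are invented for illustration, and renormalizing the averaged weights to sum to 1 is an assumption made so the profile can be read as proportions.

```python
def attribute_profile(per_review_weights):
    """Average attention weights over a user's reviews, then renormalize.

    per_review_weights: list of dicts mapping attribute name -> weight,
        one dict per review, all with the same keys.
    Returns a dict of proportions that sums to 1.
    """
    names = list(per_review_weights[0].keys())
    n = len(per_review_weights)
    mean = {
        name: sum(weights[name] for weights in per_review_weights) / n
        for name in names
    }
    total = sum(mean.values())
    return {name: value / total for name, value in mean.items()}


# Invented weights for two reviews by one hypothetical user.
user_reviews = [
    {"Timestamp": 0.5, "Length": 0.2, "Positive proportion": 0.3},
    {"Timestamp": 0.7, "Length": 0.1, "Positive proportion": 0.2},
]
profile = attribute_profile(user_reviews)
dominant = max(profile, key=profile.get)
```

A profile of this form maps directly onto a pie chart of attribute weight ratios, and the dominant attribute gives a one-word summary of the user's preference signature.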
User A, characterized by a high review frequency and wide temporal span, exhibits stronger attention weight toward the Timestamp attribute, suggesting that recent reviews have a greater influence on their preferences.
User B, who tends to write longer reviews, demonstrates higher attention weights for Length and Emotional Length, indicating that emotionally rich and extensive reviews are more predictive of their preferences.
User C, known for a high frequency of emotional expression, shows significantly higher weights for Positive and Negative Emotion Proportions, highlighting the model’s ability to detect emotional sensitivity in user behavior.
Interestingly, Timestamp, Positive Proportion, and Negative Proportion consistently receive high attention across all three users. This finding aligns with the earlier experimental results in
Section 5.1, reaffirming the importance of recency and emotional polarity in shaping recommendation accuracy.
These results indicate that RARPT not only improves predictive performance but also provides fine-grained interpretability at the user level. Online platforms can leverage such insights to deliver personalized recommendation explanations, support user segmentation, and design tailored content strategies based on dominant user traits.
6. Conclusions, Limitations, and Future Work
6.1. Conclusions
In this paper, we proposed a review-aware recommendation method, RARPT, based on polarity and temporality. Our method integrates user reviews, their attributes, and sequential information to address the challenge of imbalanced polarity distribution in user-generated content, thereby improving the performance of review-aware recommendations. Extensive experiments on real-world datasets demonstrate that RARPT outperforms several state-of-the-art recommendation algorithms. Furthermore, our results verify that generating collaborative vectors from opposite polarities effectively mitigates review imbalance, making recommendations less susceptible to the bias caused by an overabundance of positive reviews.
6.2. Limitations
Despite the strong performance, our work has several limitations that offer avenues for future research:
Dataset Balancing: As discussed in
Section 4.1, our experimental design used a balanced 1:1:1:1:1 rating distribution. While this isolates the model’s polarity handling, it deviates from real-world skewed data. The model’s performance on the original, imbalanced distributions has not yet been validated.
Dataset Scope: The Amazon and Yelp datasets are exclusively in English and focus on consumer products and local businesses. The generalizability of RARPT to other languages, domains (e.g., media, academic citations), or datasets with naturally balanced polarity remains to be explored.
Statistical Validation: This study relies on comparative MSE/MAE values. We did not perform formal statistical significance tests (e.g., t-tests) to rigorously confirm that the improvements over baselines are statistically significant.
6.3. Future Work
Based on these limitations and our findings, we plan to extend this framework in several directions:
Imbalanced Data Evaluation: Conduct extensive experiments on the original, skewed datasets to validate the effectiveness of the Polarity Balance Layer in a real-world setting.
Cross-Domain and Multilingual Extension: Adapt and evaluate RARPT for multilingual datasets and different domains (e.g., talent recommendation, as initially suggested) to test its generalizability.
Advanced Temporal Modeling: Explore more sophisticated temporal weighting schemes beyond linear normalization, such as exponential decay or self-attentive temporal encoding, to better capture the nuances of preference drift.
Integration with LLMs: Investigate the use of large language models (LLMs) to replace the BERT encoder or to provide richer, more nuanced attribute extraction, potentially identifying implicit polarity cues that sentiment dictionaries miss.
Scalability: Assess the training and inference efficiency of RARPT on large-scale, industrial-sized datasets to ensure its feasibility in production environments.
Domain Transfer: While we tested on multiple domains, future work could explore formal domain transfer techniques to apply a model trained on a data-rich domain (like ‘Video Games’) to a sparse domain (like ‘Office Products’).
User Fairness: Analyze the model for potential fairness issues, such as whether it disproportionately benefits users with many reviews (rich data) versus new or 'cold-start' users (sparse data).
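The contrast between linear normalization and the exponential decay weighting mentioned under Advanced Temporal Modeling can be sketched as follows. Interpreting "linear normalization" as min-max scaling of timestamps is an assumption, and the `half_life` parameter and function names are hypothetical.

```python
def linear_weights(timestamps):
    """Min-max normalize timestamps to [0, 1]; the newest review gets 1."""
    oldest, newest = min(timestamps), max(timestamps)
    span = newest - oldest
    if span == 0:
        return [1.0 for _ in timestamps]
    return [(t - oldest) / span for t in timestamps]


def exponential_decay_weights(timestamps, half_life):
    """Halve the weight for every `half_life` units of age.

    Age is measured relative to the newest review, which gets weight 1.
    """
    newest = max(timestamps)
    return [0.5 ** ((newest - t) / half_life) for t in timestamps]


# Review times in days; the oldest review is three years before the newest.
days = [0, 365, 730, 1095]
print(linear_weights(days))
print(exponential_decay_weights(days, half_life=365))  # [0.125, 0.25, 0.5, 1.0]
```

Linear normalization assigns the oldest review a weight of exactly zero regardless of how old it is, whereas exponential decay keeps a small nonzero weight and lets the half-life control how quickly old reviews fade.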