1. Introduction
With the rapid advancement of information and communication technologies and the broad use of the internet, the e-commerce market has continued to expand [1,2]. As new products are introduced yearly, users can conveniently browse and purchase various items online [3]. However, this increased availability of options has led to the problem of information overload, where users must select suitable items from a large volume of information [4]. In response to this challenge, businesses have recognized the growing importance of recommender systems that provide personalized item suggestions based on individual user preferences. Recommender systems typically analyze users’ historical behavioral data, such as purchase history, click patterns, and viewing records, to estimate preferences and recommend items that are more likely to be accepted or purchased [5]. This reduces users’ information search costs, enhances their overall shopping experience, and enables businesses to improve customer satisfaction and secure a sustainable competitive advantage [6]. Accordingly, recommender systems have become a core component of e-commerce platforms, significantly improving user experience and business performance.
Collaborative filtering (CF) is one of the most widely used recommendation approaches in e-commerce, providing suggestions based on users’ past behavior records (e.g., ratings, clicks, and visits) [6,7]. However, since these approaches rely solely on historical behavioral data, they often fail to capture the underlying motivations behind user preferences, which can limit recommendation accuracy [8]. In particular, data sparsity refers to the lack of sufficient historical user–item interactions and is a major factor that negatively affects recommendation performance [9]. To address this limitation, researchers have proposed models that incorporate auxiliary information, such as online review texts [6,10]. These reviews contain valuable insights into user preferences related to various attributes of an item. Integrating such information into recommendation models makes it possible to clarify the rationale behind specific recommendations and enhance prediction accuracy [3,11]. This approach is especially beneficial for users with limited behavioral history, often referred to as cold-start users [12]. Consequently, many studies have focused on analyzing review texts to extract preference information and incorporate it into recommendation frameworks [6,9,13].
Review-based recommendation approaches can generally be divided into implicit and explicit methods, depending on how features are extracted from the review texts [14]. Implicit methods aim to capture latent feature representations without explicitly interpreting the semantic content of the reviews [15]. For instance, convolutional neural networks (CNNs) are commonly employed to encode review texts into dense representations that serve as embeddings for users and items [7]. Although such deep learning-based methods have demonstrated strong performance in preference prediction, their limited transparency in the training process often reduces interpretability.
In contrast, explicit methods are grounded in domain knowledge and extract predefined features from reviews using analytical techniques such as topic modeling and sentiment analysis. For example, topic distributions derived from review texts using Latent Dirichlet Allocation (LDA) can be integrated into recommendation models to help explain why a specific item was suggested to a user [16]. While explicit methods provide a high level of explanatory capability, they may compress complex textual information into simplified numerical forms, which can lead to the loss of valuable semantic details in the original text. Considering these characteristics, implicit and explicit approaches each offer distinct advantages [14]. By leveraging the complementary strengths of both, it is possible to build more accurate and explainable recommendation models that fully utilize the information embedded in user-generated reviews.
Owing to the success of deep learning in computer vision and natural language processing, increasing attention has been given to the effective fusion of heterogeneous features [17]. Early studies combined different types of features using element-wise products or simple concatenation. However, these approaches are limited in their ability to fully capture and represent the complex interactions between features [18]. To address this limitation, recent studies have proposed advanced feature fusion methods that incorporate attention mechanisms to better model the complementarity between features [19]. By jointly learning the interactions among diverse features, attention-based fusion techniques can capture how one feature influences another and generate richer and more expressive representations [20]. In practice, the fused features produced by these techniques have been shown to significantly improve model performance [19]. In this context, incorporating such fused features into a recommendation framework enables a more effective model that integrates the complementary strengths of implicit and explicit methods.
However, research considering the complementarity between explicit and implicit methods in recommender systems remains relatively limited. Previous studies have primarily evaluated the two approaches independently [7,21] or combined them using simple fusion strategies such as concatenation or weighted averaging [22,23]. Although attention-based fusion has demonstrated promise in related domains, its structured application within hybrid recommendation models, particularly for jointly modeling intra-feature relevance and inter-feature interactions, has not been sufficiently explored.
To address the limitations of previous studies, we propose a novel recommendation model called HNNER (Hybrid Neural Network for Explainable Recommendation), which integrates LDA-derived explicit features with BERT-based implicit representations through a hierarchical attention architecture. In our approach, self-attention mechanisms are employed to capture contextual dependencies within each feature type, while co-attention mechanisms are used to model their mutual interactions. This design enables more dynamic and interpretable fusion compared to prior methods, facilitating a richer integration of semantic and topic-level information for improved recommendation performance. To evaluate the recommendation performance of the proposed HNNER, we conducted experiments using three product category datasets from Amazon. The results demonstrate that HNNER outperforms various baseline models. The key contributions of this study are as follows:
We propose HNNER, a novel recommendation model designed to exploit the complementary effects of explicit and implicit methods by fully leveraging review texts. The model comprehensively captures the complementarity between the two approaches to enhance recommendation performance.
We introduce a self-attention mechanism and a co-attention mechanism to fully exploit intra-method and inter-method feature information. The self-attention mechanism emphasizes the importance of each feature within explicit and implicit methods, while the co-attention mechanism captures the complementarity between the two.
We validate the superiority of the proposed HNNER by comparing it with baseline models using real-world Amazon datasets across three product categories. The experimental results confirm that HNNER achieves better performance than existing models.
This paper is organized as follows. Section 2 reviews related work. Section 3 defines the research problem addressed in this study. Section 4 presents the proposed HNNER model. Section 5 describes the dataset and experimental design. Section 6 reports and discusses the experimental results. Finally, Section 7 concludes the paper and outlines directions for future research.
3. Problem Definition
The overall architecture of the proposed HNNER model is illustrated in Figure 1. The model comprises three main components: the User–Item Interaction (UII) network, the Feature Extraction (FE) network, and the Preference Prediction (PP) network. Research involving explicit and implicit methods in review-based recommendation has been critical to understanding diverse user preferences and behaviors. However, previous studies have typically adopted the two methods separately using general approaches. In this study, we propose HNNER, a recommendation model that integrates fused features by capturing the complementary strengths of explicit and implicit methods. The model incorporates a self-attention mechanism to highlight the most informative aspects of each feature and a co-attention mechanism to model the dependencies between features, thereby generating enriched fused representations. These attentive vectors and fused features are subsequently passed to the Preference Prediction network to estimate user ratings.
Let $\mathcal{D} = \{(u, i, t_{u,i}, r_{u,i})\}$ denote the set of interactions between users and items, where each tuple consists of a user $u$, an item $i$, the user’s review text $t_{u,i}$, and the associated preference rating $r_{u,i}$. The objective of the proposed model is to learn a function $f$ that predicts the preference rating $\hat{r}_{u,i}$ for a given user–item pair. The prediction function $f$ can be defined as

$\hat{r}_{u,i} = f(u, i, t_{u,i}; \Theta) \qquad (1)$

where $\Theta$ represents the model parameters, and $\hat{r}_{u,i}$ is the predicted rating. During training, the model learns to minimize the difference between the predicted rating $\hat{r}_{u,i}$ and the actual rating $r_{u,i}$ by leveraging user $u$, item $i$, and review text $t_{u,i}$. After training, the model outputs a predicted preference rating for unseen items.
4. HNNER Framework
In this study, we propose HNNER, a recommendation model that leverages the complementary strengths of explicit and implicit methods. Specifically, the model applies self-attention and co-attention mechanisms to effectively capture the importance of user preference features and the interdependencies between them. As illustrated in Figure 2, the architecture of HNNER consists of three networks, which are described in detail below.
4.1. User–Item Interaction Network
The objective of the UII network is to learn the complex interactions between users and items. First, user $u$ and item $i$ are embedded to obtain their latent representations, $p_u$ and $q_i$, respectively. These representations are computed as shown in Equation (2):

$p_u = P^{\top} v_u, \qquad q_i = Q^{\top} v_i \qquad (2)$

where $P \in \mathbb{R}^{|U| \times d}$ is the user embedding matrix and $Q \in \mathbb{R}^{|I| \times d}$ is the item embedding matrix; $|U|$ and $|I|$ denote the numbers of unique users and items, respectively; and $d$ is the number of dimensions of the latent vector. $v_u$ and $v_i$ represent the one-hot encodings of user $u$ and item $i$.

Next, the user and item latent vectors are combined using a concatenation operation, as shown in Equation (3):

$z_0 = p_u \oplus q_i \qquad (3)$

where $\oplus$ denotes the concatenation operator. The output vector $z_0$ is then passed through a multi-layer perceptron (MLP) to capture high-order interactions via nonlinear transformations. This process is defined in Equation (4):

$z_l = \sigma(W_l z_{l-1} + b_l), \quad l = 1, \dots, L \qquad (4)$

where $W_l$, $b_l$, and $\sigma$ represent the weight matrix, bias vector, and activation function (ReLU) at the $l$-th layer, respectively. As a result, the final output of this network is the high-level vector representation $h_{\mathrm{UII}} = z_L$, which encodes the interaction between user $u$ and item $i$.
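For concreteness, a minimal PyTorch sketch of the UII network is given below. The class name UIINetwork, the two-layer MLP depth, and the layer widths are illustrative assumptions rather than a definitive implementation of Equations (2)–(4).

```python
# A minimal sketch of the UII network (Equations (2)-(4)); module and
# dimension names are illustrative assumptions.
import torch
import torch.nn as nn

class UIINetwork(nn.Module):
    def __init__(self, n_users: int, n_items: int, d: int = 64):
        super().__init__()
        # Equation (2): embedding lookups play the role of P^T v_u and Q^T v_i
        self.user_emb = nn.Embedding(n_users, d)
        self.item_emb = nn.Embedding(n_items, d)
        # Equations (3)-(4): concatenation followed by a ReLU MLP
        self.mlp = nn.Sequential(
            nn.Linear(2 * d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
        )

    def forward(self, users: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        p_u = self.user_emb(users)            # (batch, d)
        q_i = self.item_emb(items)            # (batch, d)
        z_0 = torch.cat([p_u, q_i], dim=-1)   # Equation (3): concatenation
        return self.mlp(z_0)                  # h_UII, the interaction vector
```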
4.2. Feature Extraction Network
The objective of the FE network is to extract fused feature representations embedded in the review text by leveraging both explicit and implicit methods. Let $t_{u,i} = \{w_1, w_2, \dots, w_n\}$ denote the review written by user $u$ for item $i$, where $n$ is the length of the review and $w_j$ represents the $j$-th word in the review. The details of the feature extraction process are described below.
4.2.1. Explicit Method
The explicit method extracts features from the review text $t_{u,i}$ using the LDA technique, which is a representative approach for explicitly modeling semantic content. Specifically, the LDA technique models the distribution of topics within a document and the distribution of words within each topic. First, for each document, the proportion of topics is initialized using a Dirichlet distribution, which, in turn, determines the word distribution for each topic. Each word in the document is then assigned to a topic based on this distribution. This process is iterated to optimize both the topic distribution across the document and the word distribution within each topic. Accordingly, we apply LDA to the review text $t_{u,i}$ and extract topic probability values, as represented in Equation (5):

$\theta_{u,i} = \mathrm{LDA}(t_{u,i}) \qquad (5)$

where $\theta_{u,i}$ denotes the topic probability vector derived from the review text. This vector is then passed through an MLP, as defined in Equation (6):

$E_l = \sigma(W_l E_{l-1} + b_l), \quad E_0 = \theta_{u,i} \qquad (6)$

where $W_l$, $b_l$, and $\sigma$ refer to the weight matrix, bias vector, and ReLU activation function at the $l$-th layer, respectively. The resulting output $E$ represents the feature vector extracted by the explicit method after MLP processing.
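As a hedged illustration of Equation (5), the snippet below extracts a dense topic probability vector with gensim; the toy corpus and helper name topic_vector are assumptions for demonstration only.

```python
# A sketch of the explicit feature extraction (Equation (5)) using gensim's
# LdaModel; the toy corpus is for illustration only.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenized_reviews = [["great", "guitar", "warm", "sound"],
                     ["fast", "delivery", "good", "price"]]
dictionary = Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=6, random_state=42)

def topic_vector(tokens: list, num_topics: int = 6) -> np.ndarray:
    """Dense topic probability vector for one review (theta in Equation (5))."""
    bow = dictionary.doc2bow(tokens)
    theta = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        theta[topic_id] = prob
    return theta  # subsequently fed to the MLP of Equation (6)
```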
The self-attention mechanism computes the contextualized representation of the explicit feature vector $E$ by modeling pairwise interactions between feature components. As shown in Equation (7), the attention output is computed as

$A_E = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (7)$

where $Q$, $K$, and $V \in \mathbb{R}^{n_f \times d_k}$ represent the query, key, and value matrices, with $n_f$ denoting the number of features and $d_k$ their embedding dimension. These are derived via learned linear projections from $E$, as shown in Equation (8):

$Q = E W^{Q}, \quad K = E W^{K}, \quad V = E W^{V} \qquad (8)$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are trainable weight matrices. This formulation allows the model to compute dynamic attention weights that reflect the relative importance of each feature. The resulting vector $A_E$ encodes the attended representation of the explicit features for downstream fusion and prediction.
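A minimal PyTorch sketch of Equations (7) and (8) follows; an equivalent result could be obtained with nn.MultiheadAttention, but the explicit form makes the learned projections visible. Dimension names are assumptions.

```python
# Scaled dot-product self-attention over feature components
# (Equations (7)-(8)); a sketch, not the authors' exact implementation.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_k: int):
        super().__init__()
        # Equation (8): learned linear projections W^Q, W^K, W^V
        self.w_q = nn.Linear(d_k, d_k, bias=False)
        self.w_k = nn.Linear(d_k, d_k, bias=False)
        self.w_v = nn.Linear(d_k, d_k, bias=False)
        self.d_k = d_k

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (batch, n_features, d_k)
        q, k, v = self.w_q(e), self.w_k(e), self.w_v(e)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # Equation (7)
        return torch.softmax(scores, dim=-1) @ v                # A_E
```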
4.2.2. Implicit Method
The implicit method extracts features from the review text $t_{u,i}$ using a pre-trained BERT model, which has demonstrated strong performance in natural language processing tasks. We use the [CLS] token as the text embedding vector, since the [CLS] representation in BERT captures the overall semantic meaning of the input sentence. This is because the [CLS] token is connected to a layer that is fully connected to all other tokens [32]. The output of the [CLS] token is a 768-dimensional vector, which is denoted as shown in Equation (9):

$c_{u,i} = \mathrm{BERT}_{[\mathrm{CLS}]}(t_{u,i}) \qquad (9)$

where $c_{u,i} \in \mathbb{R}^{768}$ represents the embedding vector corresponding to the [CLS] token in the review text $t_{u,i}$.

To incorporate contextual information in both forward and backward directions, we apply a bidirectional gated recurrent unit (Bi-GRU) to the output of BERT. The GRU is a type of recurrent neural network that uses a reset gate and an update gate to effectively model sequential dependencies. Bi-GRU extends this mechanism by processing the input in both directions to enhance context representation. The computation procedure is described in Equation (10):

$I = \mathrm{BiGRU}(c_{u,i}) \qquad (10)$

where $I$ denotes the feature representation extracted by the implicit method. This output is then passed through a self-attention mechanism, analogous to Equations (7) and (8), to identify the most informative features. The attention operation is defined in Equation (11):

$A_I = \mathrm{softmax}\!\left(\dfrac{Q_I K_I^{\top}}{\sqrt{d_k}}\right)V_I \qquad (11)$

where $Q_I$, $K_I$, and $V_I$ are learned linear projections of $I$, and $A_I$ is the attentive representation vector computed by the self-attention mechanism, which reflects the relative importance of the features extracted by the implicit method. Consequently, the outputs of this process are $I$, the latent feature representation, and $A_I$, the attention-weighted representation vector that emphasizes significant components in the implicit method.
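The implicit path of Equations (9) and (10) can be sketched with the Hugging Face transformers library as below; batching, device placement, and the Bi-GRU hidden size are illustrative assumptions.

```python
# BERT [CLS] embedding followed by a Bi-GRU (Equations (9)-(10));
# a sketch under assumed hyperparameters.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bi_gru = nn.GRU(input_size=768, hidden_size=64,
                bidirectional=True, batch_first=True)

def implicit_features(review: str) -> torch.Tensor:
    inputs = tokenizer(review, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        # Equation (9): 768-dimensional [CLS] vector
        cls = bert(**inputs).last_hidden_state[:, 0, :]
    out, _ = bi_gru(cls.unsqueeze(1))  # Equation (10): Bi-GRU over the embedding
    return out.squeeze(1)              # I, of size 2 * hidden_size
```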
4.2.3. Feature Fusion
Feature fusion complementarily combines explicit and implicit methods to extract enriched representations. The goal of this study is to leverage the strengths of both approaches to improve recommendation accuracy. To this end, we employ a co-attention mechanism that simultaneously learns the interactions among features, identifies their relative influence, and captures deeper semantic representations. The co-attention mechanism is implemented as a variant of the multi-head attention module used in transformer networks. The multi-head attention mechanism enables the model to jointly attend to information from different representation subspaces. As shown in Equation (12), it computes $h$ independent attention heads, each defined as

$\mathrm{head}_j = \mathrm{Attention}(Q W_j^{Q}, K W_j^{K}, V W_j^{V}), \qquad \mathrm{MultiHead}(Q, K, V) = (\mathrm{head}_1 \oplus \dots \oplus \mathrm{head}_h)\, W^{O} \qquad (12)$

where $W_j^{Q}$, $W_j^{K}$, and $W_j^{V}$ are the projection matrices for the $j$-th head. The outputs of all heads are concatenated and linearly transformed using the output projection matrix $W^{O} \in \mathbb{R}^{h d_h \times d}$, where $d_h$ denotes the dimension per head. By assigning distinct projections to each head, the model captures diverse and complementary aspects of the input through parallel attention mechanisms.
In this context, the explicit and implicit feature vectors obtained from the previous steps, $E$ and $I$, are input into the multi-head attention mechanism, as shown in Equation (13):

$C_{E \to I} = \mathrm{MultiHead}(E, I, I), \qquad C_{I \to E} = \mathrm{MultiHead}(I, E, E) \qquad (13)$

where $C_{E \to I}$ is the review representation generated by attending explicit features to implicit features, and $C_{I \to E}$ is the reverse.
Subsequently, we apply residual connections and two independent feed-forward networks (FFNs), preceded by layer normalization (LN), to generate the fused encodings $F_E$ and $F_I$. This process is described in Equation (14):

$F_E = \mathrm{FFN}_E(\mathrm{LN}(E + C_{E \to I})), \qquad F_I = \mathrm{FFN}_I(\mathrm{LN}(I + C_{I \to E})) \qquad (14)$

After obtaining $F_E$ and $F_I$, the two vectors are merged via element-wise product, as defined in Equation (15):

$F = F_E \odot F_I \qquad (15)$

where $\odot$ denotes the element-wise product operation. The attentive representation vectors $A_E$ and $A_I$, obtained from the explicit and implicit methods, respectively, are then combined with $F$ to construct the final fused feature representation. This operation is defined in Equation (16):

$F_{\mathrm{fused}} = A_E \oplus F \oplus A_I \qquad (16)$

where $F_{\mathrm{fused}}$ denotes the fused feature vector obtained by complementarily integrating explicit and implicit features. As a result, the output of this module is the final fused representation $F_{\mathrm{fused}}$, which serves as a comprehensive semantic embedding used in downstream preference prediction.
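A compact PyTorch sketch of the fusion module (Equations (13)–(16)) built on nn.MultiheadAttention is given below; the module name CoAttentionFusion, the FFN depth, and the assumption that all inputs share dimension d are illustrative choices.

```python
# Co-attention fusion (Equations (13)-(16)); a sketch under assumed shapes.
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, d: int = 64, heads: int = 2):
        super().__init__()
        self.e_to_i = nn.MultiheadAttention(d, heads, batch_first=True)
        self.i_to_e = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln_e, self.ln_i = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn_e = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.ffn_i = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, e, i, a_e, a_i):
        # Equation (13): cross-attend each modality to the other
        c_ei, _ = self.e_to_i(query=e, key=i, value=i)
        c_ie, _ = self.i_to_e(query=i, key=e, value=e)
        # Equation (14): residual connection, layer norm, feed-forward
        f_e = self.ffn_e(self.ln_e(e + c_ei))
        f_i = self.ffn_i(self.ln_i(i + c_ie))
        f = f_e * f_i                             # Equation (15)
        return torch.cat([a_e, f, a_i], dim=-1)   # Equation (16)
```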
4.3. Preference Prediction Network
The PP network estimates the final user rating based on both the user–item interaction representation and the fused review-based features. To accomplish this, we first combine $h_{\mathrm{UII}}$, obtained from the UII network, with $F_{\mathrm{fused}}$, extracted from the FE network. This computation is represented in Equation (17):

$g_0 = h_{\mathrm{UII}} \oplus F_{\mathrm{fused}} \qquad (17)$

where $g_0$ denotes the concatenated vector of the two representations. This combined vector is then passed through an MLP to predict the rating, as defined in Equation (18):

$g_l = \sigma(W_l g_{l-1} + b_l) \qquad (18)$

where $W_l$, $b_l$, and $\sigma$ refer to the weight matrix, bias vector, and ReLU activation function at the $l$-th layer, respectively. $g_L$ is the output of the MLP, which is subsequently input into the final prediction layer. The predicted rating is obtained through a regression layer, as described in Equation (19):

$\hat{r}_{u,i} = W_o\, g_L \qquad (19)$

where $W_o$ is the weight matrix of the output layer, and $\hat{r}_{u,i}$ denotes the predicted user preference score. Since rating prediction is a regression task, we follow prior research and train our model using mean squared error (MSE) as the loss function. The loss is calculated as shown in Equation (20):

$\mathcal{L} = \dfrac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left(\hat{r}_{u,i} - r_{u,i}\right)^2 \qquad (20)$

where $\mathcal{T}$ denotes the set of user–item pairs used for training, $\hat{r}_{u,i}$ is the predicted rating, and $r_{u,i}$ is the actual rating.
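The PP network of Equations (17)–(19), together with the MSE loss of Equation (20), can be sketched as follows; layer widths and depth are assumptions.

```python
# Preference prediction head (Equations (17)-(20)); widths are assumed.
import torch
import torch.nn as nn

class PreferencePrediction(nn.Module):
    def __init__(self, d_in: int, d_hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(                      # Equation (18)
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        )
        self.out = nn.Linear(d_hidden, 1, bias=False)  # Equation (19)

    def forward(self, h_uii: torch.Tensor, f_fused: torch.Tensor) -> torch.Tensor:
        g_0 = torch.cat([h_uii, f_fused], dim=-1)      # Equation (17)
        return self.out(self.mlp(g_0)).squeeze(-1)     # predicted rating

loss_fn = nn.MSELoss()  # Equation (20), averaged over the training pairs
```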
To optimize the model parameters and minimize the loss function during training, we adopt the Adaptive Moment Estimation (Adam) optimizer, which is based on the stochastic gradient descent (SGD) method. Adam dynamically adjusts the learning rate for each parameter and ensures that large gradients do not result in overly large parameter updates, thereby maintaining training stability.
In addition, given that neural networks are prone to overfitting, we implement several regularization strategies. First, dropout is applied, and the dropout rate is fine-tuned for each dataset. Second, if the validation loss does not decrease, the learning rate is reduced by 10% to enable more refined gradient updates. Finally, early stopping is employed to prevent overfitting when the validation loss does not improve for five consecutive epochs.
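A hedged sketch of these training safeguards is shown below; model, train_one_epoch, and evaluate are assumed helpers, and the 10% reduction is expressed as factor=0.9.

```python
# Adam with learning-rate reduction on plateau and early stopping after
# five epochs without improvement; model and helper functions are assumed.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.9, patience=0)  # cut LR by 10% on plateau

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)  # assumed training helper
    val_loss = evaluate(model)         # assumed validation helper
    scheduler.step(val_loss)           # reduce LR if the loss did not decrease
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # early stopping
            break
```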
5. Experiments
To evaluate the recommendation performance of the proposed HNNER model, we conducted a series of experiments using datasets from three product categories on Amazon.com. This section aims to address the following four research questions (RQs):
RQ 1: Does the proposed HNNER model outperform baseline recommendation models?
RQ 2: Do the training time and training loss of the HNNER model indicate computational efficiency?
RQ 3: What is the most efficient fusion method for recommendation performance?
RQ 4: How do different hyperparameter settings influence the performance of the proposed HNNER model?
5.1. Datasets
To evaluate the recommendation performance of the proposed HNNER model, we used publicly available datasets from the Amazon e-commerce platform, specifically the Musical Instruments, Digital Music, and Video Games categories [33]. Amazon is the world’s largest e-commerce platform and is widely used in recommender system research due to its extensive data, including purchase histories and user-generated review texts [6]. The review data span from May 1996 to October 2018, covering over two decades of user activity and product interactions. The datasets vary in scale and content complexity: Video Games contains relatively longer and more descriptive reviews, whereas Digital Music tends to include shorter user feedback. Dataset sparsity is summarized in Table 1, and all three datasets exhibit high sparsity levels exceeding 99%, a common challenge in recommender systems that motivates the use of auxiliary textual information. The datasets were preprocessed through the following steps. First, review texts were tokenized and converted to lowercase. Second, stop words, extra spaces, non-English characters, special characters, and numeric values were removed. Third, words with a frequency of three or fewer were filtered out to reduce noise. Fourth, we excluded users who had purchased five or fewer items to ensure sufficient interaction data per user. The final datasets were randomly split into training (70%), validation (10%), and test (20%) sets. A detailed statistical summary, including the number of users, items, reviews, and sparsity ratios, is presented in Table 1.
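A minimal sketch of this preprocessing pipeline is given below; the stop-word subset and regular expression are illustrative assumptions, not the exact rules used in our experiments.

```python
# Review text cleaning and rare-word filtering, as described above;
# the stop-word subset is an assumption.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "is", "it", "this", "to", "of"}

def clean(review: str) -> list:
    """Lowercase, strip non-English characters/digits/symbols, drop stop words."""
    review = re.sub(r"[^a-z\s]", " ", review.lower())
    return [w for w in review.split() if w not in STOP_WORDS]

def filter_rare(tokenized_reviews: list, min_freq: int = 4) -> list:
    """Remove words appearing three times or fewer across the corpus."""
    counts = Counter(w for doc in tokenized_reviews for w in doc)
    return [[w for w in doc if counts[w] >= min_freq]
            for doc in tokenized_reviews]
```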
5.2. Metrics
We used Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) as evaluation metrics to measure the predictive performance of the proposed model. These metrics are widely adopted in recommender system research [6]. MAE is calculated by dividing the sum of the absolute differences between the actual and predicted ratings by the total number of rating instances. It treats all prediction errors equally, regardless of their magnitude. RMSE is computed by taking the square root of the average of the squared differences between the actual and predicted ratings. Compared to MAE, RMSE penalizes larger errors more heavily, making it sensitive to outliers. The formulas for MAE and RMSE are defined in Equations (21) and (22), respectively:

$\mathrm{MAE} = \dfrac{1}{N} \sum_{(u,i)} \left| r_{u,i} - \hat{r}_{u,i} \right| \qquad (21)$

$\mathrm{RMSE} = \sqrt{\dfrac{1}{N} \sum_{(u,i)} \left( r_{u,i} - \hat{r}_{u,i} \right)^2} \qquad (22)$

where $N$ is the total number of ratings, $\hat{r}_{u,i}$ is the predicted rating, and $r_{u,i}$ is the actual rating. Both metrics evaluate the accuracy of the predicted ratings by quantifying the discrepancy between actual and predicted values. Lower MAE and RMSE values indicate higher prediction accuracy.
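Both metrics reduce to a few lines of NumPy, as the sketch below shows.

```python
# MAE and RMSE as defined in Equations (21)-(22).
import numpy as np

def mae(actual: np.ndarray, predicted: np.ndarray) -> float:
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual: np.ndarray, predicted: np.ndarray) -> float:
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))
```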
5.3. Baseline Models
To reliably validate the proposed HNNER model, we compared its recommendation performance with that of selected baseline models that are widely used in existing recommendation studies.
PMF [34]: This model captures user–item interactions by assuming Gaussian-distributed latent factors for users and items. This probabilistic approach is well suited for dealing with sparsity and imbalance in rating data.
HFT [16]: This model utilizes topic distributions in review texts to learn latent factors for users and items. Specifically, it applies the LDA technique to extract hidden topics from review texts and combines them with latent factors obtained through rating matrix factorization.
DeepCoNN [7]: This model uses two parallel CNNs, one processing a user’s reviews and the other an item’s reviews, to extract user and item representations. It then uses a factorization machine to predict ratings.
NARRE [21]: This model incorporates review texts and rating data, using CNNs to extract features and an attention mechanism to identify informative reviews. By assigning lower weights to irrelevant content, the model improves the quality of user–item latent representations for rating prediction.
SAFMR [35]: This model integrates CNN-based feature extraction with a self-attention mechanism to model user and item representations from review texts. The self-attention mechanism captures the significance of different textual features, contributing to more accurate rating predictions.
HNNER-E: This model is a variant of the proposed HNNER model that considers only the explicit method. Specifically, we used topic probability values extracted through the LDA technique and applied a self-attention mechanism to consider the importance of features.
HNNER-I: This model is a variant of the proposed HNNER model that considers only the implicit method. Specifically, we used a feature representation extracted using the BERT and Bi-GRU techniques and applied a self-attention mechanism to consider the importance of features.
To summarize the characteristics of these models, Table 2 presents the data type and the key methods employed in each.
5.4. Implementation
For a fair performance comparison, all models were evaluated under an identical experimental setup, consisting of 128 GB of RAM and an NVIDIA V100 GPU (NVIDIA Corporation, Santa Clara, CA, USA). Each result was averaged over five runs to ensure robustness. Hyperparameters for HNNER were tuned based on performance on the validation set. In our implementation, we utilized a pretrained BERT-base model to extract contextual embeddings from review texts. Specifically, we used the [CLS] token representation as the sentence-level embedding for each review. The maximum input length was set to 512 tokens [9].
In our hyperparameter experiments, we fixed the embedding size to 64. The batch size was selected from [64, 128, 256, 512, 1024] based on validation MAE. The learning rate was selected from [0.001, 0.005, 0.0001, 0.0005, 0.00001, 0.00005] by monitoring convergence stability and final validation loss. The number of attention heads was selected from [2, 4, 6, 8, 10, 12] by comparing validation accuracy and computational cost. The dropout rate was selected from [0.1, 0.2, 0.3, 0.4, 0.5], with the optimal value chosen based on generalization performance across epochs. Finally, the optimal values were batch sizes of 128 for the Musical Instruments and Video Games datasets and 256 for the Digital Music dataset, dropout rates of 0.1 for the Musical Instruments and Digital Music datasets and 0.2 for the Video Games dataset, and a learning rate of 0.001 with two attention heads for all datasets.
In addition, to determine the number of topics for the LDA method, we set the maximum number of topics to 15 and selected the value that achieved the highest coherence score, which reflects the semantic consistency of the discovered topics. Accordingly, the optimal number of topics was 6, 7, and 12 for the Musical Instruments, Digital Music, and Video Games datasets, respectively.
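The selection procedure can be sketched as below; the c_v coherence measure is an assumption, since the text does not specify which measure was used.

```python
# Sweep candidate topic counts and keep the one with the highest coherence.
from gensim.models import CoherenceModel, LdaModel

def best_num_topics(corpus, dictionary, texts, max_topics: int = 15) -> int:
    scores = {}
    for k in range(2, max_topics + 1):
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, random_state=42)
        cm = CoherenceModel(model=lda, texts=texts,
                            dictionary=dictionary, coherence="c_v")
        scores[k] = cm.get_coherence()  # semantic consistency of topics
    return max(scores, key=scores.get)
```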
To ensure a fair comparison, the hyperparameters of the baseline models were determined empirically. Each model was trained on the training dataset, hyperparameters were tuned on the validation set, and performance was evaluated on the test set. The tuned hyperparameters included the latent vector dimension, learning rate, dropout rate, and mini-batch size. For CNN-based models (i.e., DeepCoNN, NARRE, and SAFMR), CNN-specific parameters such as the number of kernels, window size, and number of channels were set according to the configurations reported in the original papers.
6. Experimental Results and Discussion
6.1. Performance Comparison with Baseline Models (RQ 1)
We evaluated the performance of the proposed HNNER model through comparisons with several baseline models on three Amazon product datasets. As shown in Table 3, the proposed model consistently demonstrated superior performance across all evaluation cases. To verify the statistical significance of the observed improvements, we conducted paired t-tests between the proposed model and each of the baseline models. The results confirmed that the performance gains achieved by HNNER are statistically significant (p < 0.05) across all datasets. Based on these results, we conducted a detailed analysis from the following three perspectives.
First, the review-based models, including HFT, DeepCoNN, NARRE, and SAFMR, achieved better recommendation performance than PMF. The former group incorporates review text to predict user preferences, while the latter relies solely on rating data. Using ratings as the only information source is limited in its ability to comprehensively capture the underlying motivations of user purchasing behavior. In contrast, review texts contain rich semantic content that enhances the representation of user and item characteristics, thereby improving predictive performance.
Second, among the review-based models, those employing implicit methods (DeepCoNN, NARRE, and SAFMR) outperformed the explicit method (HFT). This is primarily because deep learning-based models can capture complex semantic structures and nonlinear relationships in review texts. Moreover, the application of regularization techniques such as dropout contributes to mitigating overfitting, further enhancing recommendation accuracy.
Finally, the proposed HNNER model outperformed all baseline models. Specifically, it achieved RMSE improvements ranging from 29.75% to 57.92% and MAE improvements ranging from 47.92% to 76.46% compared to PMF. These improvements suggest that incorporating review texts allows the model to capture nuanced aspects of user rating behavior that traditional rating-only models overlook. In comparison with the HFT model, HNNER achieved 17.11% to 23.37% improvement in RMSE and 27.24% to 35.40% in MAE. Furthermore, compared to DeepCoNN, NARRE, and SAFMR, the proposed model demonstrated improvements of 11.80% to 17.24% in RMSE and 19.71% to 24.18% in MAE on average.
These results demonstrate that a recommendation model integrating both implicit and explicit methods can effectively capture the complementary information embedded in user-generated content. To validate the contribution of each component, we conducted ablation experiments by evaluating the effects of implicit and explicit methods separately while maintaining all other components of the HNNER model. The results revealed that the version of the model utilizing only implicit methods outperformed the one using only explicit methods. This confirms that deep learning-based models are more effective at capturing the deeper semantic features present in review texts.
6.2. Comparative Analysis of Training Efficiency (RQ 2)
HNNER combines explicit and implicit representations through two feature learning modules, resulting in a relatively more complex model architecture. To evaluate the training efficiency of HNNER, we compared its per-epoch training time and GPU memory usage with those of the baseline models using the full training datasets. Among the baselines, we focused particularly on DeepCoNN and SAFMR, which demonstrated the strongest performance among the deep learning-based methods. The results are summarized in Table 4 and Table 5.
Despite its larger parameter size, HNNER achieves superior training efficiency by leveraging a design that processes individual review texts without aggregation. This approach substantially reduces input redundancy and minimizes unnecessary computation. Furthermore, the model incorporates shared attention layers across feature modalities, contributing to more compact and computationally efficient learning without compromising representational depth.
As shown in Table 4, HNNER outperforms DeepCoNN and SAFMR with significantly reduced training times across all datasets, recording up to 56.9% and 47.2% faster training on the Digital Music and Video Games datasets, respectively. Moreover, Table 5 reveals that HNNER consumes the least GPU memory among all models, indicating efficient memory utilization despite its parameter scale. This contrast between high parameter count and low memory consumption suggests that the model’s architectural design effectively decouples representation richness from computational burden.
These empirical findings are corroborated by Figure 3 and Figure 4, which visualize the convergence patterns and training dynamics. HNNER achieves rapid convergence and consistently yields the lowest training loss, MAE, and RMSE values across epochs, further confirming its ability to extract semantically rich and discriminative features from review texts with high computational efficiency.
6.3. Comparative Analysis of Fusion Strategies (RQ 3)
In this section, we empirically evaluate multiple fusion strategies to determine which method most effectively captures the complementary strengths of explicit and implicit features. We compare the following approaches: Add, Average, Concatenation, and Element-wise Product, as well as ablation variants of the proposed attention-based fusion, specifically excluding either the self-attention or co-attention component.
The results in Table 6 indicate that the co-attention fusion strategy achieves the best performance across all three datasets. This superiority is attributed to its ability to explicitly model the mutual dependencies between explicit and implicit features, thereby generating more discriminative fused representations.
Interestingly, the removal of either self-attention or co-attention leads to performance degradation, confirming that both mechanisms contribute complementary benefits. The exclusion of self-attention, which captures internal relationships within each modality, results in higher error values across two of the datasets. Likewise, removing co-attention, which facilitates cross-modal interaction, reduces the model’s ability to integrate semantic alignment between features. These findings validate the architectural design choice of incorporating both attention mechanisms.
In contrast, simpler fusion strategies, such as Add, Average, and Element-wise Product, perform consistently worse. These methods treat feature dimensions independently and fail to capture the rich inter-feature dependencies necessary for nuanced recommendation decisions. Based on these findings, we retain the full attention-based fusion scheme, comprising both self- and co-attention modules, in the proposed HNNER model.
6.4. Effect of Hyperparameter Settings (RQ 4)
To validate the effectiveness of HNNER, we conducted experiments focusing on the impact of key hyperparameter settings that influence model performance. Specifically, we performed four additional experiments, each varying one of the following hyperparameters: the batch size, dropout rate, learning rate, and the number of attention heads (multi-head).
First, as shown in Table 7, the best performance was achieved with a batch size of 128 for the Musical Instruments and Video Games datasets, and 256 for Digital Music. These results highlight the importance of tuning the batch size, as excessively large batches may hinder effective parameter updates, while moderately smaller batches tend to yield better generalization.
Second, we examined the effect of the dropout rate on the performance of HNNER. As presented in Table 8, a dropout rate of 0.1 yielded the best performance on the Musical Instruments and Digital Music datasets, whereas a rate of 0.2 produced slightly better results on the Video Games dataset. These findings suggest that lower dropout rates contribute to better generalization in the context of our architecture.
In contrast, performance degradation was observed as the dropout rate increased beyond 0.3. For instance, on the Musical Instruments dataset, the MAE increased from 0.570 at a dropout rate of 0.1 to 0.782 at a rate of 0.4, a relative error increase of over 37%. Similar patterns were observed on the other datasets across both MAE and RMSE metrics. These results suggest that although dropout helps mitigate overfitting, excessively high dropout levels may impair the model’s ability to retain salient feature representations, ultimately leading to underfitting. This highlights the importance of precisely calibrating the dropout rate to strike a balance between regularization and representational capacity [4,36].
Third, we assessed the impact of the learning rate on model performance. As shown in Table 9, a learning rate of 0.001 consistently achieved the best results across all datasets. This demonstrates the importance of selecting an appropriate learning rate to balance convergence speed and generalization; improper learning rates may lead to underfitting or overfitting.
Finally, to evaluate the effect of the multi-head attention configuration, we varied the number of attention heads and analyzed its influence on performance. As shown in Table 10, the best results were consistently observed with two attention heads. While additional attention heads can improve feature diversity, they may also introduce excessive transformation steps, increasing the risk of cumulative errors due to the discretization of attention distributions. This can negatively affect the model’s ability to represent semantic relationships. Therefore, selecting an appropriate number of attention heads is crucial for maintaining model stability and effectiveness.
6.5. Case Study
Previous studies have primarily used qualitative methods to assess the explanatory capability of recommendation models [37]. To evaluate the explanatory power of the proposed model more intuitively along these lines, we present generated topic descriptions for a few randomly selected items from our test dataset. Table 11 shows such an example, from which we can draw observations on three aspects of the proposed model. The first column shows the image of the target item, followed by the images of two previously purchased items in the second column. The third column presents the user’s review of the target item, and the fourth highlights the relevant topic description. In the Review Text column, bold italic text indicates the parts of the user review that align with the extracted topic description.
The proposed model can provide meaningful explanations because it is capable of extracting topics that represent user preferences from user reviews through topic modeling. For example, in Case 2, “Color” and “Style” were highlighted as key themes, whereas in the other cases, “Design” and “Delivery” emerged as key themes. These extracted themes reflect the items that each user had previously purchased.
In Case 4, although the target item and the user’s past purchases differ in many aspects, the core similarity of “Sports Game” was successfully captured in the topic description. This demonstrates that our model can accurately identify user preferences from review text.
Cases 2 and 6 illustrate that the proposed model emphasized different features (color and brand) of similar items (e.g., a mouse) for each user. This reveals that the topic descriptions are personalized and demonstrates the effectiveness of the attention mechanism, which enables the model to more closely reflect individual preferences.
This case analysis reveals that the proposed model (1) can leverage review text to provide effective information regarding user preferences, which can be used to generate more accurate topic descriptions, and (2) works effectively by integrating explicit and implicit methods.
7. Conclusions and Future Studies
Numerous studies have addressed the data sparsity problem in recommender systems by extracting rich features from user reviews and incorporating them into model architectures. Review texts contain valuable user preference information related to item attributes, which not only clarifies the rationale behind recommendations but also enhances recommendation accuracy. Review-based recommender systems are typically classified into implicit and explicit methods based on how features are extracted from textual content. Both methods are effective and offer unique advantages. However, relatively few studies have explored the complementary use of both approaches within a unified framework.
To address this gap, we proposed a novel recommendation model, HNNER, which explicitly considers the complementarity between implicit and explicit methods. The model captures the importance of individual feature representations and the interdependence between them using self-attention and co-attention mechanisms. Experimental comparisons with baseline models confirmed the superior performance of the proposed approach. These findings emphasize the value of combining explicit and implicit representations for improving user preference prediction. Accordingly, this work offers a new direction for advancing research in recommender systems and demonstrates the practical effectiveness of the HNNER model.
Despite the promising results, this study has several limitations that suggest directions for future research. First, while the proposed model achieves interpretability and efficiency by leveraging LDA and BERT, it does not incorporate more recent advances in semantic modeling, such as BERTopic, RoBERTa, or GPT-based encoders. Future work should perform systematic comparisons with these models to better position the architectural choices within the current landscape of hybrid recommender systems. Second, the model operates solely on review texts and ratings, without integrating structured metadata (e.g., brand and price) or behavioral signals (e.g., clicks and browsing patterns). This modality limitation may restrict its generalizability across domains and reduce effectiveness in real-world applications. Exploring multi-modal fusion strategies could enhance contextual awareness and performance in heterogeneous environments. Third, the model’s performance in cold-start scenarios remains untested. For users or items with limited historical data, reliance on review-based features may be insufficient. Incorporating auxiliary features or pre-trained user and item profiles could help extend applicability to sparse or zero-shot settings. Fourth, practical considerations such as memory usage and scalability were not evaluated in this study. While training time comparisons were included, future work should report GPU memory consumption, computational complexity (e.g., FLOPs), and inference latency to assess real-world deployment feasibility. Additionally, ethical implications such as bias propagation from BERT embeddings and privacy risks from review parsing deserve further attention through techniques like debiased representations and differential privacy.