Applied Sciences
  • Article
  • Open Access

16 September 2021

A Hybrid CNN-Based Review Helpfulness Filtering Model for Improving E-Commerce Recommendation Service

1 Department of Big Data Analytics, Kyung Hee University, 26, Kyungheedae-ro, Dongdaemun-gu, Seoul 02447, Korea
2 School of Management, Kyung Hee University, 26, Kyungheedae-ro, Dongdaemun-gu, Seoul 02447, Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Convolutional Neural Networks

Abstract

As the e-commerce market grows worldwide, personalized recommendation services have become essential for providing users with items or services tailored to their tastes. Such services can decrease the cost of users’ information exploration and have a positive impact on corporate sales growth. Recently, many studies have used reviews written by users to address traditional recommender system research problems. However, reviews can include content that is not conducive to purchasing decisions, such as advertising, false reviews, or fake reviews. Using such reviews to provide recommendation services can lower recommendation performance as well as trust in the company. This study proposes a novel review-helpfulness-based recommendation methodology (RHRM) framework to support users’ purchasing decisions in personalized recommendation services. The core of our framework is a review semantics extractor and a user/item recommendation generator. The review semantics extractor learns review representations with a hybrid convolutional neural network and bidirectional long short-term memory (CNN–BiLSTM) network for review helpfulness classification. The user/item recommendation generator models each user’s preference for items based on their past interactions, where past interactions include only records in which the user’s reviews of items are helpful. Since many reviews do not have helpfulness scores, we first propose a helpfulness classification model, trained on the limited set of reviews that have helpfulness scores, to reflect the review helpfulness that significantly impacts users’ purchasing decisions. Several experiments with the Amazon dataset show that using review helpfulness information in the recommender system further improves performance measures such as the accuracy of the personalized recommendation service, thereby enhancing user satisfaction and increasing trust in the company.

1. Introduction

As the e-commerce market grows rapidly worldwide with the development of information technology and the popularization of mobile devices, various types of products continue to be released [1,2]. However, users face a time-consuming information-overload problem in the purchasing decision-making process. The problem is compounded online, where users can only experience products indirectly. Therefore, personalized recommendation services have become important in providing personalized items or services to users. Global e-commerce companies such as Netflix, Amazon, and Google have introduced personalized recommendation services to help users make purchasing decisions [3,4,5]. These services can decrease the cost of users’ information exploration and have a positive impact on corporate sales growth. For example, 75% of videos viewed by users on Netflix are provided through personalized recommendation services, and Amazon generates 35% of its total revenue from items recommended to users through personalized recommendation services [6].
Collaborative Filtering (CF) is the state-of-the-art recommendation approach; it identifies interactions between users and items and provides personalized recommendation services from quantitative information such as clicks, ratings, and views [7,8,9,10,11]. However, such methodologies model only behavioral patterns and fail to capture qualitative preferences such as the motivation and reason for purchasing an item [12,13], which can degrade recommendation performance [1,14]. Recently, many studies have used various additional information sources to address this limitation. Most e-commerce websites provide review modules where users write reviews of their purchased items. According to Moore [15], 88% of users make purchasing decisions by referring to reviews. Review texts can be helpful because they include specific and reliable information, such as the reason for purchasing and an evaluation of the item [14]. Existing studies on review-based personalized recommendation services, however, have mainly focused on extracting sentiment features or exploring several attributes and combining them with the CF approach [16]. Moreover, reviews also include content that is not conducive to purchasing decisions, such as advertising, meaningless content, or fake reviews [17]. Providing recommendation services without considering review quality may therefore decrease recommendation performance [18].
To address these limitations, this study aims to reflect review helpfulness information, which can affect users’ purchase decisions, in personalized recommendation services. The number of reviews per item has been increasing as more users purchase items on e-commerce websites. As Table 1 shows, users can identify a product’s characteristics from reviews and use much of this information in the purchase decision-making process. However, users cannot refer to all reviews, so they have difficulty finding helpful reviews during the purchase process. To address this issue, Amazon has provided a review helpfulness voting module since 2007 that confirms whether reviews are helpful for purchase decisions [19]. Reviews are ranked by the number of helpfulness votes, and the most-voted reviews appear at the top of the list. Because review helpfulness information plays a significant role in the user’s purchase decision-making process, it is also essential for providing personalized recommendation services [20].
Table 1. Number of reviews received by Amazon Best Sellers items.
This study proposes a novel review-helpfulness-based recommendation methodology (RHRM) framework that supports users’ purchasing decisions in personalized recommendation services. Our framework consists of three phases: a review semantics extractor, a user profile producer, and a user/item recommendation generator. First, in the review semantics extractor phase, we generate review representations hierarchically for review helpfulness classification. We first extract the review’s semantic representation using a Convolutional Neural Network (CNN), then obtain bidirectional representations using a Bi-directional Long Short-Term Memory (BiLSTM) attention network, and combine these representations into a final semantic representation. Since many reviews do not have helpfulness scores, we first propose a helpfulness classification model to reflect the review helpfulness that significantly impacts users’ purchasing decisions in personalized recommendation services. This CNN–BiLSTM hybrid model uses the generated semantic representation to classify the helpfulness of reviews, and the classified helpfulness information is passed to the user profile producer phase. Second, the user profile producer phase uses the helpfulness classification results to update user profiles based on the helpful reviews each user has written; the updated user profile contains only the user/item interactions for which the user wrote helpful reviews. Finally, the user/item recommendation generator applies popular CF techniques to model users’ preferences for items based on the interaction profiles produced in phase 2. We applied User-Based CF (UBCF), Singular Value Decomposition (SVD), and Neural Collaborative Filtering (NCF), the most popular CF models. We conducted extensive experiments with the Amazon dataset, and the results demonstrate that our framework can effectively improve recommendation performance when review helpfulness information is reflected. The contributions of this paper are summarized as follows:
  • This study is the first to propose the RHRM framework, which filters reviews by helpfulness and reflects them in personalized recommendation services. It can enhance recommendation performance because it mirrors the purchasing behavior of users who consult reviews when buying items.
  • This study builds a review helpfulness classification model using a combined CNN and BiLSTM architecture that has demonstrated excellent performance in Natural Language Processing (NLP) studies. We confirm the advantages of the CNN–BiLSTM hybrid model in semantic representation extraction through various experiments.
  • This study conducts several experiments with the Amazon dataset. The results indicate that reflecting review helpfulness information can enhance the prediction performance of personalized recommendation services, increase user satisfaction, and raise confidence in the company.
The rest of this paper is organized as follows. Section 2 describes the theoretical background of personalized recommendation services, review-based personalized recommendation services, and review text classification with deep learning approaches. Section 3 describes the proposed recommendation framework. Section 4 describes the experimental dataset, evaluation metrics, and results. Finally, Section 5 presents the discussion, limitations, and future research directions.

3. RHRM Framework

In this section, we describe the RHRM framework shown in Figure 1. Our framework consists of three phases: a review semantics extractor, a user profile producer, and a user/item recommendation generator. The first phase classifies the helpfulness of reviews: it uses a CNN–BiLSTM hybrid model to generate review semantic representations and performs review helpfulness classification [50,51]. The second phase produces a user profile that contains only the user/item interactions for which the user has written helpful reviews. The final phase applies popular CF techniques to model users’ preferences based on their interaction profiles. We introduce the details of each phase below.
Figure 1. Proposed RHRM framework.

3.1. Phase 1: Review Semantics Extractor

The first phase constructs a CNN–BiLSTM hybrid model to classify review helpfulness information. The architecture of the CNN–BiLSTM hybrid model is shown in Figure 2. This study builds a CNN–BiLSTM hybrid model, which has shown excellent classification performance in NLP studies, to classify review helpfulness [52,53]. A CNN can compress the input features used for prediction, which matters because the correlation between each word and the final classification is not the same for all input words [54,55]. The BiLSTM effectively encodes long-distance word dependencies [52,53]. Owing to these complementary advantages, various hybrid CNN–BiLSTM models have been proposed [40,50,52,53,56]. The CNN–BiLSTM hybrid model applied in this study was motivated by Rai et al. [51] and Liu et al. [50]. Existing models mainly combined a single CNN network with a single BiLSTM network and were used either as regression models predicting numeric values or for multi-class classification problems. Following this common single-model combination strategy, we applied multiple filter kernels and added an attention mechanism layer to extract the review text’s semantic representation more elaborately [47,57]. After generating a review-level semantic representation, the model classifies the helpfulness of each review.
Figure 2. The architecture of CNN–BiLSTM hybrid model with the attention mechanism.
In this study, we are given $R = \{r_1, r_2, \ldots, r_n\}$ as the dataset for constructing a CNN–BiLSTM hybrid model with the attention mechanism. Each review contains five attributes $[P, U, C, M, H]$, where $P$ indicates item features, $U$ indicates reviewer features, $C$ indicates textual features, and $M$ indicates metadata features (e.g., ratings and timestamps). $H$ indicates the helpfulness score, measured as the ratio of helpful votes to total votes, so $H \in [0, 1]$. Let $F$ be an $n \times m$ review feature matrix, where $n$ is the number of reviews in the dataset and $m$ is the total number of features. $Z$ is the vector of predicted labels for all reviews, where $Z_i$ represents whether review $i$ is helpful or not. Finally, we define helpfulness threshold values $\Theta_1$ and $\Theta_2$, so that $Z_i$ is calculated as follows:
$$Z_i = \begin{cases} 1, & \text{if } H_i > \Theta_1 \\ 0, & \text{if } H_i < \Theta_2 \end{cases} \tag{1}$$
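To make the labeling rule concrete, the following minimal Python sketch applies Equation (1), assuming each review carries a helpful-vote and total-vote count; the threshold values 0.9 and 0.2 are those chosen later in Section 4.1.

```python
def label_review(helpful_votes: int, total_votes: int,
                 theta1: float = 0.9, theta2: float = 0.2):
    """Return 1 (helpful), 0 (unhelpful), or None (excluded from training)."""
    if total_votes == 0:
        return None                      # no votes: helpfulness unknown
    h = helpful_votes / total_votes      # helpfulness score H in [0, 1]
    if h > theta1:
        return 1
    if h < theta2:
        return 0
    return None                          # ambiguous reviews are discarded
```

Reviews falling between the two thresholds receive no label and are excluded from the training set, which matches the filtering described in Section 4.1.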
This study constructs a CNN–BiLSTM hybrid model that minimizes the prediction error of Z given F. The trained model is then used to predict the helpfulness of new reviews whose helpfulness scores are unknown.
The CNN–BiLSTM hybrid model consists of three layers. The first layer is word embedding. Let $R_{u,i} = \{w_1, w_2, \ldots, w_n\}$ be the text of the review that user $u$ has written for item $i$, where $n$ is the length of the review. Many existing text-mining models applied one-hot encoding to convert each word into a vector. However, this method suffers from data sparsity: the matrix dimensions become very large, and most of the vector values are zero. In this study, each word in the review is instead converted into a dense vector through a word embedding layer [57]. We apply a word embedding $f: w_n \rightarrow \mathbb{R}^d$ to each word, so each word is represented as a dense vector, and the review text is represented by a matrix $E \in \mathbb{R}^{n \times d}$, where $d$ is the dimension of the word embedding vector.
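A minimal Keras sketch of this step is shown below, using the vocabulary size (80,000) and embedding dimension (300) reported in Section 4; `train_review_texts` and the maximum length are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 80_000, 300, 500   # MAX_LEN is illustrative

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")
tokenizer.fit_on_texts(train_review_texts)           # assumed list of review strings
sequences = tokenizer.texts_to_sequences(train_review_texts)
padded = pad_sequences(sequences, maxlen=MAX_LEN)    # n x MAX_LEN integer matrix

# Maps each word index to a dense d-dimensional vector, giving the
# review matrix E in R^{n x d} described above.
embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
```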
The second layer is a multichannel convolutional layer. It extracts word-level semantic representations from the review text through filters of different sizes. Each filter $K_j$ slides over the input with a sliding window to perform a convolution operation, defined as shown in Equation (2):
$$c_j = \phi(E * K_j + b_j), \tag{2}$$
where $*$ denotes the convolution operator, $K_j \in \mathbb{R}^{k \times m}$ denotes the parameters of the filter kernel with kernel size $k \times m$, $b_j$ is the bias, and $\phi$ is the ReLU activation function, defined as Equation (3):
$$\mathrm{relu}(x) = \max(0, x) \tag{3}$$
We add a max-pooling layer to the output of the convolution operation to retain the main semantics and suppress noise. The max-pooling operation is defined as Equation (4):
$$o_j = \max([c_1, c_2, \ldots, c_{l-t+1}]) \tag{4}$$
This study applies multiple filters of different sizes to extract the various semantic features included in the review. The output of the convolutional layer is then given by Equation (5):
$$O = [o_1, o_2, \ldots, o_n] \tag{5}$$
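A sketch of this layer in Keras follows, with the filter windows 3/4/5 and 100 filters per size from Section 4.3; the `same` padding and pool size are our assumptions so that the three channels can be concatenated into a single sequence for the BiLSTM that follows.

```python
from tensorflow.keras import layers

def multichannel_cnn(embedded):                  # embedded: (batch, MAX_LEN, EMBED_DIM)
    channels = []
    for k in (3, 4, 5):                          # kernel (window) sizes K_j
        c = layers.Conv1D(100, k, padding="same", activation="relu")(embedded)
        o = layers.MaxPooling1D(pool_size=2)(c)  # Equation (4): keep main semantics
        channels.append(o)
    # Equation (5): stack the channel outputs into O = [o_1, o_2, ...]
    return layers.Concatenate(axis=-1)(channels)  # (batch, MAX_LEN/2, 300)
```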
The third layer is an attention network over a BiLSTM. Each vector in the convolutional layer output serves as one time step of the BiLSTM. The BiLSTM consists of two components: a forward LSTM and a backward LSTM. The forward LSTM captures the review semantics from left to right, and the backward LSTM captures the sequence features from right to left. This study defines the outputs of the forward and backward LSTMs as $\overrightarrow{S_t}$ and $\overleftarrow{S_t}$, respectively. We apply the BiLSTM to all terms in the input sequence to obtain two separate hidden state sequences: given the input sequence $o_1, o_2, \ldots, o_n$, the forward LSTM generates hidden states $\overrightarrow{S_1}, \overrightarrow{S_2}, \ldots, \overrightarrow{S_t}$, and the backward LSTM generates hidden states $\overleftarrow{S_1}, \overleftarrow{S_2}, \ldots, \overleftarrow{S_t}$:
$$\overrightarrow{S_t} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{S_{t-1}}, o_t), \quad \overleftarrow{S_t} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{S_{t+1}}, o_t), \quad m = [\overrightarrow{S_l}; \overleftarrow{S_1}] \tag{6}$$
The BiLSTM concatenates the last hidden state of the forward LSTM with the first hidden state of the backward LSTM to generate the final representation; the resulting vector $m$ contains both forward and backward information and thus efficiently captures word order. Finally, to highlight the importance of different words for review helpfulness classification, we add an attention mechanism layer to the CNN–BiLSTM hybrid model to further extract review features and emphasize helpfulness-related information. We use a feed-forward attention mechanism, defined as Equation (7):
$$h_t = \sigma(m_t), \quad a_t = \frac{\exp(h_t)}{\sum_{i=1}^{n} \exp(h_i)}, \quad Q = \sum_{t=1}^{n} a_t m_t \tag{7}$$
where $m_t$ denotes the feature vector output of the BiLSTM layer at time step $t$ and $\sigma$ is the tanh attention activation function. $h_t$ is the generated attention weight, and $a_t$ is the matching score indicating how strongly the model attends to each time step; the softmax function normalizes the scores into attention probabilities. $Q$ is the fused feature representation obtained by weighting each hidden state encoding $m_t$ by its attention probability and summing the results.
The objective of the model is to compute the probability of the helpfulness label from the semantic features extracted from the review and to classify the result, as defined in Equation (8):
$$Y = \theta(W_s Q + b_s), \tag{8}$$
where $\theta$ denotes the sigmoid activation function, $W_s$ the weight matrix, and $b_s$ the bias. Finally, the semantic features of the review are classified as 0 or 1 and returned as output: an output of 0 indicates that the review is unhelpful, and an output of 1 indicates a helpful review.
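Bringing the three layers together, the following Keras sketch assembles Equations (6)–(8) on top of the `embedding` and `multichannel_cnn` pieces sketched above; the attention layer is a minimal implementation of Equation (7), and the hidden size (64), dropout (0.5), and Adam learning rate (0.05) follow Section 4.3.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

class FeedForwardAttention(layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(1, activation="tanh")  # h_t = tanh(w . m_t + b)

    def call(self, m):                        # m: (batch, steps, dim) BiLSTM output
        h = self.score(m)                     # (batch, steps, 1)
        a = tf.nn.softmax(h, axis=1)          # a_t: attention probabilities
        return tf.reduce_sum(a * m, axis=1)   # Q = sum_t a_t m_t

inputs = layers.Input(shape=(MAX_LEN,))
x = embedding(inputs)                         # word embedding layer
x = multichannel_cnn(x)                       # multichannel convolutional layer
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
q = FeedForwardAttention()(x)
q = layers.Dropout(0.5)(q)
outputs = layers.Dense(1, activation="sigmoid")(q)   # Equation (8)

model = Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
              loss="binary_crossentropy", metrics=["accuracy"])
```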

3.2. Phase 2: User Profile Producer

The second phase uses the helpfulness classification results to update user profiles based on the helpful reviews each user has written about items. We apply the CNN–BiLSTM hybrid model constructed in the first phase to classify the helpfulness of new reviews. The updated user profile contains only the user/item interactions for which the user has written helpful reviews. Given a set of new reviews $R' = \{r_1, r_2, \ldots, r_m\}$, each review contains four attributes $[P, U, C, M]$, where $P$ and $U$ indicate item and reviewer features, respectively, $C$ is the textual feature of the new review, and $M$ is the metadata features (e.g., ratings and timestamp). The new reviews form an $N \times M$ review feature matrix, where $N$ is the number of reviews and $M$ is the total number of features. $Y$ is the vector of predictions that the CNN–BiLSTM hybrid model produces for all new reviews, where $Y_{ui}$ represents whether review $r_{ui}$ is helpful or not:
$$Y_{ui} = \begin{cases} 1, & \text{if } r_{ui} \text{ (user } u\text{, item } i\text{) is classified as helpful;} \\ 0, & \text{otherwise,} \end{cases} \tag{9}$$
where $Y_{ui} = 1$ indicates that user $u$ has written a helpful review of item $i$, and $Y_{ui} = 0$ indicates that the review was unhelpful. Finally, we build a new user profile that contains only the interactions whose reviews are classified as helpful.
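A hedged sketch of this profile-update step follows, assuming the new reviews live in a pandas DataFrame with hypothetical `user_id`, `item_id`, `rating`, and `review_text` columns, that `model` is the trained Phase 1 classifier, and that `encode_fn` performs the same tokenization and padding used during training.

```python
import pandas as pd

def build_helpful_profile(reviews: pd.DataFrame, model, encode_fn) -> pd.DataFrame:
    """Keep only user/item interactions whose review is classified helpful."""
    X = encode_fn(reviews["review_text"])        # tokenize + pad as in Phase 1
    y_hat = (model.predict(X) > 0.5).ravel()     # Y_ui in {0, 1}, Equation (9)
    helpful = reviews[y_hat == 1]
    # The new profile keeps only the (user, item, rating) triples of helpful reviews.
    return helpful[["user_id", "item_id", "rating"]]
```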

3.3. Phase 3: Recommendation Generator

To evaluate the performance of the proposed recommendation framework, we predict preference ratings by applying the UBCF, SVD, and NCF models, which are widely used in studies of personalized recommendation services.
The first is the UBCF model, a standard neighborhood-based approach in recommender systems. UBCF measures similarity between users, where $\mathrm{sim}(u, v)$ represents the similarity between user $u$ and user $v$ [58,59]. The goal of this technique is to predict user $u$’s preference rating $\hat{r}_{ui}$ for item $i$. Using the similarity measure, we identify the $k$ users most similar to $u$ who have rated item $i$. The predicted rating is a weighted sum of the neighbors’ mean-centered ratings, defined as follows:
$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)} \tag{10}$$
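For illustration, a small numpy sketch of Equation (10) is given below, assuming a dense ratings matrix with NaN for missing entries; production implementations (e.g., the Surprise package used in Section 4.3) handle sparsity and neighbor selection more carefully.

```python
import numpy as np

def predict_ubcf(R, sim, u, i, k=10):
    """Predict user u's rating of item i from the k most similar raters of i."""
    raters = [v for v in range(R.shape[0]) if v != u and not np.isnan(R[v, i])]
    neighbors = sorted(raters, key=lambda v: -sim[u, v])[:k]   # N_i^k(u)
    means = np.nanmean(R, axis=1)                              # r-bar per user
    num = sum(sim[u, v] * (R[v, i] - means[v]) for v in neighbors)
    den = sum(abs(sim[u, v]) for v in neighbors)
    return means[u] + num / den if den else means[u]
```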
The second is the SVD model. The latent factor approach has gained popularity due to its high accuracy and scalability. This study focuses on methods induced by an SVD of the user–item interaction matrix. The most common approach to estimating the interaction components is the matrix factorization framework [1,12], which relates each user’s latent factor vector to each item’s latent factor vector. Typically, this approach is applied to explicit feedback datasets, with overfitting addressed through regularization. The SVD model is defined as follows:
$$\min_{U, V} \left\| M \odot \left( Y - U V^{\top} \right) \right\|_F^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right), \tag{11}$$
where $U$ and $V$ are the user and item latent factor matrices, respectively, $\lambda$ regularizes the model, $Y$ is the set of available ratings, and $M$ is the binary mask selecting the observed entries.
The third is the NCF model. Traditional latent factor models estimate the user–item relationship with a simple dot product of latent vectors, which limits the results they can produce. To overcome this limitation, the NCF model captures the interaction between the user’s latent vector and the item’s latent vector through a multi-layer perceptron [60,61]. The two latent vectors are fed into the multi-layer perceptron to predict user preferences: the output layer predicts the preference rating, and the model is trained by minimizing the loss between the predicted and actual ratings. The NCF predictive model is defined as follows:
$$\hat{r}_{ui} = f\left( U^{\top} s_u^{\mathrm{user}}, V^{\top} s_i^{\mathrm{item}} \mid U, V, \theta \right), \tag{12}$$
where $s_u^{\mathrm{user}}$ and $s_i^{\mathrm{item}}$ are the two feature vectors that form the input layer, $U$ and $V$ denote the latent factors of the user and item, respectively, and $\theta$ denotes the model parameters.
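A minimal Keras sketch of the NCF predictor in Equation (12) follows: user and item IDs are embedded into latent vectors and passed through an MLP. The MLP layer sizes are illustrative assumptions; the latent dimension is varied over {8, 16, 32, 64, 128} in Section 4.3.

```python
from tensorflow.keras import layers, Model

def build_ncf(n_users: int, n_items: int, latent_dim: int = 32) -> Model:
    u_in, i_in = layers.Input(shape=(1,)), layers.Input(shape=(1,))
    u_vec = layers.Flatten()(layers.Embedding(n_users, latent_dim)(u_in))
    i_vec = layers.Flatten()(layers.Embedding(n_items, latent_dim)(i_in))
    x = layers.Concatenate()([u_vec, i_vec])       # interaction input to the MLP
    for units in (64, 32, 16):                     # the multi-layer perceptron f(.)
        x = layers.Dense(units, activation="relu")(x)
    r_hat = layers.Dense(1)(x)                     # predicted preference rating
    model = Model([u_in, i_in], r_hat)
    model.compile(optimizer="adam", loss="mse")    # minimize rating prediction loss
    return model
```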

4. Experiments

4.1. Dataset Overview

We used the publicly accessible Amazon Book dataset (http://jmcauley.ucsd.edu/data/amazon/, accessed on 1 May 2021) to evaluate the performance of the proposed RHRM framework [62,63]. The original dataset was collected from May 1996 to July 2014 and contains 8,872,495 reviews from 817,789 users on 562,073 items. Table 2 displays an example of the attribute information in the Amazon Book dataset. Each review contains (1) the ID and name of the reviewer, (2) the ID of the reviewed item, (3) helpfulness information, including the numbers of helpful and unhelpful votes, (4) rating information, (5) a summary review and a detailed review of the item, and (6) the review publication time.
Table 2. An example of the review composition in the Amazon Book dataset.
To conduct the experiments effectively, we built the CNN–BiLSTM hybrid model using a dataset (DS1) collected from May 1996 to December 2011, which contains 2,757,812 reviews from 281,661 users on 223,452 items. To evaluate the performance of the proposed recommendation framework, we used a dataset (DS2) collected from January 2012 to July 2014, which contains 6,114,683 reviews from 536,128 users on 338,621 items. The descriptive statistics of the two datasets are summarized in Table 3.
Table 3. Descriptive statistics of the two datasets.
Among the reviews in DS1, only those voted on by at least 10 users as helpful or unhelpful are used as the training dataset for helpfulness classification [17,64]. Following the common strategy of existing studies, we measured the helpfulness score as the ratio of helpful votes to total votes. The distribution of the measured helpfulness scores is depicted in Figure 3. To separate helpful from unhelpful reviews more cleanly, we kept only highly helpful reviews ($H > \Theta_1 = 0.9$) and clearly unhelpful reviews ($H < \Theta_2 = 0.2$) in the training dataset. Figure 4 shows examples of helpful and unhelpful reviews. With this filtered dataset, we train binary models for review helpfulness classification. DS2 is large but highly sparse, so we filtered it to contain only users with at least 20 interactions [60].
Figure 3. Distributions of helpfulness scores.
Figure 4. Examples of helpful reviews and unhelpful reviews.
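The filtering steps above can be expressed compactly in pandas; this is a sketch assuming DataFrames `ds1` and `ds2` with hypothetical `helpful_votes`, `total_votes`, and `user_id` columns (the raw Amazon dump stores votes as a [helpful, total] pair per review).

```python
# DS1: keep reviews with at least 10 votes, then label by the thresholds.
voted = ds1[ds1["total_votes"] >= 10].copy()
voted["score"] = voted["helpful_votes"] / voted["total_votes"]
train = voted[(voted["score"] > 0.9) | (voted["score"] < 0.2)].copy()
train["label"] = (train["score"] > 0.9).astype(int)   # 1 helpful, 0 unhelpful

# DS2: keep only users with at least 20 interactions to reduce sparsity.
counts = ds2["user_id"].value_counts()
ds2_dense = ds2[ds2["user_id"].isin(counts[counts >= 20].index)]
```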

4.2. Evaluation Protocols

To evaluate the classification performance of the CNN–BiLSTM hybrid model, we experimented with DS1 and adopted Accuracy, Precision, Recall, and F1-score as metrics. To evaluate the prediction performance of the proposed recommendation framework, we experimented with DS2 and adopted the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) metrics. We used 80% of each dataset for training and measured performance on the remaining 20% [12,58,59].
First, to evaluate the classification performance of the CNN–BiLSTM hybrid model, we adopted Accuracy, Precision, Recall, and F1-score as metrics based on the confusion matrix shown in Table 4. Accuracy is the most widely used metric for classification and represents the proportion of helpful and unhelpful reviews that are correctly classified among all classification results. Precision represents the proportion of reviews classified as helpful by the model that are actually helpful. Recall represents the proportion of actually helpful reviews that the model classifies as helpful. The F1-score is the harmonic mean of precision and recall; a higher F1-score indicates better classification ability. Accuracy, Precision, Recall, and F1-score are defined in Equations (13)–(16):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \tag{13}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{14}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{15}$$
$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{16}$$
Table 4. Confusion matrix example for evaluating the performance of helpfulness classification.
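Equations (13)–(16) can be computed directly with scikit-learn; a sketch assuming binary label arrays `y_true` and `y_pred` from the helpfulness classifier:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

acc = accuracy_score(y_true, y_pred)     # Equation (13)
prec = precision_score(y_true, y_pred)   # Equation (14): TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # Equation (15): TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # Equation (16): harmonic mean
```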
MAE and RMSE are statistical accuracy metrics that evaluate prediction performance by comparing predicted and actual ratings, as defined in Equations (17) and (18) [7,10]. MAE weights all errors equally regardless of their magnitude, whereas RMSE gives relatively higher weight to large errors between actual and predicted ratings. Lower values indicate more accurate recommendation predictions.
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_{ui} - \hat{y}_{ui} \right| \tag{17}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_{ui} - \hat{y}_{ui} \right)^2} \tag{18}$$
where $N$ is the size of the test dataset, $\hat{y}_{ui}$ is the predicted rating, and $y_{ui}$ is the actual rating given by user $u$ to item $i$.
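Both error metrics are one-liners in numpy; a sketch assuming arrays of actual and predicted ratings over the test set:

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))            # Equation (17)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))    # Equation (18)
```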

4.3. Parameter Settings

For review text preprocessing, we applied the NLTK (Natural Language Toolkit) package to remove stopwords, special characters, symbols, numbers, etc., from the reviews [65,66]. For training the CNN–BiLSTM hybrid model, we set the embedding dimension to 300, the filter windows to 3, 4, and 5, the number of filters to 100, and the number of hidden units in the BiLSTM to 64. To mitigate overfitting, the dropout rate was set to 0.5, the batch size to 50, and the number of epochs to 100. We used the Adam optimizer, which is widely applied in previous studies, with a learning rate of 0.05 [67]. We tested both the average and the maximum review length and several vocabulary sizes [68], then chose the optimal review length and vocabulary size based on classification performance. The same parameters were applied to the baseline algorithms for comparison. For the CF algorithm, the Pearson correlation coefficient measures similarity between users, and the neighbor size is varied from 1 to 100. We set the latent factor sizes of the SVD and NCF techniques to 8, 16, 32, 64, and 128 [60]. Experiments were conducted using the TensorFlow, Keras, and Surprise packages on a computer with an Intel Core i9-9900KF CPU, 64 GB of memory, and a GeForce RTX 2080 Ti GPU.
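As a hedged illustration of the CF experiments with the Surprise package named above, the sketch below runs user-based CF with Pearson similarity (KNNWithMeans matches the mean-centered form of Equation (10)) and SVD; the `profile` DataFrame and its column names are assumptions, and the 80/20 split mirrors Section 4.2.

```python
from surprise import Dataset, Reader, KNNWithMeans, SVD, accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(profile[["user_id", "item_id", "rating"]], reader)
trainset, testset = train_test_split(data, test_size=0.2)

ubcf = KNNWithMeans(k=10, sim_options={"name": "pearson", "user_based": True})
svd = SVD(n_factors=8)
for algo in (ubcf, svd):
    predictions = algo.fit(trainset).test(testset)
    accuracy.mae(predictions)    # Equation (17)
    accuracy.rmse(predictions)   # Equation (18)
```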

4.4. Experimental Result

4.4.1. Review Helpfulness Classification Performance Comparison

In this section, we first study the effect of vocabulary size and review length on the classification performance of the CNN–BiLSTM hybrid model. To retain the main semantics and suppress noise, we ran several experiments with vocabulary sizes from 20,000 to 104,702, and for each vocabulary size we tested both the maximum and the average review length. We then set the vocabulary size and review length that train the CNN–BiLSTM hybrid model most efficiently. Each experiment was run five times, and we report the mean and standard deviation of the classification performance. Table 5 shows these results for the different vocabulary sizes and review lengths. We found that performance worsened when the vocabulary size was set too high, so an appropriately sized vocabulary should be used, and that the maximum review length improved performance. As a result, the optimal setting was a vocabulary size of 80,000 with the maximum review length, which we used to compare classification performance with the baseline models.
Table 5. Mean and standard deviation of classification performance for different vocabulary sizes and review lengths for the CNN–BiLSTM hybrid model.
With the optimal vocabulary size and review length fixed, we compared the CNN–BiLSTM hybrid model with the baselines to evaluate classification performance. We ran each experiment five times; the mean and standard deviation of the classification performance are shown in Figure 5. The CNN–BiLSTM hybrid model outperforms the baseline models with an accuracy of 86.71% and an F1-score of 86.43%. Although the single CNN model achieves a good classification effect, the other deep learning models perform better than CNN. Compared with the single CNN and BiLSTM models, the CNN–BiLSTM hybrid model demonstrates the advantage of combined networks in semantic representation extraction, because the CNN’s word-level features are reprocessed by the BiLSTM. We also find that adding an attention mechanism to the combined model effectively enhances classification performance: the attention mechanism helps the model learn essential features by assigning different weights and learning the differences between features.
Figure 5. Performance comparison of classification techniques.

4.4.2. Prediction Performance Comparison Based on Helpful Review Filtering

This section evaluates the effectiveness of the framework proposed in this study. First, we classified whether new user-written reviews were helpful using the CNN–BiLSTM hybrid model. Then, we produced a new user profile by filtering only the helpful reviews. Figure 6, Figure 7 and Figure 8 compare the prediction performance of the existing recommendation methodology and the proposed RHRM framework using the UBCF, SVD, and NCF techniques, respectively, where “Existing” denotes a traditional recommendation methodology that produces user profiles including all reviews, and “Proposal” denotes the proposed framework that produces user profiles including only helpful reviews. We varied the neighbor size from 1 to 100 to evaluate its effect on the UBCF technique, and set the latent factor sizes of the SVD and NCF techniques to 8, 16, 32, 64, and 128. The MAE and RMSE metrics measure the error between predicted and actual ratings.
Figure 6. Prediction performance of MAE (a) and RMSE (b) of UBCF model.
Figure 7. Prediction performance of MAE (a) and RMSE (b) of SVD model.
Figure 8. Prediction performance of MAE (a) and RMSE (b) of NCF model.
The experimental results show that the prediction performance of the proposed recommendation framework improves regardless of the neighbor size and the number of latent factors. For the UBCF technique, both MAE and RMSE showed strong prediction performance regardless of the neighbor size, with the best performance at a neighbor size of 10. The SVD and NCF techniques performed best with 8 and 32 latent factors, respectively. Compared to the existing methodology, the proposed framework improved MAE by 14.95% (UBCF), 14.99% (SVD), and 22.08% (NCF), and RMSE by 15.38% (UBCF), 16.58% (SVD), and 21.59% (NCF). These experiments show that producing user profiles from only helpful reviews yields better prediction performance than the existing methodology; reflecting review helpfulness information in personalized recommendation services can therefore improve recommender system performance. We further conducted two-sample t-tests, shown in Table 6, which confirm that all improvements are statistically significant at p < 0.01.
Table 6. Two-sample t-tests between the existing and proposed frameworks across recommendation method types.

5. Conclusions

5.1. Discussion

We propose a novel RHRM recommendation framework that filters only helpful reviews and reflects them in the personalized recommendation service. To achieve our study objective, we built a CNN–BiLSTM hybrid model, which has demonstrated excellent classification performance in NLP studies, to filter helpful reviews. We evaluated the performance of the proposed recommendation framework using the UBCF, SVD, and NCF techniques, which are widely used in recommender systems studies, on a large, publicly accessible Amazon dataset [62,63]. The experimental results show that the RHRM framework outperforms the existing recommendation framework, which disregards review helpfulness. The results also suggest that review helpfulness information can significantly impact user preference ratings; in other words, high-quality user reviews can be more reliable than the preference ratings users provide [14]. Furthermore, we identified that the CNN–BiLSTM hybrid model used in this study outperforms single deep learning models such as CNN and BiLSTM, demonstrating the advantages of the hybrid model in semantic feature extraction. We also identified that the classification performance of the CNN–BiLSTM hybrid model depends on the vocabulary size and review length used in model training: various experiments showed that the model performs best with a vocabulary size of 80,000 and the maximum review length. Using the entire vocabulary as training data introduces noise features that are insignificant to the analysis, which increases computational cost and time and reduces classification performance [17].

5.2. Theoretical Contributions and Practical Implications

In this study, we enhanced recommendation performance by analyzing review helpfulness information through deep learning techniques and reflecting it in recommender systems. The theoretical implications of this study are as follows. First, existing studies on personalized recommendation services used all reviews of an item to extract sentiment features and reflect them in the recommender system. However, user-written reviews include advertisements, falsehoods, and reviews with irrelevant content [69]; if reviews are irrelevant to items and unhelpful to users, they can reduce recommendation performance. We therefore proposed a recommendation framework that classifies review helpfulness information and reflects it in recommender systems, and we improved recommender system performance by using this information. This result extends the scope of research on personalized recommendation services. Second, to evaluate the recommender system’s performance with respect to review helpfulness, we compared results that consider review helpfulness with results that do not. The experimental results showed that recommendation performance was higher when review helpfulness information was considered. Therefore, besides features, price, and users’ sentiment, review helpfulness information is essential in purchase decision making. Furthermore, objective information such as the number of review helpfulness votes influences users’ preferences more than subjective user-written reviews.
The practical implications of this study are as follows. First, we proposed a recommendation framework that classifies review helpfulness information and reflects it in personalized recommendation services, and our experiments show that considering helpful reviews can enhance recommendation performance over traditional methods. Most e-commerce websites provide a module for writing reviews of purchased items, yet few reflect the helpfulness information contained in those reviews. Services that evaluate review helpfulness are therefore needed; for example, when a review receives a high helpfulness score, rewarding the reviewer with mileage points or coupons can increase the informational value of the item’s reviews. Second, most e-commerce websites have focused on item reviews and encouraged users to write them. We found that review quality matters more than review quantity when providing personalized recommendation services; rather than simply increasing the number of reviews, a strategy is needed that encourages users to write high-quality reviews. Finally, the proposed recommendation framework can be applied to various e-commerce domains that provide review helpfulness information. This enables websites to build more sophisticated recommendation services that provide decision support in many areas, including marketing and user management. E-commerce websites can thereby increase user convenience and satisfaction and expect sales growth.

5.3. Limitations and Future Study

We classified review helpfulness information with the CNN–BiLSTM hybrid model and then conducted experiments based on the proposed recommendation framework to evaluate recommendation performance. The limitations of this study are as follows. First, we only used the publicly accessible Amazon book dataset and built the CNN–BiLSTM hybrid model on all of it without distinguishing book categories, although users may have different preferences depending on the category. Future studies should classify book categories and measure the additional recommendation effects, and should evaluate the proposed framework on datasets from multiple domains. Second, to classify review helpfulness information, we applied a CNN–BiLSTM hybrid model that has shown excellent performance in NLP studies. Recently, the BERT, ELECTRA, and GPT-3 models have also shown excellent NLP performance, so future studies should compare the performance of multiple deep learning models. Third, our framework builds user profiles using only review helpfulness information; considering item features, purchase history, and other information could further improve recommendation performance. Finally, reviews written earlier tend to receive more helpful votes than those written later, which may create a sequential bias problem, so future studies should consider the dates on which reviews were written.

Author Contributions

Conceptualization, J.K. and Q.L.; methodology, Q.L. and X.L.; data curation, Q.L. and X.L.; writing—original draft preparation, Q.L. and B.L.; writing—review and editing, Q.L. and J.K.; supervision, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the BK21 FOUR Program (5199990913932) funded by the Ministry of Education (MOE, Korea) and the National Research Foundation of Korea (NRF).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data are available on http://jmcauley.ucsd.edu/data/amazon/ (accessed on 1 May 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lu, J.; Wu, D.; Mao, M.; Wang, W.; Zhang, G. Recommender system application developments: A survey. Decis. Support Syst. 2015, 74, 12–32. [Google Scholar] [CrossRef]
  2. Bobadilla, J.; Ortega, F.; Hernando, A.; Gutiérrez, A. Recommender systems survey. Knowl.-Based Syst. 2013, 46, 109–132. [Google Scholar] [CrossRef]
  3. Das, A.S.; Datar, M.; Garg, A.; Rajaram, S. Google news personalization: Scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007; pp. 271–280. [Google Scholar]
  4. Linden, G.; Smith, B.; York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Comput. 2003, 7, 76–80. [Google Scholar] [CrossRef]
  5. Bennett, J.; Lanning, S. The netflix prize. In Proceedings of the KDD Cup and Workshop, San Jose, CA, USA, 12 August 2007; p. 35. [Google Scholar]
  6. Lee, D.; Hosanagar, K. How do recommender systems affect sales diversity? A cross-category investigation via randomized field experiment. Inf. Syst. Res. 2019, 30, 239–259. [Google Scholar] [CrossRef]
  7. Kim, J.; Choi, I.; Li, Q. Customer satisfaction of recommender system: Examining accuracy and diversity in several types of recommendation approaches. Sustainability 2021, 13, 6165. [Google Scholar] [CrossRef]
  8. Kim, H.K.; Ryu, Y.U.; Cho, Y.; Kim, J.K. Customer-driven content recommendation over a network of customers. IEEE Trans. Syst. Man Cybern.-Part A Syst. Hum. 2011, 42, 48–56. [Google Scholar] [CrossRef]
  9. Choi, K.; Yoo, D.; Kim, G.; Suh, Y. A hybrid online-product recommendation system: Combining implicit rating-based collaborative filtering and sequential pattern analysis. Electr. Commer. Res. Appl. 2012, 11, 309–317. [Google Scholar] [CrossRef]
  10. Park, D.H.; Kim, H.K.; Choi, I.Y.; Kim, J.K. A literature review and classification of recommender systems research. Expert Syst. Appl. 2012, 39, 10059–10072. [Google Scholar] [CrossRef]
  11. Kim, H.K.; Kim, J.K.; Ryu, Y.U. Personalized recommendation over a customer network for ubiquitous shopping. IEEE Trans. Serv. Comput. 2009, 2, 140–151. [Google Scholar] [CrossRef]
  12. Ricci, F.; Rokach, L.; Shapira, B. Introduction to recommender systems handbook. In Recommender Systems Handbook; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1–35. [Google Scholar]
  13. Li, X.; Wang, M.; Liang, T.-P. A multi-theoretical kernel-based approach to social network-based recommendation. Decis. Support Syst. 2014, 65, 95–104. [Google Scholar] [CrossRef]
  14. Qiu, L.; Gao, S.; Cheng, W.; Guo, J. Aspect-based latent factor model by integrating ratings and reviews for recommender system. Knowl.-Based Syst. 2016, 110, 233–243. [Google Scholar] [CrossRef]
  15. Moore, S.G. Attitude predictability and helpfulness in online reviews: The role of explained actions and reactions. J. Consum. Res. 2015, 42, 30–44. [Google Scholar] [CrossRef]
  16. Srifi, M.; Oussous, A.; Ait Lahcen, A.; Mouline, S. Recommender systems based on collaborative filtering using review texts—A survey. Information 2020, 11, 317. [Google Scholar] [CrossRef]
  17. Ge, S.; Qi, T.; Wu, C.; Wu, F.; Xie, X.; Huang, Y. Helpfulness-aware review based neural recommendation. CCF Trans. Pervasive Comput. Interact. 2019, 1, 285–295. [Google Scholar] [CrossRef]
  18. Hu, Y.-H.; Chen, Y.-L.; Chou, H.-L. Opinion mining from online hotel reviews–a text summarization approach. Inf. Process. Manag. 2017, 53, 436–449. [Google Scholar] [CrossRef]
  19. Kaushik, K.; Mishra, R.; Rana, N.P.; Dwivedi, Y.K. Exploring reviews and review sequences on e-commerce platform: A study of helpful reviews on Amazon.in. J. Retail. Consum. Serv. 2018, 45, 21–32. [Google Scholar] [CrossRef]
  20. Castelli, M.; Manzoni, L.; Vanneschi, L.; Popovič, A. An expert system for extracting knowledge from customers’ reviews: The case of Amazon.com, Inc. Expert Syst. Appl. 2017, 84, 117–126. [Google Scholar] [CrossRef]
  21. Na, H.; Nam, K. Application of diversity of recommender system according to user preference change. J. Intell. Inf. Syst. 2020, 26, 67–86. [Google Scholar]
  22. Paradarami, T.K.; Bastian, N.D.; Wightman, J.L. A hybrid recommender system using artificial neural networks. Expert Syst. Appl. 2017, 83, 300–313. [Google Scholar] [CrossRef]
  23. Kim, H.K.; Oh, H.Y.; Gu, J.C.; Kim, J.K. Commenders: A recommendation procedure for online book communities. Electron. Commer. Res. Appl. 2011, 10, 501–509. [Google Scholar] [CrossRef]
  24. Lee, Y.; Won, H.; Shim, J.; Ahn, H. A hybrid collaborative filtering-based product recommender system using search keywords. J. Intell. Inf. Syst. 2020, 26, 151–166. [Google Scholar]
  25. Su, X.; Khoshgoftaar, T.M. A survey of collaborative filtering techniques. Adv. Artif. Intell. 2009, 2009, 1–19. [Google Scholar] [CrossRef]
  26. Al-Bashiri, H.; Abdulgabber, M.A.; Romli, A.; Kahtan, H. An improved memory-based collaborative filtering method based on the TOPSIS technique. PLoS ONE 2018, 13, e0204434. [Google Scholar] [CrossRef]
  27. Elahi, M.; Ricci, F.; Rubens, N. A survey of active learning in collaborative filtering recommender systems. Comput. Sci. Rev. 2016, 20, 29–50. [Google Scholar] [CrossRef]
  28. Breese, J.S.; Heckerman, D.; Kadie, C. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, CA, USA, 24–26 July 1998; pp. 43–52. [Google Scholar]
  29. Isinkaye, F.O.; Folajimi, Y.O.; Ojokoh, B.A. Recommendation systems: Principles, methods and evaluation. Egypt. Inform. J. 2015, 16, 261–273. [Google Scholar] [CrossRef]
  30. Bang, H.; Lee, H.; Lee, J.-H. TV Program recommender system using viewing time patterns. J. Korean Inst. Intell. Syst. 2015, 25, 431–436. [Google Scholar] [CrossRef]
  31. Guy, I.; Mejer, A.; Nus, A.; Raiber, F. Extracting and ranking travel tips from user-generated reviews. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 987–996. [Google Scholar]
  32. Leung, C.W.; Chan, S.C.; Chung, F.-l. Integrating collaborative filtering and sentiment analysis: A rating inference approach. In Proceedings of the ECAI 2006 Workshop on Recommender Systems, Riva del Garda, Italy, 28 August–1 September 2006; pp. 62–66. [Google Scholar]
  33. García-Cumbreras, M.Á.; Montejo-Ráez, A.; Díaz-Galiano, M.C. Pessimists and optimists: Improving collaborative filtering through sentiment analysis. Expert Syst. Appl. 2013, 40, 6758–6765. [Google Scholar] [CrossRef]
  34. Zhang, Z.; Zhang, D.; Lai, J. urCF: User review enhanced collaborative filtering. In Proceedings of the 20th Americas Conference on Information Systems, Savannah, GA, USA, 7–9 August 2014; pp. 1–14. [Google Scholar]
  35. Zhou, L.; Chaovalit, P. Ontology-supported polarity mining. J. Am. Soc. Inf. Sci. Technol. 2008, 59, 98–110. [Google Scholar] [CrossRef]
  36. Jeon, B.; Ahn, H. A collaborative filtering system combined with users’ review mining: Application to the recommendation of smartphone apps. J. Intell. Inf. Syst. 2015, 21, 1–18. [Google Scholar]
  37. Hyun, J.; Ryu, S.; Lee, S.-Y.T. How to improve the accuracy of recommendation systems: Combining ratings and review texts sentiment scores. J. Intell. Inf. Syst. 2019, 25, 219–239. [Google Scholar]
  38. Cheng, Z.; Ding, Y.; Zhu, L.; Kankanhalli, M. Aspect-aware latent factor model: Rating prediction with ratings and reviews. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 639–648. [Google Scholar]
  39. Gan, C.; Feng, Q.; Zhang, Z. Scalable multi-channel dilated CNN–BiLSTM model with attention mechanism for Chinese textual sentiment analysis. Future Gener. Comput. Syst. 2021, 118, 297–309. [Google Scholar] [CrossRef]
  40. Deng, J.; Cheng, L.; Wang, Z. Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification. Comput. Speech Lang. 2021, 68, 101182. [Google Scholar] [CrossRef]
  41. Stojanovski, D.; Strezoski, G.; Madjarov, G.; Dimitrovski, I.; Chorbev, I. Deep neural network architecture for sentiment analysis and emotion identification of Twitter messages. Multimed. Tools Appl. 2018, 77, 32213–32242. [Google Scholar] [CrossRef]
  42. Song, Y.; Hu, Q.V.; He, L. P-CNN: Enhancing text matching with positional convolutional neural network. Knowl.-Based Syst. 2019, 169, 67–79. [Google Scholar] [CrossRef]
  43. Abdi, A.; Shamsuddin, S.M.; Hasan, S.; Piran, J. Deep learning-based sentiment classification of evaluative text based on Multi-feature fusion. Inf. Process. Manag. 2019, 56, 1245–1259. [Google Scholar] [CrossRef]
  44. Rao, G.; Huang, W.; Feng, Z.; Cong, Q. LSTM with sentence representations for document-level sentiment classification. Neurocomputing 2018, 308, 49–57. [Google Scholar] [CrossRef]
  45. Hassan, A.; Mahmood, A. Efficient deep learning model for text classification based on recurrent and convolutional layers. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 1108–1113. [Google Scholar]
  46. Hassan, A.; Mahmood, A. Convolutional recurrent deep learning model for sentence classification. IEEE Access 2018, 6, 13949–13957. [Google Scholar] [CrossRef]
  47. Liu, G.; Guo, J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 2019, 337, 325–338. [Google Scholar] [CrossRef]
  48. Batbaatar, E.; Li, M.; Ryu, K.H. Semantic-emotion neural network for emotion recognition from text. IEEE Access 2019, 7, 111866–111878. [Google Scholar] [CrossRef]
  49. Zheng, J.; Zheng, L. A hybrid bidirectional recurrent convolutional neural network attention-based model for text classification. IEEE Access 2019, 7, 106673–106685. [Google Scholar] [CrossRef]
  50. Liu, Z.-x.; Zhang, D.-g.; Luo, G.-z.; Lian, M.; Liu, B. A new method of emotional analysis based on CNN–BiLSTM hybrid neural network. Clust. Comput. 2020, 23, 2901–2913. [Google Scholar] [CrossRef]
  51. Rai, A.; Shrivastava, A.; Jana, K.C. A CNN-BiLSTM based deep learning model for mid-term solar radiation prediction. Int. Trans. Electr. Energy Syst. 2020, 31, e12664. [Google Scholar] [CrossRef]
  52. Jang, B.; Kim, M.; Harerimana, G.; Kang, S.-u.; Kim, J.W. Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism. Appl. Sci. 2020, 10, 5841. [Google Scholar] [CrossRef]
  53. Rhanoui, M.; Mikram, M.; Yousfi, S.; Barzali, S. A CNN-BiLSTM model for document-level sentiment analysis. Mach. Learn. Knowl. Extr. 2019, 1, 832–847. [Google Scholar] [CrossRef]
  54. Cao, R.; Zhang, X.; Wang, H. A review semantics based model for rating prediction. IEEE Access 2019, 8, 4714–4723. [Google Scholar] [CrossRef]
  55. Mitra, S.; Jenamani, M. Helpfulness of online consumer reviews: A multi-perspective approach. Inf. Process. Manag. 2021, 58, 102538. [Google Scholar] [CrossRef]
  56. Chen, T.; Xu, R.; He, Y.; Wang, X. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Syst. Appl. 2017, 72, 221–230. [Google Scholar] [CrossRef]
  57. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014. [Google Scholar]
  58. Ekstrand, M.D.; Riedl, J.T.; Konstan, J.A. Collaborative Filtering Recommender Systems; Now Publishers Inc.: Norwell, MA, USA, 2011. [Google Scholar]
  59. Herlocker, J.L.; Konstan, J.A.; Terveen, L.G.; Riedl, J.T. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. (TOIS) 2004, 22, 5–53. [Google Scholar] [CrossRef]
  60. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182. [Google Scholar]
  61. Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep learning based recommender system: A survey and new perspectives. ACM Comput. Surv. (CSUR) 2019, 52, 1–38. [Google Scholar] [CrossRef]
  62. He, R.; McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, Montreal, QC, Canada, 11–15 April 2016; pp. 507–517. [Google Scholar]
  63. McAuley, J.; Targett, C.; Shi, Q.; Van Den Hengel, A. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 43–52. [Google Scholar]
  64. Liu, Y.; Huang, X.; An, A.; Yu, X. Modeling and predicting the helpfulness of online reviews. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 443–452. [Google Scholar]
  65. Park, S.; Woo, J. Gender classification using sentiment analysis and deep learning in a health web forum. Appl. Sci. 2019, 9, 1249. [Google Scholar] [CrossRef]
  66. Yoo, S.; Song, J.; Jeong, O. Social media contents based sentiment analysis and prediction system. Expert Syst. Appl. 2018, 105, 102–111. [Google Scholar] [CrossRef]
  67. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  68. Yang, L.; Li, Y.; Wang, J.; Sherratt, R.S. Sentiment analysis for E-commerce product reviews in Chinese based on sentiment lexicon and deep learning. IEEE Access 2020, 8, 23522–23530. [Google Scholar] [CrossRef]
  69. Saumya, S.; Singh, J.P. Detection of spam reviews: A sentiment analysis approach. CSI Trans. ICT 2018, 6, 137–148. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
