Joint Deep Model with Multi-Level Attention and Hybrid-Prediction for Recommendation

The Recommender System (RS) has obtained a pivotal role in e-commerce. To improve the performance of RS, review text information has been extensively utilized. However, it is still a challenge for RS to extract the most informative feature from a tremendous amount of reviews. Another significant issue is the modeling of user–item interaction, which is rarely considered to capture high- and low-order interactions simultaneously. In this paper, we design a multi-level attention mechanism to learn the usefulness of reviews and the significance of words by Deep Neural Networks (DNN). In addition, we develop a hybrid prediction structure that integrates Factorization Machine (FM) and DNN to model low-order user–item interactions as in FM and capture the high-order interactions as in DNN. Based on these two designs, we build a Multi-level Attentional and Hybrid-prediction-based Recommender (MAHR) model for recommendation. Extensive experiments on Amazon and Yelp datasets showed that our approach provides more accurate recommendations than the state-of-the-art recommendation approaches. Furthermore, the verification experiments and explainability study, including the visualization of attention modules and the review-usefulness prediction test, also validated the reasonability of our multi-level attention mechanism and hybrid prediction.


Introduction
With dramatically increasing information available online, the Recommender System (RS) is playing an increasingly important role in many customer-oriented applications and services to alleviate information overload. Facing information explosion, Internet users rely on RS to filter out uninteresting information and direct access helpful information in classical application scenarios such as listening music, watching movie and online shopping. Meanwhile, recommender systems can also bring traffic and make profit for online companies. For example, it is reported in [1] that recommender systems in Netflix accounted for about 80% of movies watched and generated immense economic value of more than $1 billion (this paper is an extended version to our previous conference publication [2]).
The key of recommender systems lays in the modeling of users' preference [3]. Collaborative Filtering (CF) plays a pivotal role in modern recommender systems and many promising approaches in the literature adopt the idea of CF to predict the users' preference [4]. The basic assumption of CF is that the users who have similar behavior such as historical clicking or rating tend to share similar preference and make the same choice. As a representative approach in CF, Probabilistic Matrix Factorization (PMF) has become the most famous and popular technique in RS [5]. PMF projects the uses and items into a common latent space and constructs the latent vector through factorization of rating matrix. The elements of a latent vector can be seen as contributions of underlying factors to the preference of a user or the property of an item. In PMF, the preference of users on items is modeled as the inner product of corresponding vectors.
Although the CF is a clever and compact idea, there are still several significant challenges in practical scenarios [6]. The first one comes from the sparsity of rating matrix [7]. The sparsity problem arises from the imbalance between the total number of items and the average number of items that a user can rate. Since there is plenty of missing value on the rating matrix, it is perplexing for CF to make prediction and give recommendation. In addition, CF techniques show weak performance in the ability to provide explanation. Some research [8,9] has pointed out that it is beneficial for RS to give some explanation for the sake of persuasion and convenience. Credible explanations on the recommended items can increase the users' confidence and help them make decision more efficiently. Another dilemma is caused by the model of the interaction between users' and items' latent vectors. The traditional factorization approaches in RS, including Matrix Factorization (MF) [10], Factorization Machine (FM) [11] and Pairwise Interaction Tensor Factorization (PITF) [12], apply an inner product to model the pair-wise interaction between latent vectors and make rating prediction. However, as shown in [13,14], the simple inner product may miss high-order and non-linear parts in user-item interactions and lead to a deviation from original rating.
To address the sparsity problem and improve the explainability of RS, we explore the use of review text in RS. On most online commerce websites, users are encouraged to not only give a numerical rating on items but also write text reviews. The reviews contain rich information that can reflect the property and feature of items to some extent, which can give some suggestions and has certain reference significance to other users [15]. Recently, some studies have reported that importing the review text into RS can mitigate the effect of sparsity problem and strengthen the ability of providing explanation [16,17].
There are many models to extract the feature vector of review text in RS. However, it is still a challenge and significant issue for RS to extract the most informative feature from a tremendous amount of reviews. Existing RS models utilize reviews to boost the model of latent factor [17][18][19][20] and give explanation [21][22][23]. Although these methods have achieved impressive improvement, they still have some limitations in the design of attention mechanism. First, they do not model the contribution of each review and that of words in a single review text jointly. Some methods focus on the extraction of keywords and phrases but miss the contribution of each review [18,19], while some approaches only consider the usefulness of reviews [21]. Second, these methods mainly generate explanations through simple extractions of keywords or reviews, which may mislead users.
In real life, people are not only attracted to the keywords or phrases when reading review text but also focus on the useful reviews, especially those reflecting the property of items in plenty of reviews. Inspired by this common phenomenon, we design a multi-level attention mechanism for RS to extract most helpful information from all related reviews. The weights of words and reviews are defined as the informativeness and the usefulness of themselves. The weights of reviews is a extension of the definition of usefulness in [21] and the weights of words is an extension of the weights in [23]. To provide helpful information about items and give suggestions based on the interest of users, the weights are learned according to the preference of users or the property of items in two deep neural networks.
To capture the high-order and non-linear user-item interaction and avoid a deviation from original rating, we combine Deep Neural Network with FM [11]. As a popular method to extract feature representation, deep neural networks show the potential to handle the sophisticated feature interaction in the prediction of click-through rate (CTR) [14,[24][25][26]. FM captures pair-wise interactions between latent factor vectors and shows impressive results. We extend the coupling of FM and deep neural networks in CTR [14,26] to the prediction layer of our RS model to capture both high-order and low-order interactions.
In this paper, we propose a novel Multi-level Attentional and Hybrid-prediction-based Recommender (MAHR) model. Specifically, we apply both word level and full-text level attention mechanisms on reviews through two deep neural networks, which extract feature vectors for each user and item. The word level attention mechanism is realized by a CNN text processor with a word-level attention layer, which learns the weights of words from a slip window. The full-text level attention mechanism applies a two-layered neural network with softmax function to measure the weights of reviews. In the prediction layer, we integrate FM and deep neural networks to model low-order user-item interactions as in FM and capture the high-order interactions as in deep neural networks. Experimental results show that our MAHR model consistently outperforms state-of-the-art methods. In addition, the results also show that and MAHR alleviates the sparsity problem and improve the explainability of RS.
The main contributions of this paper are: • We propose a Multi-level Attentional and Hybrid-prediction-based Recommender (MAHR) model. To the best of our knowledge, we are the first to introduce a multi-level attention mechanism in RS, which alleviates the sparsity problem and improve the explainability of RS.

•
In prediction layer, we hybrid FM and DNN to overcome the shortcomings of traditional handling of user-item interactions by simple inner product and capture both high-order and low-order interactions to achieve better prediction performance.

•
Experimental results on benchmark datasets show that our MAHR model obtains better prediction results than the state-of-the-art models and the average relative improvement over PMF is up to 30.06%.
The extension of our journal paper to our previous conference publication [2] is as follows. In this article, we add the ID embedding of users and items to model the quality of users and items, which helps to identify users and items. In addition, the hybrid prediction layer is tuned to improve the memory efficiency. For the completeness of the model, we give a more precise description and analysis of our proposed method (Section 3). To make our experiments more solid, we extended our experiments from the comparison methods (CTR and D-Attn in Section 4.1) to the parameter sensitivity analysis (Figures in Section 4.3). We also detail the experimental methodology in Section 4.2. The discussion on comparison performance (Table in Section 4.4) is extended in a variable-controlling approach (Section 4.4). We also carried a test to compare our full-text level mechanism quantitatively (Table in Section 4.7).
The reminder of our work is organized as follows. We present related work in Section 2. Section 3 describes our models in detail. Subsequently, Section 4 shows experimental analysis, followed by the conclusion and future work in Section 5.

Related Work
In recent years, deep learning has shown great improvement in pattern recognition and other application fields [27][28][29][30]. In this section, we introduce the related techniques on the application of deep learning in the field of RS.

Attention Mechanism in RS
The idea of attention mechanism is inspired from the human's ability to allocate attention on the observed objects according to tasks while ignoring the impact of others [31]. Thus far, the attention mechanism has been applied widely in many tasks of deep learning, such as information retrieval [32], recommendation [33] and machine translation [34]. The core of soft attention is to train an efficient and effective neural network to assign attentive weights for features vectors in an end-to-end manner. In the field of RS, He et al. [35] introduced an attention mechanism in CF to learn to assign both component-level and item-level attention weights on a set of features for multimedia recommendation. As for review text, Chen et al. [21] adopted attention mechanism to learn the usefulness of each review and extract the feature vector of users and items for predicting item ratings and generating explanation. In addition, Seo et al. [23] used convolutional neural networks with dual local and global attention networks to model user preferences and item properties. The attention layer selects keywords that account for to the rating by a local or global window. In this paper, we propose a multi-level attention mechanism that consists of word level attention layer and full-text attention layer to better model users and items for predicting item ratings.

Capturing High-and Low-Order User-Item Interactions
The coupling of FM and DNN has first introduced in click-through rate (CTR) prediction. DNN has shown effectiveness in CTR for learning sophisticated feature interactions [36]. Traditional approaches such as FM give a close approximation to the low-order feature interactions [11]. Many studies on the modeling of both high-order and low-order interactions [14,25,26] have been carried out. Cheng et al. proposed a network consisted of linear part and deep part to capture high-and low-order interactions. To leverage FM, Zhang et al. proposed FNN [25], which is a feed-forward neural network to model user-item interactions with an FM-initialized input. Based on the model mentioned above, Guo et al. proposed an end-to-end DeepFM model without any pre-train or feature engineering. DeepFM makes the "FM part" and "deep part" share the same input and shows impressive performance in CTR. In this paper, we follow the model of DeepFM and design a hybrid prediction structure for rating predictions. The main difference between our hybrid prediction structure and DeepFM is that the input of our model is the extracted feature vector from attention-based neural networks while that of DeepFM is the raw and sparse feature vector. Thus, our prediction structure does not include dense embedding layer, which enhances the memory efficiency of our model.

Overview
To select both keywords and useful reviews, we utilize the multi-level attention mechanism to automatically assign weights to reviews and words when modeling users and items. Figure 1 illustrates the architecture of MAHR. The model contains two similar attention-based neural networks for users (N et u ) and items (N et i ) respectively, after which is the vector concatenation layer and the hybrid prediction layer. Note that the hybrid prediction layer consists of Factorization Machine and Deep Neural Network to capture the feature of concatenated vector more efficiently. Reviews written by the same users or on the same items are fed into to N et u and N et i . The corresponding rating is calculated as the sum of FM and DNN (we can also use other methods to combine the output of FM and DNN). In the following subsections, due to the similarity between N et u and N et i , we only give the details N et i . The same analysis is also practicable for N et u .
In the first stage of N et i , the word-attention CNN Processor is applied to process the textual reviews of item i. In this processor, the network processes each review of item i, respectively. Specifically, each review is first transformed into a matrix of word vectors by the word-embedding technique, which we denote as V i1 , V i2 , · · · , V ik . Then, these matrices are sent to word attention CNN and the feature vectors of them can be obtained from the output. These feature vectors are denoted as O i1 , O i2 , · · · , O ik . Note that the keywords obtain higher weights in the word-attention CNN Processor, which will result more in precise feature vector of reviews than the normal CNN without word level attention mechanism.
The second stage of N et i is the full-text-attention layer. The inputs are the feature vectors from word-attention CNN and the respective ID embedding of users who writes reviews for item i. In the full-text-attention layer, we calculate the contribution of each review and aggregate these vectors to get the representation of item i: where a il is the corresponding weight of the l-th review for item i, which will learn by the full-text-attention layer.
After we obtain the feature vector of item i, we pose a fully connected layer to customize the dimension of vectors according to the number of latent factors. The output of N et i is the latent vector of item i, which is denoted as Y i . Recall that the output of left network N et u is the latent vector of user u, which is denoted as X u .
The final structure is the prediction layer. In MAHR, we adopt the assumption of latent factor model that the rating of users to items are the results of the interaction of users' and items' latent vector. We extend the origin handle of user-item interaction by replacing the inner product with the coupling of FM and DNN. The objective function is the 2 -norm of recorded rating matrix. To avoid overfitting, we adopt the dropout technique to improve the generalization ability of MAHR.

Word-Level Attention-Based Mechanism
Inspired by the human visual attention, we further develop a word attention mechanism for reviews. When we read text or see images, we probably focus on a certain part of the input to understand or recognize them more efficiently. Generally, people tend to be attracted by the significant word in a local part. Here, we introduce a word-level attention module to measure the significance of one word.
Suppose X is a review with T word embeddings (x 1 , x 2 , · · · , x T ) and X att,i is a slip window. Here, x i is the center word and w is the width of local window. The significance for each center word in window is given by: where s i is the informative score of the i-th word and W 1 att ∈ R w×d is a parameter matrix and b 1 att is a bias. The * is the operation of element-wise multiplication and sum. The score can be direct used as a weight for i-th word embedding or we can apply a threshold to remove "trivial" words and only consider informative attention words. In this work, we use scores as weights. We use sigmoid for the activation function g. The weighted word embedding is: To learn a global semantic representation, the weighted word embedding matrix X is then fed into textual CNN module, which consists of a 1-D convolutional layer and max pooling layer. The output is the feature vector O il of the l-th review in item i. The textual CNN module includes a convolutional layer and a max-pooling layer. Suppose the number of filters is n att . Let W 2 j ∈ R t×d be a filter and b 2 j be a bias where j ∈ [1, n att ]. Then, O il is obtained by: where is the convolution operation and Max is the max-pooling operation. ReLU is a nonlinear activation function. The Max(z j ) is the feature obtained by convolution filter W 2 j and max-pool layer.

Full-Text Level Attention-Based Mechanism
The goal of the Full-Text Level Attention-based Layer in N et i is to calculate the significance of one review for the features of item i and then aggregate all weighted reviews to characterize item i. A two-layer network is used for the attention score a il . The input contains the feature vector of the l-th review of item i (O il ) and the user who wrote it (ID embedding, u il ). The ID embedding is added to model the quality of users, which helps identify users who always write less-useful reviews. Formally, the attention network is defined as: h ∈ R t , and b 2 ∈ R 1 are parameters. The feature vector O il is obtained by the word-attention CNN. In addition, u il is calculated from one-hot code of user ID by the lookup operation in tensorflow. t denotes the size of hidden layer. The final weights of reviews are predicted by the softmax function to normalize the above attention scores. The contribution of the l-th review to the final feature vector of item i is given by: After we obtain the attention weight of each review, the feature vector of item i is calculated as Equation (1).
Then, O i is sent to a fully connected layer with weight matrix W 0 ∈ R n×k 1 and bias b 0 ∈ R n to customize the dimension of latent vector. The final representation of item i is given by:

Hybrid Prediction Layer
We aim to capture both linear and non-linear interactions between users and items. We develop a hybrid prediction layer in MAHR. As shown in Figure 1 , the prediction layer consists of two components, FM component and Deep component, which share the same input. First, let us concatenate X u and Y i into a single feature vector z i = (X u , Y i ). For the feature vector z i , the predicted rating R is given by: where R FM and R NN are, respectively, the output of FM component and deep component. The FM component is a factorization machine [11]. In addition to modeling first-order linear interactions among features, FM also uses a dot product between vectors to model a second-order pairwise feature interactions. The output of FM is the sum of a bias and the two kinds of interactions. R FM is given by: where w 0 is the global bias, w i measures the impact of the i-th variable in z i and < v i , v j > models the second-order interactions.
The deep component is a deep neural network for learning the high-order feature interactions. The architecture of the network is given by: where l is the depth of neural network and σ is an activation function. a (l) , W (l) , and b (l) are the input, weight matrix, and bias of the l-th layer, respectively. The prediction layer for R NN in deep component is given by: where H is the number of hidden layers and σ is the sigmoid function.
Based on DeepFM, all parameters, including w i , v i , and the network parameters (W (l) and b (l) ), are trained jointly for the combined prediction model.

Learning
The task that we focus on in this paper is rating prediction, which actually is a regression problem. For regression, a popular objective function is the squared loss. The objective function is given by: where Ω denotes the set of instances for training and R u,i is the rating assigned by user u to item i. To optimize the objective function, we adopt the Adaptive Moment Estimation (ADAM) as the optimizer. It can adjust the learning rate during the training phase, which avoids the process of choosing an efficient learning rate and results in the faster convergence than the SGD. To alleviate overfitting, we consider dropout [37], a widely used method in deep learning models. Dropout stop working during testing and we use the whole network for prediction. Through dropout, we can prevent complex coadaptations of neurons on training data. Moreover, dropout may potentially improve the performance of the whole neural network due to the side effect of performing model averaging with smaller neural networks.

Experiment
We performed extensive experiments on three popular rating datasets with reviews to verify the effectiveness of MAHR model in comparison with other state-of-the-art approaches. We first give the details of experimental settings in Section 4.1, including datasets description, comparison methods, evaluation metric, and parameter settings. The experimental methodology is presented in Section 4.2, followed by a parameter sensitivity analysis in Section 4.3. Besides, we present the performance comparison in Section 4.4. In Sections 4.5 and 4.6, we present the validation of the effectiveness of multi-level attention mechanism and hybrid prediction structure, respectively. Afterwards, the explainability analysis is presented.

Datasets
We tested MAHR methods on three popular datasets. The first dataset is Yelp Challenge 2015 (https://www.yelp.com/dataset/challenge), a dataset for restaurant ratings and comments. Yelp contains more than one million reviews and thirty thousand users. The second and third dataset are both from Amazon (http://jmcauley.ucsd.edu/data/amazon/) product data with five cores [38]: Books (8,898,041 reviews and 22,507,155 ratings) and Electronics (1,689,188 reviews and 7,824,482 ratings). These two datasets were selected to cover the most classic scenarios for online services and applications, i.e., reading books and online shopping. Table 1 shows the statistics of three datasets. In Table 1, we can find that the three datasets have different sparsity in rating matrix. All of these datasets have more than one million reviews, while users in Yelp and Electronics provide fewer reviews than Books on average, which indicates that Yelp and Electronics are much sparser than Books. As mentioned in the Introduction, the sparsity problem could deteriorate the prediction performance of recommender systems. Note that the lengths of reviews in all datasets are less than 150 words on average.

Comparison Methods
To show the superiority of our MAHR method, we selected seven comparison methods: Probabilistic Matrix Factorization (PMF) [5], Non-negative Matrix Factorization (NMF) [39], Latent Dirichlet Allocation (LDA) [40], Collaborative Topic Regression (CTR) [41], Deep Cooperative Neural Networks (DeepCoNN) [19], Neural Attentional Regression model with Review-level Explanations (NARRE) [21], and dual attention-based model (D-Attn) [23]. These methods can be divided into three categories: (1) traditional CF model without leveraging review (PMF and NMF); (2) topic modeling based approaches with the application of review information (LDA and CTR); and (3) deep recommender systems with the utilization of review (DeepCoNN, NARRE, and D-Attn). The first category was the blank group to validate whether the review information is helpful to for recommender system. The second category was the topic-model group to compare deep recommender model with topic modeling based recommender systems. The third category was the deep-model group to compare our MAHR model with other deep RS with review information. The topic-model group and deep-model group import the review information to improve the prediction performance. The characteristics of the comparison methods are listed in Table 2.
• PMF: Probabilistic Matrix Factorization models the rating matrix as the product of two lower-rank user and item matrix with a Gaussian error distribution and obtains good performance on the large, sparse, and very imbalanced datasets. • NMF: Non-negative Matrix Factorization is another useful matrix factorization with only rating matrix involved. • LDA: Latent Dirichlet Allocation is a famous topic modeling algorithm. By employing LDA on review text, we can learn a topic distribution for each item and obtain the topic preference for each user.

Parameter Settings
The dataset was randomly split into training set (80%), validation set to tune hyper-parameters (10%), and test set (10%). The hyper-parameter for comparison methods were initialized according to the corresponding papers and tuned based on the performance on the validation set to obtain optimal performance.
For PMF and NMF, we selected the optimal parameter for the number of latent factors from {10, 20, 40, 60, 80, 100}, and regularization parameter from {0.001, 0.01, 0.1, 1.0}. For LDA and CTR, we chose the number of topics from {5, 10, 20, 50, 100}. After searching, the number of latent factors for PMF and NMF were set to 60 and the number of topics for LDA and CTR was 10. Following the experiments in [19], the hyper-parameters for CTR were: α = 0.1, λ u = 0.02 and λ v = 10.
For deep models, we reused the parameters for deep neural network reported in the corresponding papers. Moreover, Google News was imported as a pre-trained word embeddings and the dimension of word embedding is 300. For MAHR model, we searched the dropout ratio from {0.1, 0.3, 0.5, 0.7, 0.9} and the number of hidden layer from {1, 2, 3, 4}. The number of latent factors was searched from {10, 20, 40, 60, 80, 100}. The number of hidden layer was 2 and the dropout ratio was 0.5. The window size for the word attention layer was 5 and the number of convolutional kernels was 100. The latent factor number was 60.

Evaluation Metric
In the experiments, we adopted the Root Mean Square Error (RMSE) to evaluate the prediction performance of our MAHR model. RMSE has been used in plenty of works on RS, including for both traditional and deep learning methods [5,21]. RMSE can be defined as follows: where TEST is the set of observation.

Methodology
We followed a four-step methodology to demonstrate the performance of our proposed MAHR model: • To explore the optimal parameter of MAHR model and the parameter sensitivity, we carried out a parameter study on the validation set. We evaluatd our approach based on the optimal parameters. • To analyze the prediction performance of MAHR, we compared the RMSE of different methods based on the test set. We made some observations based on the results.

•
To validate the effectiveness of MAHR model, we produced some variant MAHR models (MAHR-multi, MAHR-word, MAHR-full-text, MAHR-non, MAHR-hybrid, MAHR-fm, MAHR-nn, and MAHR-dot-product) by controlling the application of multi-level attention mechanism and hybrid prediction structure. We also observed the advantage of our methods in comparison with variant methods.

•
To show the explainability, we visualized our multi-level attention layers and the corresponding weights. We also calculated the prediction and recall of our full-text level attention for predicting useful reviews on Amazon datasets (Books and Electronics). The comparison results with simple strategies (random, latest, and longest) are reported.

Parameter Sensitivity Analysis
Based on the Yelp and Books dataset, we analyzed the influence of different hyper-parameters of traditional approaches and deep models including DeepCoNN, NARRE, D-Attn, and MAHR. The order was: (1) dropout rate; (2) number of hidden layers; and (3) number of latent factors.
We first studied the impact of the drop rate. We searched the dropout from {0.1, 0.3, 0.5, 0.9}. The results of deep models with respect to different dropout ratios are shown in Figure 2. we found that all methods benefited from a proper value of dropout ratio and could reach the peak of prediction performance when dropout rate was 0.5, which demonstrates that the dropout technique could prevent overfitting and improve the prediction accuracy. Basically, all of these curves witnesses a decrease in RMSE as the dropout rate oncreased after 0.5. These results prove that a proper dropout rate could enhance the robustness of model.
In addition, we also observed that the RMSE curve on Yelp dataset changed more sharply than that on Books dataset, which indicates the prediction accuracy of Yelp dataset was more sensitive to dropout rate than the prediction accuracy of Books dataset. We think the difference in the parameter sensibility was caused by the difference on the sparsity of dataset, given the common phenomenon in deep learning that a small dataset tends to be more likely overfit without dropout.
We then studied the effect of the number of hidden layers in hybrid prediction structure. Because the DeepCoNN, NARRE, and D-Attn have no hidden layers in prediction layer, we only carried out the second parameter study on MAHR. As presented in Figure 3, increasing number of hidden layers in the hybrid prediction layer raised the accuracy of the models at the beginning. However, when we kept increasing the number of hidden layers, their performance degraded, as a result of overfitting.
We finally explored the effect of the number of latent factors. The results are shown in Figure 4 For deep models group (DeepCoNN, NARRE, D-Attn, and MAHR), the number of latent factors was equal to the dimension of input vector for prediction layer. Generally, we observed that the RMSE curves of all deep methods showed a gentle and trivial change while the blank group (PMF and NMF) and topic model group (LDA and CTR) had a bigger change than deep model group. This demonstrates the benefits of deep learning for extracting the feature of review text. We found that MAHR achieved the best RMSE performance on both datasets and under all latent factor numbers, in comparison with other deep models, which shows the advantages of multi-level attention mechanism and hybrid prediction layers.

Performance Comparison
The performance of MAHR and the baselines are reported in terms of RMSE in Table 3. The best performance is shown in bold and the averages on three datasets are reported. From the results, several observations can be made.
First, we found that both traditional matrix factorization methods (PMF and NMF) in blank group did not obtain comparable performance to those methods in the control group that utilize reviews. The gap in performance between blank group methods and control group methods validated our hypothesis that review text could supply additional information and considering reviews in models further improve the accuracy of rating prediction.
Secondly, although the simple employment of topic modeling methods (LDA and CTR) to learn topic characters from item reviews could improve the performance of recommendation system, deep models group better captured the feature of review when compared with topic modeling group. By modeling ratings and reviews together and using supervised learning for regression tasks, DeepCoNN, NARRE, and MAHR obtained additional improvements. In these experiments, LDA modeled reviews without the feedback from users ratings. Thus, the learned unsupervised features from LDA might be not as efficient as the expectation. Compared with LDA, CTR obtained additional improvements by jointly modeling reviews and rating. However, CTR could not beat deep methods on all datasets because of the weaker capacity of jointly modeling and extracting semantic features. Thirdly, as shown in Table 3, our method MAHR consistently outperformed all baseline methods. Although review information was useful in recommendation, the performance varied depending on how the review information was utilized. Our model proposes a multi-level attention mechanics for extracting both word-level and full-text-level information. This allows a review to be modeled with a finer granularity, which can lead to a better performance according to the results. Compared to PMF, our approach gained 30.06% improvement on average. In Table 3, all methods obtained better performance on Book dataset than on Yelp and Electronics. We think it was caused by the sparsity of Yelp and Electronics. We found that the method using review information could effectively alleviate the sparsity problem on all datasets, which validates the hypothesis that review information can help model user and item, leading to additional improvement.

Effect of Multi-Level Attention
We next focused on validating the effeteness of multi-level attention mechanism. We produced several variant MAHR models through assigning normalized constant weights. The variant MAHR models include the original MAHR model with multi-level attention mechanism (MAHR-multi), the MAHR model with word level attention mechanism (MAHR-word), the full-text level attention mechanism one (MAHR-full-text), and the model without any attention mechanism (MAHR-non). Note that, when we did not use full-text-level attention mechanics, a normalized constant weight was assigned to each review, and, when we did not consider the word-level attention, the Word-Attention degenerated to a normal textual CNN module. For MAHR-word (MAHR-full-text) model, we assigned normalized constant weights to reviews (words). For MAHR-non, we assigned constant weights to both words and reviews. In Figure 5, we compare the average RMSE on three datasets.
In the figure, we make several observations. First, all three methods with attention layer (MAHR, MAHR-word, and MAHR-full-text) outperformed MAHR-non model, which did not apply any attention mechanism. No matter what type the attention mechanism, the prediction accuracy benefited from the extracted information in review text, which justifies the common assumption that the significance of different reviews and words vary and proper modeling of different weights results in additional improvement. Second, the multi-level attention approach made the most precise prediction, which validates the usefulness of our approach. From the better performance of the full-text level approaches compared to the word-level one, we found that the full-text-level attention has a more significant contribution to the improvement. Third, the average RMSE of MAHR-non model was lower than that of DeepCoNN, which indicates the prediction accuracy of MAHR without any attention mechanism is better than the accuracy of DeepCoNN. This improvement was probably caused by the effective hybrid prediction layer in MAHR.

Effect of Hybrid Prediction
In this section, our controlled experiment on the effect of hybrid prediction layer is presented. Three new variants were used. The first one (MAHR-dot-product) is the traditional dot-product prediction layer: R = X u Y i . The last two approaches use FM (MAHR-fm) or NN(MAHR-nn) solely as prediction layer. We implemented the MAHR-dot-product model by switching off the hybrid prediction layer and replacing the vector concatenation layer with the dot-product output layer. As for MAHR-fm and MAHR-nn, we implemented them by switching off the Deep Neural Networks and Factorization Machine, respectively. We reused the hyper-parameter setting of MAHR model in the last two variants. The average RMSE on three datasets are shown in Figure 6.
As shown in Figure 6, the MAHR-dot-product obtained the worst performance in all methods. The difference in RMSE justifies that the simple dot product for prediction layer might miss information about the non-linear interaction between users and items and lead to a loss in prediction accuracy. Furthermore, we found that the MAHR-hybrid made the most precise prediction, which validates the usefulness of our hybrid prediction layer. When compared with MAHR-FM, MAHR-NN obtained higher prediction accuracy. We think the reasons were as follows. First, the deep neural networks could capture the non-linear user-item interactions, while our implemented FM could only model the second-order interactions, which is the major limitation of prediction layer. Second, the dropout technique in deep learning could avoid potentially overfitting and obtain additional improvement. In addition, the RMSE of MAHR-dot-product was apparently lower than that of LDA and CTR, which also use the inner product of latent vectors to predict rating. The decrements in RMSE probably benefited from the robust multi-level attention mechanism.  Figure 6. Effect of hybrid prediction layer.

Word Level Explanation
To validate our design on the word-level attention module, we highlighted keywords with high weights in the attention module. Colored words were considered as informative words, and green words had higher attention scores than those of blue words. We randomly selected the same review from Yelp but highlighted differently by the user network and the item network in Figure 7.
We made two key observation. First, similar keywords were highlighted in the user network and item network. All of these keywords were likely words that describe properties of the item or some more personalized words. However, the two networks chose different attentional words, because the two networks were trained with different sets of reviews and the network decided the keywords by reviews and ID embedding. For example, the user network gave high scores to keywords expressing subject feelings, e.g., "good", "excited" and so on. The item network tended to focus on nouns that describe properties of an item.

Full-Text Level Explanation
To analyze the explainability of full-text-level attention module, we first provide some reviews and their final attention weights in Figure 8 and then compare the prediction performance of full-text-level attention module in Table 4.   Figure 8 shows examples of the high-weight and low-weight reviews selected by our model, where a ij means the weight of attention. The first reviews for Item 1 and Item 2 had higher weights, while the second reviews for Item 1 and Item 2 were less helpful reviews with lower weights. Generally, the reviews with high attention weight contain more information about the item. For example, the buyers can easily get the feature of each item from Reviews 1a and 2a, which is highly instructive for making purchasing decisions. In contrast, the low attention reviews only contain the authors' general opinions, but give fewer details to help make a decision.
We carried out a prediction and recall test on the Books and Electronics datasets, which contain some reviews that have been rated useful by other users. We assumed that the rated reviews are ground truth to study the performance of full-text-level attention module. We only reserved the items having at least one rated review. We selected three comparison methods: Latest (the rated reviews list are generated by selecting the latest K reviews), Random (the rated reviews list are generated randomly), and Length (the rated reviews list are generated by selecting the longest K reviews). Our MAHR selected K reviews with the highest weights. We calculated the precision and recall for a review list with K reviews according to Precision@K and Recall@K in [21]. Precision@K and Recall@K are as follows: where rel j = 1/0 indicates whether the No.j review in the Top-K list is rated helpful. The Re rated i is the number of rated reviews in item i. To evaluate the effect of length of review list, we set K = 1 and 10.
In Table 4, we can find that the precision and recall of MAHR were impressively better than that of the three other methods. It shows that the weights obtained by full-text-level attention module are consistent with the users' needs and perceptions. By applying the full-text level attention mechanism, the significance of different reviews can be learned effectively.

Conclusions and Future Work
In this paper, We propose a Multi-level Attentional and Hybrid-prediction-based Recommender (MAHR) model that not only leverages the hybrid prediction structures to replace simple inner product of two latent vectors, but also innovatively implement a multi-level attention mechanism to combine word level significance and full-text level usefulness. It selects both useful words and reviews automatically to provide word-level and full-text-level explanations and make a more precise prediction. In addition, it models the non-linear interaction between user and item in a hybrid prediction layer, which couples the factorization machine to a deep neural network. Extensive experiments were made on three real-life datasets from Amazon and Yelp. The visualization and analysis of keyword and useful reviews validated the reasonability of our multi-level attention mechanism. In terms of recommendation performance, the proposed MAHR consistently outperformed the state-of-the-art recommendation models based on matrix factorization and deep learning in rating prediction. We believe this work offers a new approach to capture the context of recommendation systems.
In the future, we plan to combine the transfer learning and the latent factor model to build a more robust prediction layer for recommender system. Moreover, we are interested in the exploration of more advanced neural networks, e.g., Long Short-Term Memory (LSTM) network, which use sequence learning, to handle sequence and sentiment analysis in the review texts.