A Joint Summarization and Pre-Trained Model for Review-Based Recommendation

Abstract: Currently, reviews on the Internet contain abundant information about users and products, and this information is of great value to recommendation systems. As a result, review-based recommendations have begun to show their effectiveness and research value. Because large numbers of reviews have accumulated, extracting useful information from them has become very important. Automatic summarization can capture important information from a set of documents and present it in the form of a brief summary. Therefore, integrating automatic summarization into recommendation systems is a potential approach for solving this problem. Based on this idea, we propose a joint summarization and pre-trained recommendation model for review-based rate prediction. Through automatic summarization and a pre-trained language model, the overall recommendation model learns a fine-grained summary representation of the key content as well as the relationships between words and sentences in each review. The review summary representations of users and items are finally incorporated into a neural collaborative filtering (CF) framework with interactive attention mechanisms to predict the rating scores. We perform experiments on the Amazon dataset and compare our method with several competitive baselines. Experimental results show that the proposed model outperforms all of the baselines. Relative to the current best results, the average improvement obtained on four sub-datasets randomly selected from the Amazon dataset is approximately 3.29%.


Introduction
With the increasing abundance of products, research on high-quality recommendation systems, especially for the task of rate prediction, has become very important for online e-commerce platforms and users. Most early recommendation systems use collaborative filtering (CF), including user-based collaborative filtering and item-based collaborative filtering. A user-based collaborative filtering method makes recommendations by calculating the similarities between users, while an item-based collaborative filtering method makes recommendations based on the similarities between items. However, CF has its own limitations and drawbacks. First, it has difficulty generating reliable recommendations for users or items with few ratings (the well-known cold-start problem). Another drawback of CF technology is that it does not make full use of the available context information. In other words, context information, such as item attributes [1] or user profiles, is not considered when making recommendations.
Currently, many e-commerce websites not only encourage users to rate products but also encourage users to write product-related reviews. Users can comment on the advantages or disadvantages of the product as well as their experiences with the product in their reviews. User reviews supplement the rating process by providing a wealth of information about the item and the implicit preferences of the users. In addition, these reviews also explain why a user assigned a given rate to the corresponding item [2]. Therefore, to some extent, reviews can help users make purchase decisions, help companies make marketing decisions and provide interpretability for recommendation systems. As a result, user reviews are being gradually introduced into CF methods to alleviate the above problems.
Intuitively, to make full use of users' reviews, we can infer a user's preferences from all reviews written by that user; similarly, reviews of an item describe its outstanding attributes. Inspired by the successful use of deep neural networks on natural language processing (NLP) tasks, some recent works have been devoted to modeling user reviews using deep learning approaches. Common approaches usually concatenate all reviews (user reviews and item reviews) first and then employ neural-network-based methods (e.g., convolutional neural networks (CNNs) [3,4], gated recurrent units (GRUs), and long short-term memory (LSTM) [3,5–9]) to extract a vector representation of the concatenated reviews. However, not all of the reviews are useful for the given recommendation task. To capture the key information in the reviews, some models use attention mechanisms to emphasize it [10–13].
Commonly, in the above methods, a review document set is regarded as a set of sentences, and all operations are carried out on the sentence set. However, the lengths of review documents are different, and the relationships between sentences are also different. The above methods therefore lose the semantic and global information inside the review text. To this end, we propose modeling user reviews via a Joint Summarization and Pre-Trained Recommendation model (JSPTRec) for the task of rate prediction. This model applies automatic text summarization and compresses each review of a user or an item into a brief summary. In this way, not only the key information but also the relationships between words and sentences in the review are preserved. Then, we use a pre-trained model called "bidirectional encoder representations from transformers" (BERT) [14] to learn the deep semantic representations of the summaries. To capture more fine-grained user preferences or item properties, we use an attention mechanism to distinguish between different review summaries by interacting with user and item vectors. Finally, we incorporate the review information of users and items into a neural CF framework [15] to predict the final rating score. To the best of our knowledge, this is the first work to combine automatic summarization and a pre-trained model into a neural recommendation framework used for rate prediction. We compare our method with several competitive baselines on the Amazon dataset, and the experimental results demonstrate that our method outperforms all of them. The average improvement is approximately 3.29% over the current best results on the four utilized datasets. We also carry out an ablation study to verify the effectiveness of each part of the JSPTRec model.

Methods
In this section, we introduce our recommendation method, which models user reviews via a joint summarization and pre-trained model for rate prediction. The overall model is shown in Figure 1, and it consists of four parts: a review summarization layer, a BERT representation layer, an interactive attention layer, and a rate prediction layer. Table 1 shows the notations used with the model. All four parts rest on the assumption that there exists a K-dimensional latent factor space. Each user or item is represented as a feature vector in this K-dimensional space, and a user's rating of an item can be calculated from the corresponding feature vectors. We use $v^{user}_u$ to denote the vector for user u and $v^{item}_i$ to denote the vector for item i. For user u, $rw^{user}_{u,j}$ is the j-th review in u's review set $C^{user}_u = \{rw^{user}_{u,1}, rw^{user}_{u,2}, \ldots, rw^{user}_{u,n}\}$, and $e^{item}_{u,j}$ is the corresponding item ID embedding of review $rw^{user}_{u,j}$. Similarly, for an item i, $rw^{item}_{i,j}$ is the j-th review in i's review set $C^{item}_i$, and $e^{user}_{i,j}$ is the corresponding user ID embedding of review $rw^{item}_{i,j}$. For a pair containing user u and item i, we define an affinity score $rate_{u,i}$ that models user u's preference for item i.

Review Summarization Layer
To remove redundant information and retain useful information in the reviews, we use the unsupervised algorithm TextRank [16] to extract a summary of each review. TextRank is a graph-based ranking algorithm that models a review as a graph G = (V, E). The node set V represents the sentences in the review. E is the set of edges, the weights of which represent the similarities between sentences. The similarity between sentence $V_i$ and sentence $V_j$ can be calculated with the following formula:
$$w_{ij} = \frac{|\{word_k \mid word_k \in V_i \,\&\, word_k \in V_j\}|}{\log|V_i| + \log|V_j|}$$
where $w_{ij}$ is the weight of the edge between node $V_i$ and node $V_j$, $word_k$ is a word shared by the sentences, $|\{word_k \mid word_k \in V_i \,\&\, word_k \in V_j\}|$ is the number of words common to sentence $V_i$ and sentence $V_j$, $|V_i|$ is the length of sentence $V_i$, and $|V_j|$ is the length of sentence $V_j$. We give each node an initial value that indicates the importance of each sentence and then iteratively update the values of the nodes with the following formula:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
where $WS(V_i)$ is the value of node $V_i$, $In(V_i)$ is the set of nodes pointing to $V_i$, $Out(V_j)$ is the set of nodes to which $V_j$ points, $w_{ji}$ is the weight between node $V_j$ and node $V_i$, and d is a damping factor between 0 and 1. Through multiple iterations, the values of the nodes tend to converge, and the values reflect the importance of the nodes. Because these nodes correspond to the sentences in the user review, these values also represent the importance of the sentences in the review. Then, we rank the sentences by their importance scores and choose the top-ranked sentences as the summary of the review.
For each user review $rw^{user}_{u,j}$, we calculate the importance of each sentence in $rw^{user}_{u,j}$ using the TextRank method and then select the K most important sentences as the summary $s^{user}_{u,j}$. K is obtained as $\mu \times |rw^{user}_{u,j}|$, where µ is the proportion of the review kept in the summary and $|rw^{user}_{u,j}|$ is the number of sentences in the review. Similarly, for each item review $rw^{item}_{i,j}$, the summary $s^{item}_{i,j}$ is calculated in the same way.
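Therefore, user u's original review set $C^{user}_u = \{rw^{user}_{u,1}, rw^{user}_{u,2}, \ldots, rw^{user}_{u,n}\}$ and item i's original review set $C^{item}_i = \{rw^{item}_{i,1}, rw^{item}_{i,2}, \ldots, rw^{item}_{i,m}\}$ can be replaced by the corresponding summary sets $S^{user}_u = \{s^{user}_{u,1}, s^{user}_{u,2}, \ldots, s^{user}_{u,n}\}$ and $S^{item}_i = \{s^{item}_{i,1}, s^{item}_{i,2}, \ldots, s^{item}_{i,m}\}$, respectively.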
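As an illustration of the summarization step described above, the following is a minimal pure-Python sketch of TextRank-style extractive summarization. It is not the implementation used in our experiments. The regular-expression sentence splitter, the fixed number of iterations (instead of a convergence test), the rounding of µ × |rw| up to an integer, and the function names `split_sentences`, `similarity`, and `textrank_summary` are all assumptions made for the example.

```python
import math
import re


def split_sentences(review):
    """Naive sentence splitter (an assumption; any sentence tokenizer would do)."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]


def similarity(a, b):
    """Shared-word count normalized by the log lengths of both sentences."""
    wa = re.findall(r"\w+", a.lower())
    wb = re.findall(r"\w+", b.lower())
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    if denom == 0:  # both sentences contain a single word
        return 0.0
    return len(set(wa) & set(wb)) / denom


def textrank_summary(review, mu=0.6, d=0.85, iterations=50):
    """Return the top mu-fraction of sentences, kept in their original order."""
    sentences = split_sentences(review)
    n = len(sentences)
    if n == 0:
        return []
    # Symmetric edge weights between every pair of distinct sentences.
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in w]
    scores = [1.0] * n  # initial importance of every sentence
    for _ in range(iterations):
        scores = [
            (1 - d) + d * sum(
                w[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    k = max(1, math.ceil(mu * n))  # assumed rounding rule for mu * |rw|
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `textrank_summary(review, mu=0.6)` on every review of a user or an item would yield the summary sets used by the later layers. Returning the selected sentences in their original order is a design choice of this sketch, intended to keep the relationships between neighboring sentences readable.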

BERT Representation Layer
After obtaining the summaries of the user reviews and item reviews, we use the BERT model [14] to further learn the text representations of the summaries. BERT is an effective pre-trained model that builds upon the transformer architecture and is pre-trained with two objectives. First, it randomly masks about 15% of the input tokens and tries to predict those masked tokens. Second, BERT takes an input sentence and a candidate sentence and then predicts whether the candidate sentence follows the input sentence. We choose BERT-Base with 12 layers, 768 hidden dimensions, 12 attention heads, and 110 M parameters as our initial embedding model. The BERT parameters are fine-tuned during the training process of our model. Each summary s is represented as a matrix $s \in \mathbb{R}^{W \times E}$, where W is the length of the summary and E is the embedding dimensionality of the words. Then, we perform an average pooling operation over the words of each user summary and each item summary separately. Hence, each summary is represented as an E-dimensional vector.
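Finally, the user's summary set $S^{user}_u = \{s^{user}_{u,1}, s^{user}_{u,2}, \ldots, s^{user}_{u,n}\}$ and the item's summary set $S^{item}_i = \{s^{item}_{i,1}, s^{item}_{i,2}, \ldots, s^{item}_{i,m}\}$ can be represented as comprehensive summary matrices $S^{user}_u \in \mathbb{R}^{n \times E}$ and $S^{item}_i \in \mathbb{R}^{m \times E}$, respectively.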
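The pooling step can be made concrete with a short sketch. The code below assumes that the encoder output for one summary is already available as a W × E list of word vectors; the encoder itself is not shown, and the function names `mean_pool` and `encode_summaries` are hypothetical.

```python
def mean_pool(token_embeddings):
    """Average a W x E matrix of word vectors into one E-dimensional vector."""
    if not token_embeddings:
        raise ValueError("a summary must contain at least one word vector")
    width = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[k] for tok in token_embeddings) / width for k in range(dim)]


def encode_summaries(summaries):
    """Pool each of n summaries, giving an n x E matrix of summary vectors."""
    return [mean_pool(s) for s in summaries]
```

For example, a summary with the two word vectors `[1.0, 2.0]` and `[3.0, 4.0]` is pooled to `[2.0, 3.0]`, and a user with n summaries is mapped to n such rows.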

Interactive Attention Layer
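To focus on the review summaries that are important for predicting user preferences, we use an attention mechanism to capture the interactions between the user and the corresponding item. Given a set of summary representations $S^{user}_u = \{s^{user}_{u,1}, s^{user}_{u,2}, \ldots, s^{user}_{u,n}\}$ for user u, we can calculate the attention score $\alpha^{user}_{u,j}$ of $s^{user}_{u,j}$ with the following formulas:
$$a^{user}_{u,j} = W_3^{\top}\,\mathrm{ReLU}(W_1 s^{user}_{u,j} + W_2 e^{item}_{u,j} + b_1) + b_2$$
$$\alpha^{user}_{u,j} = \frac{\exp(a^{user}_{u,j})}{\sum_{l=1}^{n} \exp(a^{user}_{u,l})}$$
where $a^{user}_{u,j}$ is the unnormalized score; $W_1, W_2 \in \mathbb{R}^{E \times E}$, $W_3 \in \mathbb{R}^{E \times 1}$, $b_1 \in \mathbb{R}^{E}$, and $b_2 \in \mathbb{R}$ are all trainable parameters; $e^{item}_{u,j}$ is the corresponding item ID embedding of user u's review $rw^{user}_{u,j}$; n is the number of reviews provided by a user; and ReLU [17] is a nonlinear activation function:
$$\mathrm{ReLU}(x) = \max(0, x)$$
For j = 1, 2, 3, . . . , n, we can obtain attention scores $\alpha^{user}_u = \{\alpha^{user}_{u,1}, \alpha^{user}_{u,2}, \ldots, \alpha^{user}_{u,n}\}$ for the user's summary set $S^{user}_u$. Similarly, we can calculate the attention scores $\alpha^{item}_i$ with formulas of the same form, using $s^{item}_{i,j}$ in place of $s^{user}_{u,j}$, $e^{user}_{i,j}$ in place of $e^{item}_{u,j}$, and m in place of n, where $\alpha^{item}_{i,j}$ is the attention score of item i's j-th summary $s^{item}_{i,j}$, $e^{user}_{i,j}$ is the corresponding user ID embedding of item i's review $rw^{item}_{i,j}$, and m is the number of reviews for an item. For j = 1, 2, 3, . . . , m, we can obtain attention scores $\alpha^{item}_i = \{\alpha^{item}_{i,1}, \alpha^{item}_{i,2}, \ldots, \alpha^{item}_{i,m}\}$. Subsequently, the final review representation vectors of the user reviews and item reviews can be obtained by a weighted summation of the summary representations of all reviews with the following formulas:
$$A^{user}_u = \sum_{j=1}^{n} \alpha^{user}_{u,j}\, s^{user}_{u,j}, \qquad A^{item}_i = \sum_{j=1}^{m} \alpha^{item}_{i,j}\, s^{item}_{i,j}$$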
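The attention-weighted pooling described in this section can be sketched as follows. This is an illustrative example rather than our training code. It assumes an additive scoring function built from the parameters named above (two E × E matrices, an E-dimensional output vector, and two biases), a softmax normalization over the reviews of one user or item, and ID embeddings that have the same dimensionality E as the summary vectors. The names `matvec` and `attention_pool` are hypothetical.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def attention_pool(summaries, id_embeddings, W1, W2, w3, b1, b2):
    """Score each summary against its paired ID embedding, then pool.

    Returns the normalized attention weights and the weighted sum of the
    summary vectors.
    """
    logits = []
    for s, e in zip(summaries, id_embeddings):
        # ReLU applied element-wise to W1 s + W2 e + b1
        hidden = [max(0.0, x + y + b)
                  for x, y, b in zip(matvec(W1, s), matvec(W2, e), b1)]
        logits.append(sum(h * w for h, w in zip(hidden, w3)) + b2)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    alphas = [x / total for x in exps]
    dim = len(summaries[0])
    pooled = [sum(a * s[k] for a, s in zip(alphas, summaries))
              for k in range(dim)]
    return alphas, pooled
```

The same function can serve both sides: for a user, `summaries` holds the user's summary vectors and `id_embeddings` the embeddings of the reviewed items; for an item, the roles are swapped.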

Rate Prediction Layer
To focus on the effective information in the reviews used for recommendation, we incorporate the summaries of user u and item i to obtain review representation vectors $A^{user}_u$ and $A^{item}_i$, respectively. We then obtain the feature vectors of the users and items through their review representation vectors. $F_u, F_i \in \mathbb{R}^{D}$ are the final feature vectors of user u and item i, obtained with the following formulas:
$$F_u = W_u^{\top} A^{user}_u + b_u, \qquad F_i = W_i^{\top} A^{item}_i + b_i$$
where $W_u, W_i \in \mathbb{R}^{E \times D}$ are trainable weight matrices, D is the dimensionality of the feature vector, and $b_u \in \mathbb{R}^{D}$ and $b_i \in \mathbb{R}^{D}$ are the bias vectors of the user and item that can record long-term information.
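Finally, for a pair containing user u and item i, the affinity score $rate_{u,i}$, which can be viewed as user u's preference for item i, is computed from the feature vectors $F_u$ and $F_i$ by a linear output layer, where $W \in \mathbb{R}^{D \times 1}$ and $b \in \mathbb{R}$ are trainable parameters and b is the bias of the rating score. We use the mean squared error (MSE) as the loss function to train our model, and we optimize the model by minimizing the MSE between the output score from our model, $\widehat{rate}_{u,i}$, and the real score, $rate_{u,i}$, with the following formula:
$$L = \frac{1}{N} \sum_{(u,i)} \left(\widehat{rate}_{u,i} - rate_{u,i}\right)^2$$
where the sum runs over the user-item pairs in the training set and N is the number of such pairs.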
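A small sketch may help to fix the shapes involved in the prediction layer. The code below is an assumption-laden illustration, not our implementation: it projects the E-dimensional review representations to D dimensions, combines the two feature vectors with an element-wise product (one common choice in neural CF, assumed here purely for the example), and applies a linear output layer. The names `linear`, `predict_rating`, and `mse` are hypothetical.

```python
def linear(W, x, b):
    """Compute W^T x + b, with W stored as len(x) rows of len(b) columns."""
    return [sum(W[k][j] * x[k] for k in range(len(x))) + b[j]
            for j in range(len(b))]


def predict_rating(a_user, a_item, W_u, b_u, W_i, b_i, w, b):
    """Map the two review representations to a scalar rating."""
    f_u = linear(W_u, a_user, b_u)  # user feature vector, D-dimensional
    f_i = linear(W_i, a_item, b_i)  # item feature vector, D-dimensional
    # Assumption: element-wise product as the user-item interaction.
    interaction = [x * y for x, y in zip(f_u, f_i)]
    return sum(wk * h for wk, h in zip(w, interaction)) + b


def mse(predicted, actual):
    """Mean squared error between predicted and observed ratings."""
    return sum((p - r) ** 2 for p, r in zip(predicted, actual)) / len(actual)
```

With identity projections, zero projection biases, `w = [1.0, 1.0]`, and `b = 0.5`, the inputs `[1.0, 2.0]` and `[3.0, 4.0]` give 1·3 + 2·4 + 0.5 = 11.5. The same `mse` function serves as both the training loss and the evaluation metric.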

Results and Discussion
In this section, we empirically evaluate the various components of our proposed JSPTRec model for rate prediction. We conduct experiments to answer the following research questions: (i) How much can user reviews help rate prediction compared with CF baselines? (ii) Does our method perform better than other baseline methods that also use a hybrid CF and review-based recommendation approach? (iii) Can a summarization and pre-trained model help accomplish recommendation tasks, and if so, which part is most useful?

Dataset and Evaluation Metric
We conducted experiments on four different datasets from Amazon Review Data (https://nijianmo.github.io/amazon/index.html) (accessed on 2 April 2019). Table 2 shows the numbers of users, items, and reviews in each dataset. The Amazon Review dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, prices, brands, and image features), and links (also viewed/also bought graphs). For each dataset, we randomly select 80% of the user-item pairs as the training set, 10% as the validation set, and 10% as the testing set. We use only the reviews in the training set to learn representations for the users and items and do not use the reviews in the validation and testing sets. In our experiments, we adopt the widely used MSE to evaluate the performances of the compared algorithms.
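The splitting protocol above can be sketched as follows. The fixed random seed, the `(user, item)` tuple representation, and the function names `split_pairs` and `training_reviews` are assumptions of this example; the review dictionary stands in for whatever storage is actually used.

```python
import random


def split_pairs(pairs, seed=0, train=0.8, valid=0.1):
    """Randomly split user-item pairs into training, validation, and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def training_reviews(reviews, train_pairs):
    """Keep only reviews whose (user, item) pair is in the training set."""
    keep = set(train_pairs)
    return {pair: text for pair, text in reviews.items() if pair in keep}
```

Filtering the reviews through `training_reviews` before building the user and item review sets keeps validation and test reviews out of the learned representations.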

Compared Methods
We compare our model with several competitive baselines, including rating-based CF methods and deep-learning-based methods that use reviews. For PMF and NMF, we use an open-source implementation on GitHub (https://github.com/JieniChen/Recommender-System) (accessed on 21 April 2020). For the other baselines, we use the code provided by the respective authors.

• Probabilistic matrix factorization (PMF (https://github.com/JieniChen/Recommender-System) (accessed on 21 April 2020)) [18]: PMF is a widely used rating-based CF method. PMF assumes that the elements in the scoring matrix are determined by the inner product of the user's latent preference vector and the item's latent attribute vector.
• Nonnegative matrix factorization (NMF (https://github.com/JieniChen/Recommender-System) (accessed on 21 April 2020)) [19]: NMF is also a rating-based CF method. It assumes that the decomposed matrices should satisfy nonnegativity constraints. NMF decomposes a nonnegative matrix into two nonnegative matrices.
• Hidden factors and hidden topics (HFT (http://cseweb.ucsd.edu/jmcauley/code/code_RecSys13.tar.gz) (accessed on 9 February 2021)) [20]: HFT models the given ratings using a matrix factorization model with an exponential transformation function to link the stochastic topic distribution obtained from modeling the review text and the latent vector obtained from modeling the ratings. It assumes that the topic distribution of each review is produced on either user factors or item factors. In this way, HFT can provide an interpretation of each latent factor because factors and topics are located in the same space.
• Multi-pointer co-attention network (MPCN) [10]: The MPCN is based on the idea that a few reviews are important and that the importance depends dynamically on the current target. To extract important reviews, the MPCN contains a review-by-review pointer-based learning scheme that matches reviews in a word-by-word fashion. The pointer mechanism used in the MPCN is essentially coattentive and can learn the dependencies between users and items.
• Neural attentional rating regression with review-level explanations (NARRE): The NARRE proposes an attention mechanism to explore the usefulness of different reviews. The weights of reviews are learned by an attention mechanism in a distant supervised manner. Moreover, the NARRE learns the latent features of users and items using two parallel neural networks.

Experimental Settings
For our JSPTRec model, the dimensionality of the user and item vectors and their ID embedding vectors is set to 32. The learning rate is set to 0.001. The proportion of the summary extracted from the review is set to 0.6. For all the baselines, we use the same settings as those in their original papers. For PMF and NMF, the number of factors is set to 100. For HFT, the latent dimensions and number of topics are both determined by the parameter K. We set K = 10, which is the same as that in the original paper. For DeepCoNN, the number of convolutional kernels is set to 100, and the window size is 3. D-Attn uses 200 filters and a window size of 5 for local attention, and it uses 100 filters and window sizes of [2,3,4] for global attention. The MPCN uses three pointers and 300 hidden dimensions to infer the affinity matrix.

Experimental Results
In Table 3, we compare the results of our method and the baseline methods on four different datasets. From the experimental results in Table 3, we can draw the following conclusions: (i) HFT, D-Attn, DeepCoNN, the MPCN, and the NARRE generally perform better than PMF and NMF because the review-based methods benefit from the introduction of user reviews. This indicates that user reviews are helpful for completing recommendation tasks.
(ii) D-Attn, DeepCoNN, the MPCN, and the NARRE outperform HFT, indicating that the deep-learning-based methods are more effective than CF-based methods in terms of modeling user reviews and understanding the semantic information in text.
(iii) By selecting or weighting user reviews, D-Attn, the MPCN, and the NARRE outperform DeepCoNN, which suggests that different reviews exhibit different importance levels for modeling users and items in rate prediction tasks.
(iv) The proposed JSPTRec model outperforms all the baseline methods. This shows that the recommendation method based on summarization and a pre-trained model is effective, as this approach can retain important review information and obtain the best recommendation results.

Parameter Sensitivity Analysis
We would like to analyze how sensitive the performance of our model is with regard to the parameters on the Musical Instruments dataset. First, we varied the dimensionality of the user and item vectors while fixing the other parameters. Figure 2 shows the performances of PMF, NMF, DeepCoNN, the NARRE, and our model. PMF is greatly influenced by dimensionality, and the accuracy of its predictions increases significantly with increasing dimensionality. NMF, DeepCoNN, and the NARRE all have stable performances with different numbers of dimensions, and our model achieves the best performance with all dimensionality settings. Then, we tested our model with different proportions of the summary extracted from the review. We set the proportion µ to 0.2, 0.4, 0.6, 0.8, and 1. In Figure 3, with the increase in µ, more information in the review is retained. When µ = 0.2 or 0.4, the recommendation effect is poor because too little review text is extracted, resulting in the loss of some of the valid information. We find that when µ = 0.6, the best results can be achieved. When µ = 0.8 or 1, the recommendation effect is slightly worse than that when µ = 0.6, which means that we can obtain almost all the useful semantic information from reviews by using only 60% of the review text, and too much text introduces noise.

Ablation Study
To test the effectiveness of each part of our model, we also conducted ablation experiments. JSPTRec-BERT, JSPTRec-TR, and JSPTRec-ATT are three weaker variants of our complete model. JSPTRec-BERT represents our model without the pre-trained model (BERT), in which the word embeddings are initialized with GloVe. JSPTRec-TR represents our model without the summarization model (TextRank); in JSPTRec-TR, all the reviews of a given user or item are concatenated into a long document as the input. JSPTRec-ATT represents our model without the interactive attention layer. Table 4 shows that JSPTRec outperforms JSPTRec-BERT, which demonstrates that the pre-trained model is effective in learning deep user preferences and item properties from user review texts. Furthermore, our model also outperforms JSPTRec-TR, the variant without summarization that uses all review texts, indicating that the summarization layer can retain the "key" information from the review text while reducing the calculations required of the model. Finally, JSPTRec obtains superior results to those of JSPTRec-ATT, which shows the effectiveness of the interactive attention mechanism.

Related Work
With the increasing amount of network information, recommendation systems have become widely used [22]. In this section, we present three lines of work that are related to our task, namely, CF-based recommendation, review summarization, and deep-learning-based review modeling.

Collaborative Filtering Based Recommendation
CF [23] uses the aggregated behaviors/tastes of a large number of users to suggest relevant items to specific users. Recommendations generated by CF are based solely on user-user and/or item-item similarities, and such methods are popular and widely deployed by Internet companies such as Amazon [24], Google News [25], and others. In addition to CF based on users and items, another kind of method exists: model-based CF, of which matrix factorization is a representative example. The main idea of matrix factorization is to construct an implicit semantic model; that is, by decomposing the sorted and extracted "user-item" scoring matrix, a user latent vector matrix and an item latent vector matrix can be obtained. There are many matrix factorization models, such as the latent factor model (LFM), singular value decomposition (SVD), and PMF. PMF [18] is widely used because it scales linearly with the number of observations and performs well on very sparse and imbalanced datasets. To improve the interpretability of the model, NMF [19] imposes a nonnegativity constraint on the two decomposed small matrices on the basis of SVD.
Cold start is a common problem in CF-based recommendation systems. Many scholars have tried to alleviate this problem by introducing various external information. Dhelim et al. [26] proposed a product recommendation system based on user interest mining and metapath discovery to alleviate the cold start problem. In addition, users' social relations also contain rich user characteristics. Khelloufi et al. [27] took advantage of the social relations between devices to select a suitable service that fits the requirements of the applications and devices, based on the observation that having a given personality type does not necessarily mean that you are compatible with people sharing the same personality type. Ning et al. [28] designed a friend recommendation system based on the big-five personality traits model and hybrid filtering, in which the friend recommended process is based on personality traits and users' harmony rating.
Using user reviews to alleviate the cold-start problem in CF has attracted extensive attention in recent years. Wang and Blei [29] first combined the merits of traditional CF and probabilistic topic modeling. Their collaborative topic regression (CTR) model integrates PMF and latent Dirichlet allocation (LDA) into the same probability framework in a tightly coupled way. HFT [20] models user reviews with matrix factorization and assumes that the topic distribution of each review is produced by the latent factors of the corresponding item. Ling et al. [30] proposed a unified model called "ratings meet reviews" (RMR) that combines content-based filtering with CF, harnessing the information of both ratings and reviews. RMR applies topic modeling techniques to the review text and aligns the topics with rating dimensions to improve prediction accuracy.
In recent years, CF has been combined with deep learning models. Most matrix factorization methods apply an inner product to the latent features of users and items. Salakhutdinov et al. [31] demonstrated that restricted Boltzmann machines (RBMs) can be applied to rate prediction tasks and slightly outperform carefully tuned SVD models. By replacing the inner product with a neural architecture that can learn an arbitrary function from data, He et al. [15] developed a general framework called neural collaborative filtering (NCF) and proposed leveraging a multilayer perceptron to learn nonlinear user-item interaction functions.

Review Summarization
With the growing number and depth of product reviews, it is an increasing challenge for customers and product manufacturers to gain a comprehensive understanding of their contents. Automatic summarization of reviews aims to mine and summarize all the customer reviews of a product so as to capture the important information. It is a key step for review document understanding and sentiment analysis [32,33]. Shimada et al. [34] proposed a method for generating a summary that contains sentiment information and objective information of a product. The authors use three features, namely the ratings of aspects, the value, and the number of mentions with a similar topic, to generate a more appropriate summary. Because product feature and opinion extraction is important to review summarization, Nyaung and Thein [35] treat review summarization as the task of relating opinion words to a certain feature. Mabrouk et al. [36] proposed a methodology to summarize aspects and spot opinions regarding them using a combination of template information with customer reviews in two main phases. Recently, some deep neural models have been used in review summarization. To achieve generative review summarization, a neural attention network model with sequence-to-sequence learning was developed [37]. Based on the characteristics of review summarization samples, its local attention mechanism is modified to place more attention weight on the start of the source sentence. Then, each word of the summary is generated through the end-to-end model. Xu et al. [38] proposed a neural review-level attention model to effectively learn user preference embeddings and product characteristic embeddings from their historical reviews. Then, they designed a personalized decoder to generate the personalized summary, which utilizes the representations of the user and the product to calculate saliency scores for words in the input review to guide the summary-generation process. Finally, a multi-task framework was used to jointly optimize the summary generation and rating prediction.

Deep Learning Based Review Modeling
Users often post reviews on the Internet, and these reviews contain rich semantic information about users and items. Recently, some works have employed deep learning algorithms to model auxiliary review information, such as the textual descriptions of items and preferences of users.
To capture multiangle features or multilevel features, some methods apply CNNs [4] to model user reviews. Seo et al. [2] proposed a CNN-based recommendation model with local and global attention. Kim et al. [39] combined CNNs with PMF to better capture contextual information. However, important semantic features may be contained in text segments of different granularities. Wang et al. [3] designed a hierarchical and fine-grained CNN-based recommendation model that can obtain multilevel user/item representations and match them separately. In addition to CNN models, the recurrent neural network (RNN) model and its variants (GRUs and LSTM) have also been adopted to extract rich semantic information from user reviews [40–45]. Li et al. [45] used gated RNNs to learn user and item latent representations from reviews. They designed a sequence decoding model based on a gated RNN called a GRU. This model not only predicts ratings but also generates abstractive tips based only on user latent factors and item latent factors [45].
To further select important information from review texts, attention mechanisms have been used for user review modeling [10,46–50]. Attention mechanisms can focus on important information or capture the correlations between users and items. Chin et al. [46] merged all reviews provided by a given user into a long document to extract aspect-level representations of users and items. Then, they used a coattention mechanism to build the correlation matrix between the users and items. To capture the relationship between reviews and a target item, Tay et al. [10] applied an attention mechanism at both the review level and word level to dynamically select important reviews for the target item. Similarly to Tay et al. [10], Liu et al. [47] also used multilayer attention. They utilized local and mutual attention on top of CNNs to jointly learn the features of reviews. Zhao et al. [48] used explicit behavior factors, such as retweeting and mentioning, to understand users and utilized an attention mechanism that could automatically learn the weights of different factors. However, existing techniques mainly extract the latent representations of users and items in an independent and static manner. Wu et al. [51] proposed a novel context-aware user and item representation learning model that uses two separate learning components to exploit review data and interaction data: review-based feature learning and interaction-based feature learning, respectively.
Some models also use additional information to supplement review-based recommendations. For example, Ye et al. [52] used not only reviews but also product images. They presented a novel collaborative neural model for rating prediction by jointly utilizing user reviews and product images. They coupled the processes of rating prediction and review generation via a deep neural network and generated review content using an LSTM-based model. Probability-based methods are also used in rating prediction. Lei et al. [53] used LDA, which is a Bayesian model, to model the relationships between reviews, topics, and words.
Separately, pre-trained language models such as ELMo [54], the generative pre-trained transformer (GPT) [55,56], and BERT [14] have shown good performance on many NLP tasks. The existing pre-trained language models are mainly based on RNNs [54,57,58] and transformers [14,55,56]. Among these, BERT (Bidirectional Encoder Representations from Transformers [14]) is a very effective pre-trained model that can obtain bidirectional representations of context. Motivated by the above successes, we propose modeling user reviews via a joint summarization and pre-trained model for the task of rate prediction. We perform automatic text summarization to compress each review into a brief summary that not only extracts the key information but also preserves the relationships between the words and sentences in the review. Then, a pre-trained model (BERT [14]) is used to learn the deep semantic representations of the summaries, and interactive attention is used to focus on important information and produce high-quality summary representations. Finally, we incorporate the review summary representations of users and items into a neural CF framework [15] to predict the rating score. To the best of our knowledge, this is the first work that combines automatic summarization and a pre-trained model into a neural recommendation framework for the task of rate prediction.

Conclusions
In this paper, we proposed a joint summarization and pre-trained recommendation model called JSPTRec for review-based recommendation. The model benefits from automatic summary extraction, a pre-trained model, and interactive attention mechanisms. We designed experiments to evaluate our model against several state-of-the-art models. First, via a comparison with CF-based methods, we found that user reviews were significantly helpful, indicating that it is important to introduce user review texts for rate prediction. Second, we found it beneficial to perform summarization to capture the important information from a large number of reviews. Third, we found that compared with other deep-learning-based methods, the pre-trained model BERT can learn better semantic representations of reviews for users and items. Finally, by using interactive user and item attention mechanisms, the recommendation performance of our model can be further improved.