Exploring Clustering-Based Reinforcement Learning for Personalized Book Recommendation in Digital Library

Abstract: Digital libraries, as one of the most important ways of helping students acquire professional knowledge and improve their professional level, have gained great attention in recent years. However, their large collections (especially of books) hinder students from finding the resources they are interested in. To overcome this challenge, many researchers have turned to recommendation algorithms. Compared with traditional recommendation tasks, book recommendation in a digital library faces two challenges. The first is that users may borrow books that they are not interested in (i.e., noisy borrowing behaviours), such as borrowing books for classmates. The second is that the number of books in a digital library is usually very large, which means that a student borrows only a small fraction of them (i.e., the data sparsity issue). As the noisy interactions in students' borrowing sequences may harm the performance of a book recommender, we focus on refining recommendations by filtering out data noise. Moreover, due to the lack of direct supervision information, we treat noise filtering in sequences as a decision-making process and introduce a reinforcement learning method as our recommendation framework. Furthermore, to overcome the sparsity of students' borrowing behaviours, a clustering-based reinforcement learning algorithm is developed. Experimental results on two real-world datasets demonstrate the superiority of our proposed method over several state-of-the-art recommendation methods.


Introduction
Digital libraries, as one of the most important ways of helping students acquire professional knowledge and improve their professional level, have gained great attention recently. Many universities have established digital libraries whose digital resources range from tens of thousands to millions. On the one hand, a digital library is more convenient to use and manage than a traditional library. On the other hand, it is a huge challenge for students to find the resources they need (such as books, reports, and periodicals). To overcome this challenge, we resort to recommender systems [1][2][3][4], which can leverage users' historical records to help them efficiently discover interesting and high-quality information. The book recommendation task in a digital library is to recommend books at time t + 1 to a set of users given their historical book borrowing records before time t.
Typical previous research on this task has focused on developing recommendation algorithms that can recommend books in a personalized way. For example, Yang et al. [5] introduced a book recommender system and proposed a model for book inquiry history analysis and book-acquisition recommendation. Sohail et al. [6] proposed an OWA-based ranking approach for book recommendation. They used an ordered weighted aggregation method to aggregate several rankings of top universities, which reduces the complexity of providing personalized recommendations to a large number of users. Priyanka et al. [7] developed a personalized book recommender system based on opinion mining and presented an online book recommender system for users who purchase books by considering the specific functions of books.
However, all the above methods ignore the widely accepted consensus that noisy data can mislead recommendation algorithms. For example, a student studying computer science and technology may have borrowed a few psychology books from the library for public elective courses or for other students. When recommending books to him/her, we should mainly focus on books related to computer science and ignore the influence of these psychology books. We define the items in a user's borrowing record that the user is not interested in as noise. Moreover, all the above methods ignore another fact: users' borrowing records are very sparse. A user may borrow only a few books in his/her college years, which makes the learned user interest model unreliable. We define the small number of books borrowed per user and the small number of times each book is borrowed as the data sparsity problem in the book recommendation task. The noise problem and the data sparsity problem are the two problems that need to be solved in book recommendation.
On the other hand, reinforcement learning-based recommender systems have achieved great success in many related areas, such as course recommendation on MOOC platforms and TV program recommendation. For example, Zhang et al. [8] developed a Hierarchical Reinforcement Learning (HRL) framework, HRL-NAIS, to alleviate the impact of noisy user behaviors in the course recommendation task, where a hierarchical agent is developed to filter the noisy actions that might mislead the recommendation algorithm. However, their work is developed for course recommendation on MOOC platforms rather than book recommendation in digital libraries, and the two tasks differ significantly. First, there are a huge number of books in the library: there are far more kinds of books in book recommendation than kinds of courses in course recommendation. Second, the average number of books borrowed by each user in the digital library is small, only about half of the average number of subscriptions per user in course recommendation. This means that the data in book recommendation is sparser: each book is borrowed only a few times on average, while each course may be subscribed to hundreds of times on average.
To overcome the above challenges, we introduce HRL into the book recommendation task, and further propose a clustering-based HRL to deal with the sparsity of users' interactions. More specifically, to filter out the noisy data in users' borrowing sequences, we train a hierarchical reinforcement network to select the books that help the recommender most. To alleviate the sparsity of users' interactions, we cluster the features of the books before feeding the book data into the hierarchical reinforcement network, so that the network can better identify the noisy books.
The main contributions of this work are listed as follows:
1. We introduce HRL into the book recommendation task in the digital library, where a basic book recommender is first pre-trained, and then a hierarchical agent is devised to filter out the interactions that might mislead this recommender.
2. To reduce the impact of the sparsity issue, we further enhance HRL with a clustering-based strategy, where a clustering step is incorporated between the pre-trained network and the hierarchical reinforcement network to alleviate the sparsity of the book borrowing data.
3. We conduct extensive experiments on two real-world datasets, and the experimental results demonstrate the superiority of our solution over several state-of-the-art recommendation methods.

Related Works
This section reviews the related works from the following aspects, that is, book recommendation methods and reinforcement learning-based recommendation methods.

Book Recommendation Methods
Book recommendation, as an important branch of recommender systems, has been widely studied in recent years. Its target is to recommend suitable books to the users who are interested in them. Existing recommendation methods can be classified into collaborative filtering-based methods and general machine learning-based methods. For example, Ansari et al. [9] studied the advantages of content-based and collaborative filtering methods, proposed a preference model used in marketing to provide a good choice, and described a Bayesian preference model. Ziegler et al. [10] proposed topic diversification, a method that aims to balance and diversify personalized recommendation lists to reflect the complete range of users' interests. Konstas et al. [11] created a collaborative recommendation system that can effectively adapt to the personal information needs of each user. They adopted a general framework of random walks with restarts to provide a more natural and effective way to represent social networks. Robillard et al. [12] proposed a software engineering recommendation system that can help developers in various activities, from reusing code to writing effective error reports. Smyth et al. [13] proposed a case-based recommendation system based on a well-defined feature set (such as price, color, and brand). These representations enable case-based recommenders to make judgments on the similarity of products to improve the quality of their recommendations. Fu et al. [14] captured the user's navigation history and applied data mining techniques to discover the hidden knowledge contained in the history. This knowledge is then used to suggest potentially interesting web pages to the user. Drineas et al. [15] put forward the concept of a competitive recommendation system. They reduced the problem of obtaining competitiveness to a matrix reconstruction problem and proposed a competitive matrix reconstruction scheme.
The above methods solve the recommendation problem based on collaborative filtering. However, collaborative filtering-based methods have certain limitations: they tend to recommend popular items, so they are less exploratory. In addition, they cannot perform well when the data is sparse. Therefore, with the popularity of machine learning, more and more machine learning methods are used in recommendation systems.
Besides traditional collaborative filtering methods, many studies have also focused on general machine learning-based methods. For example, Sabitha et al. [16] proposed a book recommendation method based on user NN. To optimize collaborative filtering, Goel et al. [17] proposed a nature-inspired bee algorithm, in which natural language processing is applied. Mikawa et al. [18] used the Support Vector Machine (SVM) algorithm in book recommendation. SVM is a kernel-based technique used for binary classification. Their model recognizes the gender of the mobile person and recommends magazines to the person based on gender, and it shows a high recommendation speed. Neural network models are inspired by the human brain and consist of connected nodes (neurons). This method is used in the model built by Liu et al. [19], which shows that blending with a neural network outperforms blending with linear regression. Maneewongvatana et al. [20] used k-means clustering to recommend books in university libraries. They went through the library's borrowing history and, after cleaning the data, assigned it to different clusters based on topic similarity, and then applied association rule mining to recommend books. Yang et al. [21] used text mining technology to establish a recommendation model for book purchases. They used three modules (keyword density vocabulary, keyword sequence vocabulary, and keyword-book mapping) to extract keywords, match them with keywords in the book library to obtain candidate books, and generate recommendation lists. Tewari et al. [22] proposed a method of recommending books to readers that combines content filtering, collaborative filtering, and association rule mining to generate effective suggestions. Sohail et al. [23] proposed a recommendation technique based on opinion mining. According to customer needs and the comments collected from customers, the article classifies the functions of books and analyzes them based on several features that have been classified and commented on by users. Weights are assigned to the classified features according to their importance and purpose, and grades are given accordingly. Kanetkar et al. [24] proposed a model of a Web-based personalized hybrid book recommender system, which uses various aspects beyond conventional collaborative and content-based filtering to provide recommendations. Vaz et al. [25] proposed a hybrid recommendation method that combines two item-based collaborative filtering algorithms to predict users' favorite books and authors. The author predictions are expanded into a book list, which is then aggregated with the book predictions. Finally, the resulting book list is used to generate the top-n book recommendations. Machine learning methods can use the characteristics of books to classify them, which mitigates the over-recommendation of popular items in collaborative filtering. Clustering methods can also alleviate the data sparsity problem in recommendation systems. However, the above methods do not consider the noisy data in users' book borrowing records, which may mislead the recommendation methods. We need a new method that can remove the noise in the sequence.

Deep Reinforcement Learning
Deep reinforcement learning is a machine learning method that combines reinforcement learning and deep learning. Reinforcement learning maps the current state to candidate actions and selects the best action based on the return of these actions. Deep learning extracts observation information from the environment. Combining the two brings together the decision-making ability of reinforcement learning and the perception ability of deep learning, and provides ideas for complex perception and decision-making problems.
Mnih et al. [26] proposed the DQN algorithm, which combined deep learning with reinforcement learning for the first time. DQN uses a deep learning model to compute the Q value in reinforcement learning. In the following years, Hausknecht and Stone [27] proposed the Deep Recurrent Q-Learning algorithm, Wang et al. [28] proposed the Dueling DQN algorithm, van Hasselt et al. [29] proposed the Double DQN algorithm, Schaul et al. [30] proposed the Prioritized Experience Replay algorithm, and Hessel et al. [31] proposed the Rainbow DQN algorithm. All of these algorithms are improvements on DQN.
DQN-based algorithms are value-based. Another class of deep reinforcement learning algorithms is policy-based. The A3C algorithm proposed by Mnih et al. [32] is based on the policy gradient algorithm. A3C contains an actor network and a critic network, and selects the optimal action for training through the two networks.
The hierarchical reinforcement learning model is a reinforcement learning algorithm that can handle complex environments. It divides the entire task into high-level tasks and low-level tasks. The high-level tasks plan long-term benefits, and the low-level tasks directly interact with the environment. Several works have studied hierarchical reinforcement learning: Vezhnevets et al. [33] proposed the STRAW algorithm and then the Feudal Networks algorithm [34], and Nachum et al. [35] proposed the HIRO algorithm. The hierarchical reinforcement learning model can effectively deal with sparse rewards, so it can be effectively used in recommendation systems.

Reinforcement Learning-Based Recommendation Methods
The traditional recommendation task analyzes the user's actions and recommends similar content to the user. This recommendation has the following problems: It is unable to capture the dynamic changes of user interest; it can only calculate the user's current revenue; it is difficult to accumulate long-term revenue. One way to solve those problems is to introduce reinforcement learning into the recommendation task.
Reinforcement learning learns a mapping from states to behaviors that maximizes the numerical return. In other words, without being told which behavior to take, the learner must keep trying in order to find out which behavior produces the greatest reward. Theocharous et al. [36] proposed the use of reinforcement learning in advertising recommendation and pointed out its benefit in recommendation systems: it considers the long-term effects of actions. The paper also illustrates the challenges of reinforcement learning in recommendation systems: how to compute good strategies and how to use historical data to evaluate solutions. Wang et al. [37] applied reinforcement learning to music recommendation. Their method balances the need to explore user preferences against the need to exploit this information to make recommendations. To learn user preferences, it uses a Bayesian model that considers both the audio content and the novelty of recommendations, and treats music recommendation and playlist generation as a unified model. Zheng et al. [38] applied reinforcement learning to news recommendation and proposed a recommendation framework based on deep Q-learning, which addresses three problems of existing news recommendation methods: they only model the current reward; they rarely consider feedback other than click or no click; and they tend to keep recommending similar news. Wang et al. [39] studied dynamic treatment recommendation based on large-scale electronic health records. The paper combines the benefits of supervised learning and reinforcement learning and proposes supervised reinforcement learning with recurrent neural networks. It uses an actor-critic framework to deal with the complex relationships among multiple drugs, diseases, and individual characteristics, which can provide effective advice to doctors. Zhao et al. [40] focused on dealing with negative feedback in reinforcement learning-based recommendation, developed a deep recommendation framework, and combined negative feedback with positive feedback. Zhao et al. [41] combined reinforcement learning, recurrent neural networks, and convolutional neural networks to design a model for e-commerce recommendation and improve its accuracy. Chen et al. [42] proposed two techniques to alleviate the problem of unstable reward estimation in dynamic environments. The article addresses the high variance and bias of reward estimation in terms of samples and rewards, respectively, and integrates these two techniques with deep reinforcement learning. Rohde et al. [43] introduced RecoGym, a reinforcement learning environment defined by a model of user traffic in e-commerce and of users' responses to recommendations on the website. It can be seen from the above research that applying reinforcement learning to the recommendation task makes it possible to consider the long-term impact of each user action, to avoid monotonous recommendation content, and to model the dynamic changes of user interest, which traditional recommendation methods cannot do.
The recommendation system model based on reinforcement learning can better capture the dynamic changes of user interests, and the noise in the sequence can be processed by introducing reinforcement learning. However, because book data is relatively sparse, and these works did not consider the problem of data sparseness, we need to establish a recommendation system model based on reinforcement learning suitable for book data.

Our Recommendation Framework
This section first defines the book recommendation task and then introduces the HRL framework used to filter the noisy data existing in users' historical interactions. Finally, our clustering-based reinforcement learning method is introduced in detail.

Task Definition
The task of book recommendation for a user can be formalized as next-item prediction based on his/her historical borrowing records: the borrowing sequence before time $t$ is given, and we aim at recommending the most relevant books that will be borrowed by the user at time $t + 1$.
Let $U = \{u_1, \ldots, u_m\}$ be the set of users and $B = \{b_1, \ldots, b_n\}$ be the set of books, where $m$ is the number of users and $n$ is the number of books. For each user $u \in U$, his/her borrowing sequence $E^u := (b^u_1, \ldots, b^u_j, \ldots, b^u_t)$ from a digital library is given, where $t$ denotes the time stamp of the corresponding record. Then, the book recommendation task can be formulated as predicting the next book that the user is interested in.
Input: The set of users $U = \{u_1, \ldots, u_m\}$, the set of books $B = \{b_1, \ldots, b_n\}$, and each user's borrowing sequence $E^u$.
Output: The next book $b^u_{t+1}$ that the user $u$ is most interested in.

Overview of Our CHRL Method
To address the data noise and data sparsity challenges in book recommendation, we propose a Clustering-based Hierarchical Reinforcement Learning Network (CHRL), whose main idea is to leverage clustering-based reinforcement learning to filter out noisy interactions that may mislead the recommendation algorithm. Figure 1 shows the architecture of our model, which consists of three components: a basic recommender, a profile reviser, and a clustering component. More specifically, the basic recommender provides the foundation model of book recommendation, which models the preferences of users and items with an attention-based neural network. As in HRL [8], we exploit NAIS [44] as our basic model. Note that other sequence modeling techniques can also be utilized. The profile reviser aims to filter out noisy interactions that may mislead the basic recommender; for this purpose, a clustering-based hierarchical reinforcement learning network is developed. Different from HRL [8], we develop a clustering component that alleviates the data sparsity issue by clustering the embeddings of all the books into categories. The workflow of our CHRL model is shown in Figure 2. First, the basic recommender is pre-trained on the borrowing sequences of all users. After pre-training, the clustering component clusters the learned embeddings of all the books, which will be utilized in the sequence modification component. Next, the sequence modification component determines whether there is noise in the borrowing sequence of each user and filters out the noisy interactions. Finally, the revised sequences are sent back to the basic recommender pre-trained in the first stage, and the pre-trained basic recommender and HRL are jointly trained to obtain the optimal recommendation results.
Our CHRL uses HRL to filter out the noisy data in the sequence, and resorts to the clustering mechanism to alleviate the data sparsity issue of book recommendation.

Basic Recommender
To model user's borrowing sequences, we follow the work in HRL [8] and exploit NAIS [44] as our basic sequence modeling method. The NAIS algorithm can effectively solve the learning problem based on item-to-item collaborative filtering. It represents an item as an embedding vector and models the similarity between two items as the inner product of their embedding vectors, so a sequence can be expressed as a set of embedding vectors, and the results can be predicted using the embeddings of each item. On this basis, NAIS used the attention network to learn the different importance of interactive items.
As NAIS does, to characterize a user's preference according to his/her borrowing sequence $E^u$, we represent each historical book as a real-valued low-dimensional embedding vector $p^u_t$. Then, we can express the student's borrowing sequence as $p^u_1, \ldots, p^u_t$, and denote a target book $b_i$ via an embedding vector $p_i$. Moreover, we represent the student's borrowing sequence by $q_u$, and aim to calculate the probability of recommending book $b_i$ to the student. When computing the sequence representation $q_u$, we use the attention mechanism and attach an attention weight to each element $p^u_t$ of the sequence, which indicates more clearly the student's different degree of interest in each book.
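The attention-weighted scoring described above can be sketched as follows. This is only an illustration under assumptions: the text does not give the attention formula, so the one-hidden-layer attention network (parameters `W`, `b`, `h`), the element-wise product as the interaction, and the smoothing exponent `beta` on the softmax denominator are assumed here and may differ from the exact NAIS configuration used in the experiments.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_score(history, target, W, b, h, beta=0.5):
    """Return (probability, weights) for recommending `target` given `history`.

    history: embedding vectors p_1..p_t of the borrowed books
    target:  embedding vector p_i of the candidate book
    W, b, h: hypothetical parameters of a one-hidden-layer attention network
    beta:    assumed smoothing exponent on the softmax denominator (0 < beta <= 1)
    """
    logits = []
    for p_j in history:
        # Interaction between a borrowed book and the target book
        interaction = [x * y for x, y in zip(p_j, target)]
        hidden = [max(0.0, dot(row, interaction) + b_k) for row, b_k in zip(W, b)]
        logits.append(dot(h, hidden))

    # Smoothed softmax: with beta = 1 this is the ordinary softmax
    exps = [math.exp(v) for v in logits]
    denom = sum(exps) ** beta
    weights = [e / denom for e in exps]

    # Attention-weighted sequence representation q_u
    dim = len(target)
    q_u = [sum(w * p_j[k] for w, p_j in zip(weights, history)) for k in range(dim)]

    # Inner product with the target embedding, squashed to a probability
    probability = 1.0 / (1.0 + math.exp(-dot(target, q_u)))
    return probability, weights
```

Books in the history that resemble the target book receive larger weights, so they dominate the sequence representation and the resulting probability.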

The Sequence Modification Component
To filter out the noisy interactions from users' borrowing sequences, we resort to the reinforcement learning technique, which can transform the process of modifying the borrowing sequence into a hierarchical Markov decision process. Motivated by HRL [8], we treat the process of modifying the student's borrowing sequence as a hierarchical Markov decision process $M$, and divide it into two steps, a high-level task $M^h$ and a low-level task $M^l$. The high-level task $M^h$ determines whether the entire sequence needs to be modified. If it does, the agent enters the low-level task $M^l$, which determines whether each element in the sequence should be deleted. After modifying the sequence, the agent receives a delayed reward from the environment according to the modified sequence.
Clustering component. While HRL [8] has obtained excellent results in sequential recommendation, it is not suitable for book recommendation due to the data sparsity issue. In a digital library, the borrowing histories of students are very sparse, that is, most users borrow only a few books. Under this condition, the feature distribution of the data is relatively scattered, and the high-level and low-level tasks of hierarchical reinforcement learning tend to delete all interactions, causing the basic recommender to fail to make correct recommendations. Therefore, after the pre-training of the basic recommender, we devise a clustering component that clusters the embeddings of all the books used in the HRL component to make it more stable.
After the basic model is trained, each book gets its embedding. As our goal is to allow the sequence modification component to identify the noisy books more accurately, rather than deleting all the elements in the borrowing sequence, we cluster the embeddings of the books. After clustering, the embedding $p$ of each book is replaced by the embedding $p_c$ of the center point of its category. In this paper, the embedding of the book with the smallest serial number in each category is taken as the center point of the category, as shown in Figure 3. We denote the borrowing sequence $E^u$ after clustering as $E^u_c$. The sequence modification component uses the clustered embeddings for training and modifies the elements in the borrowing sequence. In the following, we describe the key definitions used in our reinforcement learning component: environment, state, action, policy, and reward [8].
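The replacement of each embedding by its category's center point can be sketched as below. The cluster labels are assumed to come from any clustering algorithm run on the pre-trained embeddings (the text does not name one here), and the function name is hypothetical.

```python
def assign_cluster_representatives(embeddings, labels):
    """Replace each book embedding with that of the smallest-index book in its cluster.

    embeddings: list of embedding vectors, indexed by book serial number
    labels:     cluster label of each book (assumed output of a clustering step)
    """
    representative = {}
    for book_id, label in enumerate(labels):
        # Books are visited in increasing serial number, so the first book
        # seen for a label is the one with the smallest serial number.
        representative.setdefault(label, book_id)
    return [list(embeddings[representative[label]]) for label in labels]
```

After this step all books in one category share the same vector, so the agent sees far fewer distinct embeddings than there are books.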
Environment: The environment is regarded as the dataset of books and the pre-trained basic recommender.
State: The high-level task determines whether the entire borrowing sequence needs to be modified, and the low-level task determines whether each borrowing record in the sequence needs to be deleted. The state of the low-level task is defined as the cosine similarity between the current borrowing sequence and the target book's embedding vector. The state of the high-level task is defined as the average cosine similarity and the average element-wise product between the embedding vector of each borrowed record in the borrowing sequence and the embedding vector of the target book. In addition, the probability with which the basic recommender recommends the target book given the borrowing sequence is used to reflect the credibility of the sequence. The lower this credibility, the more likely the borrowing sequence should be modified.
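A minimal sketch of the high-level state features follows. How the features are concatenated into one vector is an assumption made for illustration, and `probability` stands for the basic recommender's output for the target book.

```python
import math


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def high_level_state(sequence, target, probability):
    """Average cosine similarity, average element-wise product, and recommender probability.

    sequence:    embedding vectors of the borrowed books
    target:      embedding vector of the target book
    probability: recommender's probability for the target book (assumed given)
    """
    n = len(sequence)
    avg_cos = sum(cosine(p, target) for p in sequence) / n
    avg_prod = [sum(p[k] * target[k] for p in sequence) / n for k in range(len(target))]
    # Assumed layout: [avg cosine, avg element-wise product..., probability]
    return [avg_cos] + avg_prod + [probability]
```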
Action and Policy: In the high-level task, the action $a^h \in \{0, 1\}$ is a binary value that indicates whether to enter the low-level task and modify the borrowing sequence. The action $a^l_t \in \{0, 1\}$ in the low-level task is also a binary value, indicating whether to delete the corresponding borrowing record. We perform a low-level action $a^l_t$ according to the following policy function:

$$H^l_t = \mathrm{ReLU}(W^l_1 s^l_t + b^l), \qquad \pi(s^l_t, a^l_t) = P(a^l_t \mid s^l_t, \Theta^l) = a^l_t \, \sigma(W^l_2 H^l_t) + (1 - a^l_t)\,\bigl(1 - \sigma(W^l_2 H^l_t)\bigr),$$

where $W^l_1 \in \mathbb{R}^{d^l_2 \times d^l_1}$, $W^l_2 \in \mathbb{R}^{1 \times d^l_2}$, and $b^l \in \mathbb{R}^{d^l_2}$ are the parameters to be learned, $d^l_1$ is the number of state features, and $d^l_2$ is the dimension of the hidden layer. $H^l_t$ is the embedding of the input state. We denote the parameters to be learned as $\Theta^l = \{W^l_1, W^l_2, b^l\}$. $\sigma$ is the sigmoid function, which converts the input into a probability. For the high-level task, the policy function is similar to that of the low-level task, with the parameters changed to $\Theta^h = \{W^h_1, W^h_2, b^h\}$.

Reward: The reward indicates whether the performed actions are reasonable. For the low-level task, every action receives a delayed reward that is only given after the last action of the process, so the reward of each low-level action is defined as:

$$R(a^l_t, s^l_t) = \begin{cases} \log p(\hat{E}^u_c, c_i) - \log p(E^u_c, c_i), & t = t_u, \\ 0, & \text{otherwise}, \end{cases}$$

where $p(E^u_c, c_i)$ is an abbreviation of $p(y = 1 \mid E^u_c, c_i)$, $\hat{E}^u_c$ is the modified sequence after clustering, $c_i$ denotes the target book, and $t_u$ is the index of the last action. While performing the low-level task, the agent may delete all elements in the sequence. In this case, the model randomly selects an element from the sequence as the modified sequence. For the high-level task, if the agent chooses to modify the sequence, the reward of the high-level task is the same as the reward of the corresponding low-level task; if it chooses not to modify the sequence, the reward is zero.
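The binary policy described above can be sketched as a small two-layer network. The ReLU hidden layer and the shapes of `W1`, `b`, and `W2` are assumptions made for this illustration; which action value stands for deletion is not fixed by the code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def action_probability(state, action, W1, b, W2):
    """P(action | state) for a binary action under a two-layer policy.

    W1: d2 x d1 weight matrix (one row per hidden unit), b: d2 biases,
    W2: d2 output weights. All three are hypothetical parameter containers.
    """
    # Hidden embedding of the state; ReLU is assumed here
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, state)) + b_k)
        for row, b_k in zip(W1, b)
    ]
    p_one = sigmoid(sum(w * h for w, h in zip(W2, hidden)))
    # Bernoulli probability of the chosen action
    return action * p_one + (1 - action) * (1.0 - p_one)


def sample_action(state, W1, b, W2, rng=random):
    """Draw a binary action from the policy."""
    return 1 if rng.random() < action_probability(state, 1, W1, b, W2) else 0
```

During training the agent samples actions in this way, whereas at test time one could instead pick the more probable action; that choice is left open here.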
In addition, the model defines an internal reward $G$ within the low-level task, whose purpose is to make the agent tend to choose the books most relevant to the target book. $G$ is calculated as follows: the model calculates the average cosine similarity between each sequence element and the target book after and before the sequence is revised, and then uses their difference as the internal reward $G$.
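The internal reward can be written down directly from this description. The sketch below assumes that the original and revised sequences are given as lists of embedding vectors and that the revised sequence is non-empty; the function names are illustrative.

```python
import math


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def average_similarity(sequence, target):
    return sum(cosine(p, target) for p in sequence) / len(sequence)


def internal_reward(original, revised, target):
    """G = average similarity to the target after revision minus before revision."""
    return average_similarity(revised, target) - average_similarity(original, target)
```

A positive value means the revision removed books that were, on average, less similar to the target book than the ones that were kept.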
Objective Function: As HRL does, we aim at finding the optimal parameters of the policy function to maximize the expected reward:

$$\Theta^* = \arg\max_{\Theta} \sum_{\tau} P_{\Theta}(\tau; \Theta)\, R(\tau),$$

where $\Theta$ represents either $\Theta^h$ or $\Theta^l$, $\tau$ is a sequence of the sampled actions and the transited states, $P_{\Theta}(\tau; \Theta)$ denotes the corresponding sampling probability, and $R(\tau)$ is the reward for the sampled sequence $\tau$. The sampled sequence $\tau$ can be $\{s^l_1, a^l_1, s^l_2, \ldots\}$ for the low-level task. For the high-level task, the reward $R(a^k, s^k)$ is assigned as $R(a^k_{t_u}, s^k_{t_u})$ when $a^k = 1$ and $0$ otherwise.
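One common way to optimize an objective of this form is a Monte Carlo policy gradient (REINFORCE). The gradient is not shown in the text, so the following is only a sketch under simplifying assumptions: a single-layer logistic policy $\pi(a = 1 \mid s) = \sigma(w \cdot s)$ stands in for the two-layer policy, and every step of a sampled trajectory is credited with the same terminal reward.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_gradient(trajectories, weights):
    """Monte Carlo policy-gradient estimate for a logistic policy.

    trajectories: list of (steps, reward) pairs, where steps is a list of
                  (state, action) pairs and reward is the terminal reward
                  credited to every step (an assumption of this sketch)
    weights:      parameter vector w of the simplified policy
    """
    grad = [0.0] * len(weights)
    for steps, reward in trajectories:
        for state, action in steps:
            p_one = sigmoid(sum(w * s for w, s in zip(weights, state)))
            # For a Bernoulli-sigmoid policy, d/dw log pi(a|s) = (a - p_one) * s
            for k, s_k in enumerate(state):
                grad[k] += (action - p_one) * s_k * reward
    return [g / len(trajectories) for g in grad]


def ascent_step(weights, grad, learning_rate):
    """Gradient ascent, since the expected reward is maximized."""
    return [w + learning_rate * g for w, g in zip(weights, grad)]
```

Averaging over several sampled trajectories keeps the estimate unbiased while reducing its variance.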

Joint Training
Through hierarchical reinforcement learning, we obtain the revised sequences of students' borrowing histories. As our goal is to leverage these modified interactions to fine-tune the basic recommendation model and obtain more accurate recommendation results, we send the revised sequences back to the basic recommender, and re-train the sequence modification component and the basic recommender jointly. Algorithm 1 shows the joint training strategy of our proposed method. Within it, the agent is updated as follows:
Sample a sequence of low-level actions $a^l_1, a^l_2, \ldots, a^l_t$ with $\Theta^l$
Compute $R(a^l_t, s^l_t)$ and $G(a^l_t, s^l_t)$
Compute gradients
Update $\Theta$
5. Joint training: repeat steps 2, 3, and 4 to obtain the final output.

Experiments
In this section, we first introduce the datasets we use, then describe the experimental environment and hyperparameters in our model, and finally give a brief introduction to our baseline.

Datasets and Experimental Settings
In the experiments, we use two real-world datasets to evaluate our proposed method: school borrowing data from one digital library and Goodbooks (http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/, accessed on 27 April 2021).
The school borrowing data contains 533,010 borrowing records. To ensure recommendation quality, we filter out students with fewer than 4 borrowing records, which leaves 518,605 records forming 22,572 student borrowing sequences over 124,468 books. The statistics of this dataset are shown in Table 1.
As the data show, each student borrows about 20 books on average, and most students borrow fewer than 20 books, whereas each book is borrowed only about 4 times on average, which means that the book data is very sparse. A common recommendation model applied to such sparse data has difficulty accurately finding books suitable for students. Our model therefore needs to treat sparse data specially, and we use a clustering method to process the library data.
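The sparsity figures can be checked with a short calculation from the counts reported above. This is a sketch under one assumption: each of the 518,605 records is treated as one student-book borrowing event, so repeated borrowings of the same book are not deduplicated. The variable names are illustrative.

```python
# Counts reported for the school borrowing data after filtering.
records = 518_605
students = 22_572
books = 124_468

per_student = records / students         # average borrowing records per student
per_book = records / books               # average borrowing records per book
density = records / (students * books)   # filled fraction of the student-book matrix
```

Under this reading the per-student average comes out near 23 and the per-book average near 4.2, while the interaction matrix is filled to well under 0.1%.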
The other real-world dataset used in this work is Goodbooks-10k, a book recommendation dataset containing 912,705 reading records. After the same data processing, we obtain 898,195 reading records covering 10,000 books and 41,314 users. The statistics of this dataset are shown in Table 1.
Data pre-processing. Since some users have borrowed only a few books, and so few records cannot form a sequence, we keep only users with at least 4 borrowed books in both datasets. For training and testing, we use the last item of each sequence as test data and the remaining items as training data.
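The filtering and leave-one-out split described above can be sketched as follows. The record layout `(user_id, book_id, timestamp)`, the function names, and the toy data are assumptions made for illustration; the actual storage format of the datasets is not given in the text.

```python
from collections import defaultdict

MIN_RECORDS = 4  # users with fewer records are dropped, as stated in the text

def build_sequences(records):
    """Group (user_id, book_id, timestamp) records into time-ordered sequences."""
    by_user = defaultdict(list)
    for user, book, timestamp in records:
        by_user[user].append((timestamp, book))
    sequences = {}
    for user, events in by_user.items():
        if len(events) >= MIN_RECORDS:
            events.sort(key=lambda e: e[0])  # order by timestamp only
            sequences[user] = [book for _, book in events]
    return sequences

def leave_one_out(sequences):
    """Hold out the last item of every sequence as the test target."""
    train = {user: seq[:-1] for user, seq in sequences.items()}
    test = {user: seq[-1] for user, seq in sequences.items()}
    return train, test

# Toy records: user "u1" has four borrowings, user "u2" only two.
toy_records = [
    ("u1", "b3", 3), ("u1", "b1", 1), ("u1", "b2", 2), ("u1", "b4", 4),
    ("u2", "b1", 1), ("u2", "b5", 2),
]
train, test = leave_one_out(build_sequences(toy_records))
```

Here `u2` is filtered out, `u1` trains on its first three books in time order, and its fourth book is the held-out test item.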

Implementation Details
We implement CHRL in Python with the TensorFlow library on an NVIDIA RTX 2080 GPU. The basic recommender is the NAIS model. We train the recommender for 20 epochs with a learning rate of 0.02 and an embedding size of 16. The number of categories in the clustering model is set to 2000. In hierarchical reinforcement learning pre-training, the agent is trained for 50 epochs with a learning rate of 0.05. Finally, for joint training we set the learning rate to 0.05 and the delayed coefficient to 0.0005. The specific parameter settings for the school borrowing data are shown in Table 2.
Goodbooks data and school borrowing data show similar characteristics. We reduce the number of categories to 1000 because Goodbooks-10k contains fewer books than the school data. The other hyperparameters are the same as in the experiment on school data. The specific parameter settings for Goodbooks are shown in Table 2.
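For reference, the settings listed above can be collected in one place. The dictionary layout and key names below are hypothetical and do not come from the authors' code; only the values are taken from the text.

```python
# Hyperparameters reported for the school borrowing data (key names are illustrative).
SCHOOL_CONFIG = {
    "recommender_epochs": 20,
    "recommender_learning_rate": 0.02,
    "embedding_size": 16,
    "num_categories": 2000,
    "agent_pretrain_epochs": 50,
    "agent_pretrain_learning_rate": 0.05,
    "joint_learning_rate": 0.05,
    "joint_delayed_coefficient": 0.0005,
}

# Goodbooks uses the same settings except for the number of categories.
GOODBOOKS_CONFIG = {**SCHOOL_CONFIG, "num_categories": 1000}
```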

Baseline Methods
To demonstrate the superiority of our proposed solution, we compare our method with the following baselines:
CF [45]: A collaborative filtering algorithm, which uses the preferences of a group of users with similar interests and common experience to recommend information that a user is likely to be interested in.
FISM [46]: An item-to-item collaborative filtering algorithm that does not use an attention mechanism to weight historical data.
NAIS [44]: An item-to-item collaborative filtering algorithm that uses an attention mechanism to weight historical data. It serves as the basic recommendation model in this paper.
light-GCN [47]: This algorithm learns user and item embeddings by linearly propagating them on the user-item interaction graph, and uses the weighted sum of the embeddings learned at all layers as the final embedding.
HRL [8]: An algorithm that jointly trains a basic recommendation model and a hierarchical reinforcement learning model.
In our experiments, we use the Hit Ratio of the top K items (HR@K) and the Normalized Discounted Cumulative Gain of the top K items (NDCG@K) [8] as our evaluation metrics. More specifically, HR@K is a recall-based metric that measures the percentage of test instances successfully recommended within the top K, and NDCG@K is an accuracy-based metric that accounts for the predicted position of the instance. In this article, we set K to 5 and 10, compute all metrics over 1 positive instance and 99 negative instances, and report the average score over all user sequences.
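With a single positive instance per test case, both metrics reduce to simple functions of the positive item's rank among the candidates. The sketch below illustrates this standard formulation; the function names and the tie-breaking rule (ties count against the positive item) are assumptions, and the scores are toy values.

```python
import math

def rank_of_positive(positive_score, negative_scores):
    """0-based rank of the positive item; ties count against it (assumption)."""
    return sum(1 for s in negative_scores if s >= positive_score)

def hr_at_k(rank, k):
    """1 if the positive item appears in the top K, else 0."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

def average(values):
    return sum(values) / len(values)

# Toy case: 99 negatives, two of which outscore the positive item.
negatives = [0.95, 0.92] + [0.1] * 97
rank = rank_of_positive(0.9, negatives)
hr5 = hr_at_k(rank, 5)
ndcg5 = ndcg_at_k(rank, 5)

# Scores are averaged over users; the second user's positive item ranks 7th (0-based 6).
mean_hr5 = average([hr_at_k(r, 5) for r in (rank, 6)])
```

In the toy case the positive item is ranked third, so it counts as a hit at K = 5 and receives an NDCG of 1 / log2(4) = 0.5.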

Experimental Results and Analysis
In this section, we list the experimental results of our model and baseline methods, then analyze the experimental results, including the analysis of comparative experiments and the analysis of each component of the model.

Experimental Results
The experimental results are shown in Tables 3 and 4, from which we make the following observations:

1. On the school borrowing data, Table 3 shows that our model outperforms the baseline methods in prediction performance. Compared with the collaborative filtering algorithm, our algorithm uses feature vectors to classify books, which lets it better predict what kinds of books students like and recommend those books to them. Collaborative filtering algorithms cannot classify books, so they struggle to make recommendations on sparse data. On the Goodbooks data, the recommendation results of our model are significantly better than those of the other models.

2. FISM and NAIS are both item-to-item collaborative filtering algorithms. They use a deep learning method, and NAIS adds an attention mechanism to FISM, but sparse data strongly affects the training of both algorithms, resulting in poorer final results. The results of light-GCN are slightly better but still not ideal.

3. The HRL algorithm combines reinforcement learning with deep learning, removes noise in the sequence, and improves the model's ability to handle sparse data to a certain extent. Our CHRL algorithm deals with both sparse data and noise, and its results are better than those of the other algorithms. On the Goodbooks data, HRL does not achieve good results, whereas our improved model CHRL outperforms the other models, which supports the effectiveness of our model.

Model Analysis
Analysis of clustering component. The main purpose of using NAIS in the basic recommender is to analyze the characteristics of books and classify them more accurately. The experimental results show that using NAIS alone for book recommendation gives poor results, because the sequences contain much noise and the model has difficulty locating the user's interest accurately. However, the data can be classified accurately after NAIS processing, so that the clustering model can better process the book information. Using NAIS as the basic recommender is therefore an important step in CHRL.
For sparse book data, HRL is not stable enough. We ran ten experiments with HRL and CHRL on the library data; Table 5 shows the average and standard deviation of the ten results. HRL achieved good results in only two of the ten runs, and its average over the ten runs was also poor. This is because the reinforcement learning model deleted too many items from the sparse sequences, leaving too little data for the model to predict the next item correctly. By contrast, our model produced better results across the ten runs, indicating that it is also more stable. This is because clustering sparse data makes reinforcement learning more stable. This result supports the effectiveness of the clustering component in our model.
Analysis of sequence modification component. Comparing the experimental results of NAIS and our model shows that hierarchical reinforcement learning is effective for book recommendation. Hierarchical reinforcement learning modifies the student sequence to remove the noise in it, a step that is very important for book recommendation with sparse data. Table 6 shows the recommendation results after basic model training, after the sequence modification component, and after joint training. The sequence modification component greatly improves the recommendation results. This is because the clustering module accurately classifies the book categories, and hierarchical reinforcement learning captures the user's interest well and can accurately find the books that the user may like among many books.
Analysis of joint training component. The last step of the model is joint training. Joint training feeds the modified sequences to the basic model for retraining, then clusters the output of the basic model and feeds it to hierarchical reinforcement learning, which in turn modifies the user borrowing sequences. Joint training is thus a cyclical process, and the model reaches its final result after repeated training. The results show that joint training is also an indispensable part of the entire model.

Analysis of Hyperparameters
For the analysis of hyperparameters, we focused on the number of categories in the clustering component and the learning rate in each component.
Categories in the clustering component. The hyperparameter of the clustering model is the number of categories. If it is set too low, many weakly related books are grouped into one category, and the hierarchical reinforcement learning model has difficulty distinguishing them, which affects the final result. If it is set too high, the classification ability of the clustering model is weak, and the hierarchical reinforcement learning model deletes too many items, which easily leads to poor final results. We ran many parameter experiments, whose results are shown in Table 7 and support this conclusion. We finally set the number of categories to 2000 for the school library data and to 1000 for the Goodbooks data, which gives good results.
Learning rate. In our model, both the basic model and the sequence modification component have a learning rate. For the basic model, after multiple experiments we set the learning rate to 0.02, which gave the fastest training. For the sequence modification component, we tested multiple values and found that a high learning rate makes the model unstable, while a low learning rate makes training too slow. We finally set the learning rate of the sequence modification component to 0.05, which trains this module stably and quickly.
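The effect of the number of categories can be illustrated with a small clustering sweep. The sketch below assumes k-means over book embedding vectors, which this section does not confirm as the authors' clustering method. The evenly spaced centroid initialization is a simplification chosen for determinism, and the function names and toy embeddings are hypothetical.

```python
from collections import Counter

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; initial centroids are evenly spaced input points."""
    n = len(points)
    centroids = [list(points[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy book embeddings: two well-separated groups of three books each.
embeddings = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
labels = kmeans(embeddings, 2)

# Size of the largest category for several category counts.
largest = {k: max(Counter(kmeans(embeddings, k)).values()) for k in (1, 2, 3)}
```

With one category all six books are lumped together, while two categories recover the two groups. Sweeping the category count on real embeddings and inspecting category sizes in this way is one possible first step before the full parameter experiments.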

Conclusions and Future Work
In this work, we address data sparsity and noise in book recommendation for school digital libraries. We proposed a hierarchical reinforcement learning-based method for the book recommendation task to tackle both challenges. More specifically, we used clustering to classify the data, which effectively alleviated the problem of sparse data in the school library environment. The experimental results show that our model improves significantly over other related methods: on HR@10 and NDCG@10 it improves by more than 10% over the compared methods, which demonstrates the effectiveness of the reinforcement learning-based recommendation method.
In the actual operation of a university digital library, the existing data must first be fed into the model for training; the model then recommends books to a user based on the user's information. After the user selects a book according to their own preferences, the model makes the next recommendation based on the user's updated sequence. Since our model relies on borrowing sequence data and a clustering model, applying it to areas other than book recommendation requires data with time information and category information.
We will try to use this model in other areas of the school. For example, we will use it to recommend courses when a student selects courses, or to recommend meals in a student restaurant. In the future, we will study recommendations that combine library data with other school data, such as course selection data, or add knowledge graph data about library books to the model to make recommendations more accurate.