1. Introduction
With the rapid development of information technology, interactive recommender systems (IRS) now play an important role in personalized services, helping us discover information and satisfy our needs. TikTok's personalized video recommendations (http://www.tiktok.com, accessed on 10 April 2023), Spotify's music recommendations (http://www.spotify.com, accessed on 10 April 2023), and Amazon's product recommendations (http://www.amazon.com, accessed on 10 April 2023) are a few examples of how such systems have become an indispensable part of our daily lives [1,2,3]. Unlike conventional recommender systems, an IRS suggests items based on user behavior and consecutively refines its recommendations based on user feedback [4,5]. Due to the interactive nature of this setting, an IRS must capture the dynamic preferences underlying user behavior and plan ahead to optimize long-term performance. Conventional recommendation methods such as matrix factorization, content-based filtering, and learning-to-rank treat the recommendation process as static and fail to capture these dynamic preferences. Furthermore, most conventional methods recommend items that achieve immediate user satisfaction while ignoring items that may lead to more profitable rewards in the future [6,7,8].
Recently, reinforcement learning (RL) [9] has shown great potential in modeling dynamic interaction behaviors and pursuing long-term rewards [10,11]. Naturally, as a promising approach, RL has been introduced into IRS to address these particular challenges [12,13,14,15]. Typically, an RL-based IRS provides recommendations interactively: at each step, the IRS recommends a series of items, the user browses them and provides feedback, and the IRS then refines its recommendation strategy and suggests new items based on that feedback. State modeling, which is usually based on users' historical behaviors, is an important part of the RL agent. Existing state representation works for IRS generally adopt unidirectional recurrent neural networks (RNN) to model the states, encoding each state with the hidden representation of the RNN model; such works can be found in [5,6,13,16,17]. However, real-world data are usually noisy and do not rigidly follow left-to-right order [18,19]. Long sequences of user historical records often contain items that are irrelevant to users' future choices, which we refer to as noisy patterns [20,21]. For example, consider the user's action sequence in Figure 1. Based on their historical choices, we can infer that the user is probably a science fiction fan. They may also watch some movies that are newly released or suggested by friends, as depicted at timesteps 4 and 5. In the IRS setting, the RL agent obtains ratings of 4 and 5 at timesteps 4 and 5 for these two irrelevant items, respectively. A unidirectional RNN state representation model may be confused by the two ratings and overlook the fact that the user is a science fiction fan. This situation is aggravated as the sequence grows longer and the noisy patterns vary. Some works adopt attention-based methods [4,22] to address this issue. For example, DRR-att [4] employs an attentive network to deal with noisy dependencies. However, DRR-att feeds the item vectors directly into the attentive network and ignores user feedback, which makes it difficult to handle long sequences. Therefore, it is essential to design a state encoder that can handle the long and noisy sequences found in real-world IRS.
Furthermore, adopting RL methods in recommendation scenarios with vast discrete action spaces is challenging, because an IRS often contains millions of items. Factoring the large discrete action space into a considerably smaller one is an obvious solution. For example, DDPG-KNN [23] maps the discrete action space to a low-dimensional continuous space and selects actions based on their similarities. Such methods are inefficient, as computing action similarities is time-consuming. An alternative strategy is to design a specific neural network that facilitates learning in large action spaces. For instance, DRN [24] and DEERS [6] adopt refined Deep Q-Networks (DQN) [25] to learn policies over large action spaces. However, DQN-based methods always involve a maximization over the actions, which is inefficient when the number of actions is large. Recently, TPGR (tree-structured policy gradient for recommendation) [5] was proposed, which organizes the policy into a tree structure and is efficient in both learning and decision making. However, TPGR becomes inefficient during evaluation, especially when calculating the top-K ranked items, a serious issue in recommender systems, since each action choice requires traversing the tree.
To effectively and efficiently handle the long and noisy dependencies that occur in real-world applications, this paper proposes an attention-based tree-policy recommendation (ATRec) method, which can capture user dynamics under long and noisy patterns. In addition, an efficient tree-structured policy network is devised that further improves the efficiency of TPGR with a refined tree policy model. Specifically, a complete tree is used to represent the policy, and a parameter-sharing strategy is incorporated to improve both learning and evaluation efficiency. The proposed method is evaluated on three well-known benchmark datasets to illustrate its effectiveness and efficiency. To summarize, the contributions of this paper are as follows:
An attention-based state representation model for IRS is proposed that can effectively capture the user dynamics even when the sequential data are long and noisy.
An efficient tree-structured policy is devised that improves the learning and decision-making efficiency of TPGR.
The proposed model is compared against state-of-the-art methods, and the results on three benchmark datasets demonstrate that it is both effective and efficient.
3. Method
In this section, we first introduce the problem statement of RL-based interactive recommendation and then describe the proposed ATRec method in detail.
3.1. Problem Statement
In this work, we consider the recommendation task in which a recommender agent interacts with an environment (i.e., users). During the interaction process, the recommender agent sequentially suggests items to the user so as to maximize its cumulative reward. Usually, the reward of the environment is set to a score derived from the user's feedback, such as the rating of the corresponding item. Consequently, we model the recommendation process as a Markov decision process (MDP), whose key components form the tuple $(S, A, P, R, \gamma)$. The details of each component are defined as follows:
State S: A state is defined as the historical interactions between the user and the recommender agent.
Action A: An action is an item (or list of items) suggested by the recommender agent.
Transition probability function P: $P(s' \mid s, a)$ is the transition function that determines the new state after the recommender agent suggests item $a$ under observation $s$; it models the dynamics of user preference.
Reward function R: $R(s, a)$ is a function that calculates the immediate reward received by the recommender agent after the user provides feedback on the recommended items $a$ under state $s$.
Discount factor $\gamma$: $\gamma \in [0, 1]$ defines the measurement of the present value of long-term rewards.
In the recommendation process, the recommender agent interacts with a user by providing recommended items based on the user’s state and then receives an immediate reward that indicates the user’s feedback. Our goal is to find a recommendation policy that can maximize the cumulative reward for the recommender agent.
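To make this loop concrete, the following minimal Python sketch rolls out one recommendation episode under the MDP above; the environment and agent interfaces (`reset`, `step`, `act`) are hypothetical placeholders rather than the paper's actual implementation.

```python
# Minimal sketch of the IRS interaction loop under the MDP formulation above.
# `env` and `agent` are hypothetical objects; only the control flow is shown.

def run_episode(env, agent, gamma=0.95, max_steps=32):
    """Roll out one recommendation episode and return the discounted return."""
    state = env.reset()                      # initial state: the user's recent history
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = agent.act(state)            # suggest an item (or item list)
        state, reward, done = env.step(action)  # user feedback updates the state
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total
```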
3.2. Attention-Based State Representation Model
In this section, we describe the proposed attention-based state representation model in detail.
The proposed state representation model consists of an input layer, a feature embedding layer, an attention layer, and an output layer, as shown in Figure 2. The input layer receives the feature vectors of the items and the users' feedback and preprocesses them with a fully connected layer. The attention layer calculates the weight of each corresponding item. Finally, the output layer concatenates the user feature with the context embedding vector produced by the attention layer. In the input layer, the historical N interactions, consisting of the recommended items and their corresponding rewards along with the user ID, are collected. Assuming the recommender agent is performing the $t$-th recommendation, the $N$ previous item–reward pairs (from $(i_{t-N}, r_{t-N})$ to $(i_{t-1}, r_{t-1})$) are used to encode the state. The item ID and user ID are mapped to latent feature vectors via matrix factorization; these vectors are denoted as the item feature and the user feature, respectively, and are fixed during learning. The user feedback on the corresponding recommended items, such as ratings, is also appended to the item feature. In the feature embedding layer, the input user features and item features are encoded by a fully connected layer to obtain the feature embeddings.
In order to capture the dynamics of different items in long and noisy sequential data, we adopt an attention layer that discriminatively captures the contribution of each item together with the user features. The attention layer aims at learning the integration weights for both the item and user features. Denoting $e_i$ as the $i$th preprocessed item feature and $e_u$ as the preprocessed user feature, the attentive context embedding is calculated as follows:

$$c_t = \alpha_u e_u + \sum_{i=1}^{N} \alpha_i e_i,$$

where $\alpha_i$ is the integration weight of the $i$th item embedding vector with respect to the $t$th target item and $\alpha_u$ is the integration weight for the specific user. The attention weights are calculated through a Softmax layer as below:

$$\alpha_i = \frac{\exp(w^\top e_i + b)}{\sum_{j} \exp(w^\top e_j + b)},$$

where $w$ and $b$ are the weight and bias parameters of the attention layer, respectively, and the sum over $j$ ranges over the $N$ item embeddings and the user embedding.
Finally, in the output layer, we concatenate the input user feature and the attentive context embedding to form the state embedding vector. The concatenation is denoted by ⊕ in Figure 2.
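As an illustration, the following PyTorch sketch implements a state encoder with this structure; the layer sizes, the `tanh` activations, and the exact way the scalar feedback is appended to the item feature are our assumptions, since the text specifies only the overall architecture.

```python
import torch
import torch.nn as nn

class AttentiveStateEncoder(nn.Module):
    """Sketch of the attention-based state encoder described above.
    Dimensions and activations are assumptions; only the structure
    (input FC layer, attention, concatenation) follows the text."""
    def __init__(self, feat_dim, emb_dim):
        super().__init__()
        # feature embedding layer: item feature + scalar rating -> embedding
        self.item_fc = nn.Linear(feat_dim + 1, emb_dim)
        self.user_fc = nn.Linear(feat_dim, emb_dim)
        # attention layer producing one score per embedding
        self.att = nn.Linear(emb_dim, 1)

    def forward(self, item_feats, rewards, user_feat):
        # item_feats: (N, feat_dim), rewards: (N,), user_feat: (feat_dim,)
        items = torch.cat([item_feats, rewards.unsqueeze(-1)], dim=-1)
        e_items = torch.tanh(self.item_fc(items))                 # (N, emb_dim)
        e_user = torch.tanh(self.user_fc(user_feat))              # (emb_dim,)
        embs = torch.cat([e_items, e_user.unsqueeze(0)], dim=0)   # (N+1, emb_dim)
        alpha = torch.softmax(self.att(embs).squeeze(-1), dim=0)  # (N+1,)
        context = (alpha.unsqueeze(-1) * embs).sum(dim=0)         # (emb_dim,)
        # output layer: concatenate the user feature with the context vector
        return torch.cat([user_feat, context], dim=-1)
```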
Although the state representation model DRR-att in [4] looks similar to our proposed model, ours incorporates the users' feedback into the item input. In addition, we preprocess the input features with a fully connected layer and simplify the integration of the user and item features. Our experimental results show that the proposed method outperforms DRR-att on our benchmark datasets.
3.3. Efficient Tree-Structured Policy
Most value-based RL methods are inefficient, since they involve a maximization over the action space, which is often very large. In contrast, policy gradient methods are more efficient for IRS tasks, since they represent the policy as a network that takes the state as input and outputs action probabilities. Unfortunately, learning such a policy network with a large output dimension is still time-consuming, since the softmax output layer requires explicit normalization over all actions [5]. Representing the output layer of the policy network with a tree-structured network, such as hierarchical softmax [44] or TPGR [5], can substantially reduce the computation cost in learning and evaluation.
In TPGR, the policy network is represented by a tree-structured neural network, with each leaf node representing the output probability of the corresponding item and each non-leaf node outputting a probability for selecting a node in the layer below. The output probability of each item is calculated by multiplying the probabilities along the traversal path. One major drawback is that computing the recommendation probability of each candidate item requires specifying the path from the root node to the corresponding leaf. Traversing the tree requires a loop to determine the node at each layer, which is quite inefficient in practice. Specifically, consider the tree structure of the TPGR policy. Denoting $M$ as the total number of items and $c$ as the number of children per node, the balanced clustering algorithm in TPGR arranges the items into $\lceil M/c \rceil$ clusters, where $\lceil x \rceil$ returns the smallest integer that is no less than $x$. Among those clusters, $\lceil M/c \rceil - 1$ contain exactly $c$ items and the remaining one contains $M - c(\lceil M/c \rceil - 1)$ items. Since the item number of each cluster determines the output dimension of the corresponding node in the final layer of the tree policy, this distribution of items over the leaf nodes requires layer-to-layer traversal of the tree, and indexing each item takes $O(d)$ decision time, where $d = \lceil \log_c M \rceil$ is the tree depth. For IRS, calculating the top-K ranked items for both evaluation and recommendation is very common in practice. In this scenario, TPGR needs $O(Md)$ decision time to obtain all the action probabilities before extracting the top-K ranked items. Such a computation scheme for the recommendation probabilities of candidate items lacks efficiency and is hard to parallelize. Therefore, deploying TPGR in real-world IRS is challenging.
To address this challenging issue, we propose a novel complete tree to better represent the structure of the policy, as illustrated in Figure 3. In this framework, each non-leaf node of the policy tree receives the item feature list as input and contains two parts: a state representation model and a node policy network. The output of a leaf node denotes the recommendation probability of the corresponding item and is computed by traversing the tree from the root to that leaf. For the leaf nodes, clustering over items can also be used to assign similar items to the same node and thus simplify learning; we describe the clustering algorithm for the complete tree policy in Appendix A. The property of a complete tree makes it easy to index every node on a decision path. If we store the nodes of the tree policy in an array, with the root at index 0 and the children of node $i$ at indices $ci+1, \ldots, ci+c$, then, given an item (leaf) index $I$, the indices of the tree nodes from bottom to top can be obtained by repeatedly applying the parent mapping $I \mapsto \lfloor (I-1)/c \rfloor$. Therefore, we can compute all the item recommendation probabilities with a few simple matrix computations. Algorithm 1 illustrates the computation of item recommendation probabilities based on the complete tree policy; with this algorithm, we obtain all the item recommendation probabilities in $O(d)$ matrix operations. In addition, we can construct the computational graph of the complete tree policy with popular deep learning toolkits such as TensorFlow (http://www.tensorflow.org, accessed on 10 April 2023) and PyTorch (http://pytorch.org, accessed on 10 April 2023), which makes parallel computing easy.
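The indexing property can be illustrated with a few lines of Python; the array layout (root at index 0, children of node $i$ at $ci+1, \ldots, ci+c$) is the assumed convention stated above.

```python
def path_to_root(leaf_index, c):
    """Return the indices of the nodes on the path from a leaf to the root,
    assuming the complete c-ary tree is stored in an array with the root at
    index 0 and the children of node i at indices c*i + 1, ..., c*i + c."""
    path = [leaf_index]
    node = leaf_index
    while node > 0:
        node = (node - 1) // c   # parent index in the array layout
        path.append(node)
    return path                  # bottom-to-top indices

# e.g., in a complete binary tree (c=2), leaf 11 -> [11, 5, 2, 0]
```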
Moreover, during the computation of the recommendation probabilities, we take a cumulative product over tree nodes, each of which contains its own state representation model. Backpropagation through such a network is time-consuming, as it involves too many parameters. Hence, we design a parameter-sharing strategy to reduce the number of parameters and further improve the learning efficiency of the tree policy. For each node, the parameters of the state representation model can be shared at two levels:
All-shared: The parameters of the state representation model are shared across all the tree nodes.
Layer-shared: The parameters of the state representation model are shared across different layers; that is, for one layer of the tree, the nodes hold the same state representation model.
Note that the parameter-sharing strategy is not applied to the node policy networks, so that each node keeps its own output probabilities. In the next section, we will illustrate the learning of the complete tree policy.
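The two strategies can be sketched as follows; `make_encoder` is a hypothetical factory for the state representation module (e.g., the encoder sketched earlier) and is not part of the paper.

```python
def build_state_encoders(depth, make_encoder, mode="all"):
    """Sketch of the two parameter-sharing strategies for the tree's
    state representation models."""
    if mode == "all":
        shared = make_encoder()
        # one encoder instance reused by every node in the tree
        return {layer: shared for layer in range(depth)}
    elif mode == "layer":
        # one encoder per layer, shared by all nodes within that layer
        return {layer: make_encoder() for layer in range(depth)}
    raise ValueError(f"unknown sharing mode: {mode}")
```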
Algorithm 1 Decision making based on the complete tree policy

- Input: item feature list $F$; tree-policy nodes $v_1, v_2, \ldots, v_n$; tree depth $d$; child number $c$.
- Output: The probabilities of the top-$M$ corresponding items.
- 1: $p_0 \leftarrow [1]$
- 2: $k \leftarrow 1$
- 3: for $l \leftarrow 1$ to $d$ do
- 4: $P_l \leftarrow \emptyset$
- 5: for $j \leftarrow 1$ to $c^{l-1}$ do
- 6: $q \leftarrow p_{l-1}[j] \odot v_k(F)$ {Compute the element-wise product between the two nodes' outputs}
- 7: Add $q$ to $P_l$
- 8: $k \leftarrow k + 1$
- 9: end for
- 10: $p_l \leftarrow \mathrm{concat}(P_l)$ {Concatenate all the outputs for each layer}
- 11: end for
- 12: return The first $M$ elements of $p_d$
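A NumPy sketch of Algorithm 1's layer-by-layer computation is given below; the batching of each layer's node outputs into a single `(c**l, c)` array is our assumption about how the matrix computations are arranged.

```python
import numpy as np

def tree_probabilities(node_outputs, d, c):
    """Compute leaf probabilities of a complete c-ary tree policy layer by
    layer, following Algorithm 1. `node_outputs[l]` is assumed to be an
    array of shape (c**l, c): the softmax outputs of all nodes in layer l
    (layer 0 = root). Returns the probability of each of the c**d leaves."""
    probs = np.ones(1)                      # probability of reaching the root
    for l in range(d):
        out = node_outputs[l]
        assert out.shape == (c**l, c)
        # multiply each node's child distribution by the probability of
        # reaching that node, then flatten into the next layer's reach probs
        probs = (probs[:, None] * out).reshape(-1)
    return probs                            # (c**d,) leaf probabilities

# Toy check with a depth-2 binary tree: the leaf probabilities sum to 1.
outs = [np.array([[0.6, 0.4]]), np.array([[0.5, 0.5], [0.3, 0.7]])]
assert np.isclose(tree_probabilities(outs, d=2, c=2).sum(), 1.0)
```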
3.4. Learning Process
The learning process of the proposed method is illustrated in Figure 4. First, we use the historical user–item matrix to extract the features of the items and users. Specifically, we use Funk SVD [45] to decompose the original user-rating matrix into an item matrix and a user matrix, which serve as the item features and user features, respectively. Clustering is then performed to separate the items into subclasses in order to simplify the learning of the tree policy. As we use a complete tree to represent the recommender policy, the clustering tree can be constructed by applying a hierarchical clustering algorithm that associates each leaf node with one item. We adopt a K-means-based clustering algorithm to build the clustering tree, as depicted in Algorithm 2. Based on the clustering result, the complete tree policy is constructed, with each leaf node corresponding to a certain item ID. Recommendations are provided to the user by mapping the output action of the policy to the items. During the interaction with the user, the tree-structured policy is trained with a policy gradient method.
Algorithm 2 K-means clustering algorithm for building the complete tree policy

- Input: a group of $n$ vectors $\{v_1, \ldots, v_n\}$ and the child number of the complete tree $c$.
- Output: The clusters for each leaf node of the tree.
- 1: Initialize: mark all the input vectors as unassigned.
- 2: Use the K-means algorithm to find $c$ centroids: $\mu_1, \ldots, \mu_c$.
- 3: for $i \leftarrow 1$ to $c-1$ do
- 4: Find the $\lceil n/c \rceil$ nearest vectors to $\mu_i$ among the unassigned vectors based on Euclidean distance.
- 5: Assign them to the $i$th cluster.
- 6: Mark them as assigned.
- 7: end for
- 8: Assign the remaining unassigned vectors to the $c$th cluster.
- 9: Return: All $c$ clusters.
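A Python sketch of Algorithm 2 follows; the use of scikit-learn's `KMeans` and the cluster size $\lceil n/c \rceil$ are our assumptions, consistent with building one balanced level of the complete tree.

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_clusters(vectors, c, seed=0):
    """Sketch of Algorithm 2: split `vectors` (an (n, dim) array) into c
    roughly equal clusters for one level of the complete tree."""
    n = len(vectors)
    size = int(np.ceil(n / c))
    centroids = KMeans(n_clusters=c, n_init=10, random_state=seed) \
        .fit(vectors).cluster_centers_
    unassigned = set(range(n))
    clusters = []
    for i in range(c - 1):
        # sort the unassigned vectors by Euclidean distance to centroid i
        idx = sorted(unassigned,
                     key=lambda j: np.linalg.norm(vectors[j] - centroids[i]))
        chosen = idx[:size]                # nearest unassigned vectors
        clusters.append(chosen)
        unassigned -= set(chosen)
    clusters.append(sorted(unassigned))    # remaining vectors -> cth cluster
    return clusters
```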
The learning of the tree policy can utilize any policy gradient method; here, we use the REINFORCE [46] algorithm to illustrate the learning process. Denoting the overall tree policy as $\pi_\theta$, since the learning objective is to maximize the expected discounted total reward, the loss function can be written as:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^t r_t\right].$$

Based on the policy gradient theorem, the gradient with respect to the parameter $\theta$ can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right],$$

where $\pi_\theta(a \mid s)$ is the probability of taking action $a$ at state $s$ and $Q^{\pi_\theta}(s, a)$ is the expected discounted reward after taking action $a$ under state $s$. During learning, $Q^{\pi_\theta}(s, a)$ can be estimated by sampling trajectories under the policy $\pi_\theta$ from either historical user data or online data. The whole learning algorithm is depicted in Algorithm 3.
Algorithm 3 Learning the complete tree policy

- Input: Complete tree policy $\pi_\theta$, learning rate $\eta$, discount factor $\gamma$.
- Output: The learned policy $\pi_\theta$.
- 1: Initialize: Policy parameter $\theta$.
- 2: repeat
- 3: Sample an episode $(s_1, a_1, r_1, \ldots, s_n, a_n, r_n)$ from historical user data or online data.
- 4: for $t \leftarrow 1$ to $n$ do
- 5: $G_t \leftarrow \sum_{k=t}^{n} \gamma^{k-t} r_k$
- 6: Calculate $\pi_\theta(a_t \mid s_t)$ based on Algorithm 1.
- 7: $\theta \leftarrow \theta + \eta\, \gamma^t G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
- 8: end for
- 9: until Converged
- 10: Return: $\pi_\theta$
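The update in Algorithm 3 can be sketched in PyTorch as follows; for simplicity, this version omits the $\gamma^t$ weighting of each step's update, a common practical simplification of REINFORCE.

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.95):
    """One REINFORCE update for a sampled episode (cf. Algorithm 3).
    `log_probs[t]` is log pi_theta(a_t | s_t) obtained via Algorithm 1
    (kept in the computation graph); `rewards[t]` is the user feedback."""
    returns, G = [], 0.0
    for r in reversed(rewards):            # discounted returns G_t
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # negative sign: gradient ascent on the expected return
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```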
4. Results and Discussion
In this section, the empirical study of the proposed method is presented. Specifically, we first describe the experimental setup, including the datasets and baseline methods, and then evaluate the performance and efficiency of the proposed method. We intend to answer the following four research questions (RQ) through the experiments:
RQ1: How does the proposed method perform compared with the state-of-the-art interactive recommendation methods?
RQ2: Does our method improve learning efficiency?
RQ3: Does the proposed state representation method improve the performance over the state-of-the-art methods?
RQ4: How do the different parameter-sharing strategies affect the performance of our model?
4.1. Experimental Setting
4.1.1. Datasets
We conduct experiments on three representative real-world benchmark datasets: Instant Video, Baby, and Musical Instruments, which are commonly used for testing the performance of IRSs. These datasets contain product reviews from Amazon (https://jmcauley.ucsd.edu/data/amazon/, accessed on 10 April 2023) [47,48]. The ratings in each dataset range from 0 to 5. Specifically, we use a quarter of each dataset for evaluation. Table 1 lists the statistics of the three datasets. For each dataset, we use 80% of the data for training and the remaining 20% for testing.
Due to the interactive nature of IRS, the ideal way to conduct experiments is to interact directly with real users. However, online experiments can be too expensive and carry commercial risks for the IRS itself [5,14]. Following existing works [1,4], we use an offline environment simulator built on the offline datasets to conduct the experiments. At each timestep, the environment simulator provides the historical items and ratings of a user and returns feedback after the recommender system suggests items. The reward function of the environment linearly normalizes the user's rating to $[-1, 1]$:

$$r = \frac{r_{ij} - 2.5}{2.5},$$

where $r_{ij}$ is the rating of user $i$ for item $j$.
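A one-line sketch of this reward normalization follows; the $[-1, 1]$ target range is an assumption consistent with the formula above.

```python
def normalized_reward(rating, r_max=5.0):
    """Linearly map a rating in [0, r_max] to [-1, 1] (assumed target range)."""
    return 2.0 * rating / r_max - 1.0
```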
4.1.2. Evaluation Metrics
We report three commonly used evaluation metrics [49,50] in the experiments:
Average Reward: Since an IRS aims to maximize the total reward of an episode, the average reward is a straightforward performance measure. We adopt the reward over the top-K suggested items: if the top-K suggested items contain the item that the user selects, the reward is set to the rating given by the user; otherwise, the reward over the top-K suggested items is set to 0.
Hit Ratio HR@K: HR measures the fraction of items that the user favors in the recommendation list and is calculated as below:

$$HR@K = \frac{1}{T} \sum_{t=1}^{T} hit_t,$$

where we define $hit_t = 1$ if the item the user selects and favors is in the top-K suggested items, $hit_t = 0$ otherwise, and $T$ is the number of interaction steps.
Mean Reciprocal Rank MRR@K: MRR@K measures the average reciprocal rank of the first relevant item. Denoting $rank_t$ as the rank of the first relevant recommended item at step $t$, MRR@K is calculated as below:

$$MRR@K = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{rank_t},$$

where $1/rank_t$ is set to 0 if the first relevant item is not in the top-K suggested items.
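The two metrics can be computed as in the following sketch; the representation of the interaction log as per-step ranked lists and target items is our assumption.

```python
import numpy as np

def hr_and_mrr_at_k(ranked_lists, target_items, k):
    """Sketch of HR@K and MRR@K over T interaction steps. `ranked_lists[t]`
    is the recommendation list at step t; `target_items[t]` is the item the
    user actually selected and favored (as defined in the text above)."""
    hits, rr = [], []
    for recs, target in zip(ranked_lists, target_items):
        top_k = list(recs[:k])
        hit = target in top_k
        hits.append(float(hit))
        # reciprocal rank of the relevant item, 0 if it missed the top-K
        rr.append(1.0 / (top_k.index(target) + 1) if hit else 0.0)
    return np.mean(hits), np.mean(rr)
```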
4.1.3. Compared Methods
We compare our method with state-of-the-art IRS methods of different types, as listed below:
Popularity: Ranks the top-K frequent items according to their popularity, measured by the number of ratings; a simple but widely adopted baseline method.
SVD: Suggests recommendations based on singular value decomposition (SVD). For the IRS setting, the model is retrained after each user interaction and recommends the items with the highest predicted ratings.
DDPG-KNN: A DDPG-based method that maps the discrete action space to a continuous one and then selects the K nearest items in the continuous space with the maximum Q-value given by the critic network [23]. In our experiment, K is set proportionally to the total number of items N.
DQN-R: A DQN-based method that adopts a refined DQN to evaluate the Q-values of the items and chooses the item with the maximum Q-value [24].
TPGR: Adopts a tree-structured policy and optimizes it with the policy gradient [5]. This is the state-of-the-art IRS approach and the one most similar to our proposed method.
For the deep RL methods DDPG-KNN, DQN-R, and TPGR, we use TensorFlow version 1.4 for the implementation. We will open-source our code after acceptance. The experimental details can also be found in Appendix A.
4.2. Performance Evaluation (RQ1)
We investigate the recommendation performance of ATRec against the state-of-the-art baselines with respect to average reward@K, HR@K, and MRR@K. In this part, we fix the length of each episode to 32. The experimental results are summarized in Table 2, where the best result in each row is highlighted in bold. The proposed method clearly outperforms the comparison baselines on the three Amazon datasets, especially on the Instant Video dataset. The average improvement in HR@30 is 42.3% and the average improvement in MRR@30 is 21.4%.
The conventional methods, popularity and matrix factorization, perform poorly on the three datasets. DDPG-KNN performs worse with a small K and improves as K approaches the total number of items N on both the Instant Video and Baby datasets. However, on the Musical Instruments dataset, which contains more items than the other two, DDPG-KNN performs even worse at the larger K. DQN-R performs much better than DDPG-KNN, and TPGR is better than the other baseline methods on all three datasets.
Compared with TPGR, the proposed ATRec method performs better on all three datasets, which can be explained by two factors. First, the state representation model in ATRec captures more contextual information from the sequence than TPGR, which uses only the final state of an RNN; the attentive state representation model is more robust to the long and noisy dependencies in the data. Second, the state representation model in ATRec is fully learned with an online reinforcement learning method, while that of TPGR is trained with offline supervised learning. Online learning allows the model to be updated as new data arrive, enabling it to adjust to dynamic changes in user preference and leading to better performance in both reward and hit rate.
4.3. Efficiency Evaluation (RQ2)
In this subsection, we evaluate the learning and decision efficiency of the proposed method against the baseline methods. The length of each episode is again set to 32. We run each method on an AMD Ryzen 3600 6-core CPU with an NVIDIA GeForce RTX 2060 GPU. For learning efficiency, we compare the average learning time per step of ATRec with three RL-based recommendation methods, i.e., DDPG-KNN, DQN-R, and TPGR, as shown in Table 3. Among the baseline methods, TPGR performs best, as its tree structure reduces the per-step learning cost from $O(M)$ to $O(c \lceil \log_c M \rceil)$. With the help of the parameter-sharing strategies, ATRec greatly reduces the number of parameters in learning and thus improves learning efficiency by an average of 35.5% on the three benchmark datasets. Notably, the fully online ATRec needs less learning time than TPGR, whose state representation model is trained offline.
For decision efficiency, we specifically compare the two tree-based methods, ATRec and TPGR, evaluating the average decision time per item and the average decision time for the top-K items. The results are shown in Table 4. For TPGR, the decision time per item is shorter than that of ATRec, as it only needs one traversal from the root to a leaf and computes only the nodes on that path, while ATRec computes the outputs of all tree nodes. However, calculating the top-K items takes more time in TPGR, since ATRec obtains the recommendation probabilities of all items in a single computation, while TPGR must calculate the probability of each item iteratively. Therefore, in practice, ATRec is more efficient and more applicable to real-world recommender systems than TPGR.
4.4. Influence of the Attention-Based State Representation (RQ3)
We also conduct experiments to show the influence of the attention-based state representation in ATRec by comparing it with four state representation models:

Caser [51]: A popular CNN-based model for sequential recommendation that embeds the sequence with multiple convolutional filters to capture user dynamics.

ATEM [26]: An attention-based model for sequential recommendation. Compared to the proposed state representation model, ATEM ignores the user feature and the user feedback.

TPGR's state representation model [5]: An RNN-based state representation model that encodes the state with the final output of an RNN.

DRR-att [4]: An attention-based state representation model that uses an attention mechanism and average pooling to obtain the user's feature. Compared to DRR-att, our method introduces the user's feedback and preprocesses the user and item features with a fully connected layer.
To make a fair comparison, each state representation model uses the same tree policy network as ATRec; the only difference between the baselines and ATRec is the state representation model. Each method is trained completely online. We report the HR@30 of the five state representation models under five episode lengths: 8, 16, 32, 64, and 128. The experimental results are shown in Figure 5.
The two sequential recommendation models, Caser and ATEM, perform poorly in the IRS setting. The proposed ATRec clearly outperforms all the baseline methods on the three benchmark datasets. Compared to DRR-att, preprocessing the input user and item features improves the effectiveness of the state representation model. In addition, as the episode length increases, our method obtains better performance on the Instant Video and Musical Instruments datasets, whereas the performance of DRR-att drops as the episode grows longer, especially on the Instant Video dataset.
We also notice that the ATRec method with TPGR’s state representation outperforms the original TPGR, which implies that the complete tree policy with a parameter-shared online learning framework is more effective than the offline one.
4.5. Effect of Different Parameter Sharing Strategies (RQ4)
In this subsection, we evaluate the effect of the two parameter-sharing strategies: layer-shared and all-shared. We report the performance and efficiency of the two strategies in Table 5. To make a fair comparison, the two ATRec variants differ only in the sharing strategy. We notice that in most cases the all-shared strategy outperforms the layer-shared strategy in both performance and efficiency, except on the Baby dataset. Therefore, we conclude that in most cases the all-shared strategy is sufficient for IRS policy representation in ATRec. The unshared tree structure in TPGR is inefficient in learning, as it contains redundant parameters.