Predicting Dynamic User–Item Interaction with Meta-Path Guided Recursive RNN

Accurately predicting user–item interactions is critically important in many real applications, including recommender systems and user behavior analysis in social networks. One major drawback of existing studies is that they generally analyze the sparse user–item interaction data directly, without considering the semantic correlations and the structural information hidden in the data. Another limitation is that existing approaches usually embed users and items into different embedding spaces in a static way and ignore the dynamic characteristics of both. In this paper, we propose to learn dynamic embedding trajectories, rather than static embedding vectors, for users and items simultaneously. A Metapath-guided Recursive RNN based Shift embedding method named MRRNN-S is proposed to learn the continuously evolving embeddings of users and items for more accurately predicting their future interactions. MRRNN-S extends our previous model RRNN-S. Compared with RRNN-S, we add a word2vec module and a skip-gram-based meta-path module to better capture the rich auxiliary information in the user–item interaction data. Specifically, we first regard the interactions of each user with items as sentence data to model their semantic and sequential information, and construct the user–item interaction graph. Then we sample instances of meta-paths to capture the heterogeneity and structural information of the user–item interaction graph. A recursive RNN is proposed to iteratively and mutually learn the dynamic user and item embeddings in the same latent space based on their historical interactions. Next, a shift embedding module is proposed to predict the future user embeddings. To predict which item a user will interact with, we output the item embedding instead of the pairwise interaction probability between users and items, which is much more efficient. Through extensive experiments on three real-world datasets, we demonstrate that MRRNN-S achieves superior performance compared with state-of-the-art baseline models.


Introduction
In the era of big data, the large volume of online information generated in real time makes it very difficult for people to quickly find the valuable information they need. Recommendation algorithms are an effective way to address this information overload: they recommend useful information to users while filtering out less relevant information [2,3]. For example, on online multimedia websites, user experience can be greatly improved by recommending movies or songs that users may be interested in [4]. On many e-commerce platforms, recommending products that users would like to buy can help save their time and money [5]. Thus, accurately predicting the interactions between users and items is critically important to recommendation [6][7][8]. Figure 1a shows an example of the sequential interactions between two users (Bob and Alice) and items. One can see that Bob first buys a cell phone, and shortly afterwards he buys a phone case and a phone film. Thus, we can infer that he is more likely to buy a mobile earphone next than a suit. Figure 1b shows an example of user-item interactions on an e-commerce platform. Each arrow represents an interaction from a user to an item, e.g., a user clicking, buying, or browsing a commodity on the e-commerce platform Taobao. Each interaction is associated with a timestamp t and a feature vector f (such as the interaction type and the user and commodity features).
Figure 1. (a) Illustration of two user interaction sequences: Bob buys a cell phone, a phone case, a phone film, and a mobile earphone; Alice buys a suit, a dress, shoes, and a hat, successively. (b) A toy example of an interaction network containing three users and four items. Each arrow represents an interaction from a user to an item. Each interaction is associated with a timestamp t and a feature vector f (such as the feature of the commodity).
Increasing research effort has been devoted to recommender systems, and great progress has been made recently [9,10]. Inspired by successful applications in natural language processing (NLP), Hidasi et al. proposed to use a GRU module to process the session-based behavior sequence of users for predicting the next item [11]. In related approaches, the interaction sequence is fed into an LSTM module to adapt to the dynamics of users and items. Wang et al. [12] applied a graph neural network architecture to a knowledge graph to acquire rich auxiliary information for the recommendation task. However, there are three major issues when applying such methods directly to our studied problem. First, as shown in Figure 1a, the successive online behaviors of users are usually highly correlated, but this correlation is largely ignored by existing collaborative filtering-based recommender systems [13,14]. Intuitively, if a user first buys a suit and then buys a pair of shoes, the next item that the user is likely to buy in the near future is an apparel product rather than an electronic product such as a mobile phone. Second, some existing works only explore the interaction relationships between users and items, while ignoring the relationships between users and users as well as between items and items. User-user and item-item relationships are important auxiliary information for alleviating the data sparsity issue. Third, user preferences and item popularity evolve over time [15,16]. Existing works mostly learn a static representation vector for each user and item and fail to capture the dynamic representations of users and items that evolve over time.
To address the above issues, in this paper we propose a Meta-path guided Recursive RNN based Shift embedding method named MRRNN-S to more effectively learn the dynamic representations of users and items, based on which future user-item interactions can be more accurately predicted. Building on our previous work [1], we argue that the previously proposed RRNN-S model can be further improved in how it processes the original data. RRNN-S only utilized the second-order graph structural information and ignored the fact that the sequence data and the heterogeneity hidden in the graph can also provide auxiliary information. Therefore, we propose to add two new modules to the RRNN-S model, namely the word2vec module and the skip-gram-based meta-path module. The word2vec module treats the sequential interaction data as sentences and extracts their semantic information with the word2vec method. The skip-gram-based meta-path module models the user-item interaction data as a heterogeneous graph to capture the heterogeneous information and higher-order proximity more effectively, and then learns the node embeddings on the graph. Next, a recursive RNN module is designed to capture the sequential dependence of user-item interactions by mapping users and items into the same latent representation space. The embeddings of users and items are mutually and iteratively updated by the proposed recursive RNN. Then, a shift embedding module is designed to predict the future embedding of a user from the elapsed time interval. Finally, we predict the embedding of the item and identify the item whose embedding vector is closest to it in the embedding space.
To summarize, our main contributions are as follows:
• To acquire the rich auxiliary information from the user-item interaction data, we model the original user-item interaction data as sequences and graphs, respectively. A word2vec module is applied to the interaction sequence of each user to learn an initial embedding that preserves the sequential pattern and semantic information of the sequences. Then a skip-gram-based meta-path module is applied to the user-item interaction graph to capture the heterogeneous information and the higher-order user-item relationships.
• We propose to apply a GCN module to learn the node features of the user-item interaction graph, so that similar users or items are closer in the feature space.
• Comprehensive experiments are conducted on three user-item interaction graph datasets. The results demonstrate the effectiveness of our method against several competitive baselines.
The proposed MRRNN-S is an extended version of the RRNN-S model, which was proposed in our earlier paper published in ADMA 2020. We briefly describe the differences between this paper and our previous conference paper. We extract the static embedding with a newly designed module to better capture the auxiliary information. The previous work [1] only considered the structural information from the subgraphs extracted from the user-item interaction graph when exploring the static user and item embeddings. This work improves the model's capability of feature extraction by transforming the original data into the interaction sequence of each user and the user-item interaction graph, which allows the embeddings to preserve the sequential information as well as the heterogeneous graph information. We also re-conduct most of the experiments and add several new experiments to demonstrate the superiority of MRRNN-S.
The remainder of this paper is organized as follows. We will first discuss related works in Section 2. Then Section 3 will give some notations and a formal definition of the studied problem. Section 4 will introduce the proposed model MRRNN-S and the objective function. In Section 5, we will evaluate our approach and report the results. Finally, the conclusion will be given in Section 6.

Related Work
In this section, we review related works from the aspects of collaborative filtering recommendation, deep learning-based recommendation, graph-based recommendation, and temporal network embedding.

Collaborative Filtering Recommendation
The collaborative filtering (CF) algorithm is one of the most classic models in recommender systems. Its main idea is to draw on the collective wisdom in a large amount of user behavior data to make recommendations. CF can be roughly divided into user-based collaborative filtering [17], item-based collaborative filtering [18], and model-based collaborative filtering [19]. Linden et al. proposed an item-to-item collaborative filtering algorithm [20], which matches the items a user has interacted with to similar items and then puts all similar items into a recommendation list. High-quality real-time recommendations can be generated because the online computation scales independently of the number of users and the number of items in the catalog. A model named neural collaborative filtering (NCF) was proposed to improve the capability of feature interaction learning by replacing the inner product operation in the matrix factorization model with a neural network [21]. However, most existing collaborative filtering models ignore the latent sequential patterns when dealing with dynamic user-item interaction data.

Deep Learning-Based Recommendation
Due to the powerful feature extraction capabilities of deep learning techniques, many recent works have combined deep learning with recommender systems and achieved promising performance [22]. Recent deep learning-based recommendation models can be roughly categorized into recommender systems (RS) with neural building blocks (e.g., MLP, AE, RNN, CNN) and RS with deep hybrid models (e.g., RNN + CNN, AE + CNN). Ref. [23] jointly trained wide linear models and deep neural networks to combine the benefits of memorization and generalization for recommender systems. Ref. [24] proposed a model that integrated a CNN into probabilistic matrix factorization, which was able to capture the contextual information of documents and improved the prediction accuracy. Ref. [25] designed a flexible encoder-decoder architecture consisting of a CNN and an RNN. The model is capable of incorporating author metadata to learn a robust representation of the citation context. In order to acquire rich auxiliary information, ref. [26] combined a denoising auto-encoder with a convolutional auto-encoder to extract the textual and visual features of items, respectively. However, these works only focus on learning static user and item embeddings, which is not suitable for a temporal user-item interaction network. The sequential pattern hidden in temporal user-item interactions reflects the dynamics of user preference for an item over time. Our work aims to capture the sequence of dynamic embeddings of users and items.

Graph-Based Recommendation
Recently, research on graph neural networks has made great progress in various domains [27][28][29]. Many works have tried to apply graph neural networks to recommender systems because user-item interaction data can be modeled as a graph. Sun et al. used the Bayesian graph convolutional network BGCN to model the uncertainty in the user-item interaction graph and thereby address the problem of spurious connections between some nodes in the graph [30]. It has been shown that the feature transformation and nonlinear activation operations of the traditional GCN contribute little to collaborative filtering, and thus a lightweight GCN model (LightGCN) was proposed for recommender systems [31]. Fan et al. preserved the sequential pattern in the interaction graph by assigning timestamp attributes to the edges of the user-item interaction graph. Inspired by the transformer, the TCT model was proposed to combine the sequential pattern in the data with the collaborative signal [32]. In order to obtain richer auxiliary information from a graph, more and more researchers are turning their attention to heterogeneous graphs. Jiang et al. proposed a novel contrastive GNN pre-training strategy on heterogeneous graphs, which was able to capture the semantic and structural information in a self-supervised way [33].
Since there are few studies on the dynamic embedding methods of heterogeneous graphs, Zhang et al. proposed the MDHNE model, which converted the heterogeneous graph into multiple views, and retained the evolution mode of the relationship between multiple views over time [34].

Temporal Network Embedding
Considering that the preference of a user and the popularity of an item can both change over time, increasing research interest has been devoted to temporal network embedding [35,36]. For example, Li et al. proposed the DANE model, which combined network topology and node features to achieve rapid dynamic updates and learn dynamic network embeddings [37]. Zhu et al. designed the DHPE model based on generalized SVD and matrix perturbation theory [38], which was able to preserve the high-order proximity while dynamically updating the node representations of the network. However, these algorithms learn embeddings from a sequence of graph snapshots, which is not suitable for our setting of successive user-item interaction data. With the development of NLP techniques [39,40], some NLP methods have also been applied to recommender systems. For example, inspired by the skip-gram model, Nguyen et al. proposed an embedding method based on temporal random walks [41]. It aims to learn more meaningful time-respecting embeddings from continuous-time dynamic networks. The drawback of the model is that it only generates a final static embedding for each node. Recently, Xu et al. designed a temporal graph attention model (TGAT) to aggregate the temporal-topological neighborhood features and learn the time-feature interactions [42]. TASER, proposed by Ye et al., is able to model both the absolute time pattern and the relative time pattern. The former highlights the users' time-sensitive behavior, and the latter shows the effect of the time interval on the relationship between two actions [43].

Preliminary
In a typical user-item interaction scenario, we use $u_t \in \mathbb{R}^n, \forall u \in U$ to denote the user embedding and $i_t \in \mathbb{R}^n, \forall i \in I$ to denote the item embedding, where $U$ and $I$ are the sets of users and items, respectively. The interactions between users and items form an ordered sequence denoted as $\mathcal{S}$. One historical interaction record is denoted as $S = (u, i, t, f) \in \mathcal{S}$, where $u$ and $i$ denote a user in $U$ and an item in $I$, respectively, and $t$ is the timestamp of interaction $S$. Each interaction has an associated feature vector $f$ (e.g., the embedding of user, item or interaction information). Table 1 lists the symbols and their descriptions used in this paper.

Table 1. Symbols and their descriptions.
Symbols | Description
$\mathcal{S}$ | the set of user-item interactions
$U$ and $I$ | the sets of users and items
$u_t$ and $i_t$ | the dynamic embeddings of user and item at timestamp $t$
$u$ and $i$ | the static embeddings of user $u$ and item $i$
$E_u$ and $E_i$ | the static embedding matrices of users and items
$u_{t+\Delta}$ | the predicted embedding of user $u$ at time $t + \Delta$
$i_{t+\Delta}$ | the predicted embedding of item $i$ at time $t + \Delta$
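To make the notation concrete, the following minimal sketch represents an interaction record S = (u, i, t, f) and groups a log of such records into time-ordered per-user item sequences. The class name `Interaction`, its field names, and the helper `per_user_sequences` are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Interaction:
    """One record S = (u, i, t, f); field names are illustrative."""
    user: str
    item: str
    timestamp: float
    features: Tuple[float, ...]


def per_user_sequences(interactions: List[Interaction]) -> Dict[str, List[str]]:
    """Group interactions by user and order each user's items by time."""
    ordered = sorted(interactions, key=lambda s: s.timestamp)
    sequences: Dict[str, List[str]] = {}
    for s in ordered:
        sequences.setdefault(s.user, []).append(s.item)
    return sequences


# Toy log, deliberately given out of chronological order.
log = [
    Interaction("bob", "phone_case", 2.0, (1.0,)),
    Interaction("bob", "cell_phone", 1.0, (1.0,)),
    Interaction("alice", "suit", 1.5, (0.0,)),
]
sequences = per_user_sequences(log)
```

Sorting once by timestamp before grouping keeps every per-user sequence in interaction order, which is the ordering the sequence-based modules rely on.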

Problem Definition
According to the notations above, we can formulate our problem as follows. Given a set of historical user-item interactions $\mathcal{S}$, our aim is to learn the future embeddings $u_{t+\Delta}$ and $i_{t+\Delta}$ for users and items, respectively, and to predict which item the user $u$ will interact with in a given future time slot $t + \Delta$.

Methodology
In this section, we introduce our model, which contains three parts as shown in Figure 2. The first part is the auxiliary information extraction from the historical user-item interactions. The second part is the generation of the dynamic user and item embeddings. Finally, the objective function of the model is introduced.

Auxiliary Information Extraction
For the user-item interaction data, auxiliary information can be acquired from the sequences and the graphs, which will be introduced in detail as follows.
Feature extraction from the interaction sequence. We can regard the interaction sequence of one user as a sentence and then learn the sequential information from it. As shown in the white box of Figure 2, each item the user interacts with in the sequence can be regarded as a word in a sentence. Thus, we can project the representations of items and users to a common latent space with word2vec. For example, Figure 3 illustrates an interaction sequence of user Jack with items. He first buys a basketball, and shortly afterwards he buys a basketball jersey, sneakers, and knee pads. These four items share some common features because they all belong to the category of sporting goods, so it is meaningful to make them close to each other in the projected latent feature space. The interaction sequence of user $j$ is denoted as $x_j = [z_1, z_2, \ldots, z_T]$, where $z_t \in \mathbb{R}^{d_0 \times 1}$ denotes the $t$-th item interacted with by user $j$ and $d_0$ is the initial dimension of an item. For each position $t = 1, \ldots, T$, our task is to predict the context words within a window of size $m$, given the center word $z_t$.
The likelihood function can be formulated as follows:
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le k \le m \\ k \ne 0}} P(z_{t+k} \mid z_t; \theta) \tag{1}$$
The objective function $J(\theta)$ is the average negative log-likelihood:
$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le k \le m \\ k \ne 0}} \log P(z_{t+k} \mid z_t; \theta) \tag{2}$$
Our goal is to find the parameters $\theta$ that minimize the objective function $J(\theta)$, which gives the initial embeddings of users and items that contain the sequence information.
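The training examples behind this objective are (center, context) pairs drawn from each interaction sequence. The sketch below enumerates those pairs for a window of size m; the function name `skipgram_pairs` and the toy item names are assumptions made for illustration, and the embedding training itself is left out.

```python
from typing import List, Sequence, Tuple


def skipgram_pairs(sequence: Sequence[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every position and every
    non-zero offset whose absolute value is at most `window`."""
    pairs = []
    for t, center in enumerate(sequence):
        lo = max(0, t - window)
        hi = min(len(sequence), t + window + 1)
        for c in range(lo, hi):
            if c != t:  # the center word is never its own context
                pairs.append((center, sequence[c]))
    return pairs


# Jack's interaction sequence from the running example.
jack = ["basketball", "jersey", "sneakers", "knee_pads"]
pairs = skipgram_pairs(jack, window=1)
```

With a window of 1, only adjacent purchases are paired; widening the window would also pair the basketball with the sneakers, which is how items bought in the same session end up close in the latent space.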
Feature extraction from graphs. The user-item interaction data can also be treated as a graph. As shown in the blue box in Figure 2, the user-item interaction graph is built from the users' historical interaction data. There are two types of nodes: users and items. A solid line connects a user node and an item node if the user interacts with the item. As depicted in Figure 4, user1 interacts with item1, item2, and item4, while user2 interacts with item2, item3, and item4. It is reasonable to infer that item2 and item4 are more similar because both of them are interacted with by the same users. When user3 interacts with item2, it is therefore appropriate to recommend item4 to user3. Thus, the user-item interaction graph is able to preserve rich auxiliary information. In addition, in order to make use of the heterogeneity, inspired by [44], we define metapaths, which reflect the semantic information between nodes in the graph. For example, given the metapath U-I-U, (U-I) means a user interacts with an item, and (I-U) means the item is interacted with by another user. We can sample instances of the predefined metapaths in the graph by random walk, and embed each node with a skip-gram model to capture the heterogeneity and structural information of the graph. Consider the user-item interaction graph $G(V, E, T)$, in which each node $v$ and each link $e$ are associated with their mapping functions $\phi(v): V \to T_V$ and $\varphi(e): E \to T_E$, respectively. For a given node $v$, we aim to maximize the probability of correctly predicting its neighborhood nodes $N_t(v)$, $t \in T_V$. We use the following skip-gram objective to learn node representations that preserve the heterogeneous semantic information:
$$\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta) \tag{3}$$
where $N_t(v)$ denotes the neighborhood of node $v$ of the $t$-th type. The node type can be a user or an item.
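As a rough illustration of metapath instance sampling, the sketch below performs a random walk that alternates between user and item nodes on the toy graph described for Figure 4, which amounts to following the U-I-U metapath repeatedly. The dictionaries, the function name `metapath_walk`, and the uniform choice of the next neighbor are assumptions; the paper's own sampling strategy may differ.

```python
import random
from typing import Dict, List

# Toy bipartite graph from the Figure 4 description (user -> items).
user_items: Dict[str, List[str]] = {
    "user1": ["item1", "item2", "item4"],
    "user2": ["item2", "item3", "item4"],
}
# Reverse index (item -> users), needed for the I-U step.
item_users: Dict[str, List[str]] = {}
for user, items in user_items.items():
    for item in items:
        item_users.setdefault(item, []).append(user)


def metapath_walk(start_user: str, length: int, rng: random.Random) -> List[str]:
    """Sample one walk that follows the U-I-U metapath repeatedly."""
    walk = [start_user]
    while len(walk) < length:
        current = walk[-1]
        # The node type decides which adjacency list is legal for the next hop.
        neighbors = user_items[current] if current in user_items else item_users[current]
        walk.append(rng.choice(neighbors))
    return walk


walk = metapath_walk("user1", length=5, rng=random.Random(0))
```

Each sampled walk can then be treated like a sentence and passed to a skip-gram model, with users and items serving as the two node types.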
After the two auxiliary feature extraction methods described above, we obtain a static embedding of each user node and item node with rich semantic information. Then, a GCN module is applied to the user-item graph to learn structural information. Unlike a general GCN, which aggregates the features of first-order neighbor nodes, we design a GCN that aggregates only the features of second-order neighbor nodes, because the user-item graph is bipartite. In this way, each node obtains the features of its nearest neighbors of the same type, i.e., user nodes only receive information from the nearest other user nodes through common item nodes, and likewise for item nodes. In the layer-wise information aggregation, $H^l$ is the matrix of hidden representations of users and items in layer $l$, $A$ is the adjacency matrix, $D$ is the diagonal degree matrix of $A$, $W^l$ is the trainable weight matrix, $\sigma$ is a non-linear activation function, and $L$ is the number of layers. The final output $H^L = [u; i]$ contains the static user embeddings and the static item embeddings. Finally, the static user embedding $u$ and the static item embedding $i$ are each concatenated with the user-item interaction feature vector $f$ to generate the auxiliary information $o_u$ and $o_i$ for learning dynamic user and item embeddings.
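One possible reading of the second-order aggregation is sketched below for the user side of the toy graph. It assumes that the same-type adjacency is obtained from the user-item incidence matrix as R Rᵀ with the diagonal removed, that rows are normalized by their sum instead of the symmetric degree normalization, and that ReLU is the activation. These choices, and all function names, are assumptions made for illustration and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def second_order_adjacency(r: Matrix) -> Matrix:
    """Same-type adjacency: entry (p, q) counts the common neighbours of p and q."""
    rt = [list(col) for col in zip(*r)]
    a2 = matmul(r, rt)
    for k in range(len(a2)):
        a2[k][k] = 0.0  # assumption: a node is not its own second-order neighbour
    return a2


def aggregate(a2: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """One layer: ReLU(row-normalised A2 @ H @ W)."""
    out = []
    for row in a2:
        degree = sum(row) or 1.0  # isolated nodes keep a zero vector
        normalised = [[v / degree for v in row]]
        z = matmul(matmul(normalised, h), w)[0]
        out.append([max(0.0, v) for v in z])
    return out


# Incidence: user1 -> item1,item2,item4; user2 -> item2,item3,item4; user3 -> item2.
R = [[1, 1, 0, 1], [0, 1, 1, 1], [0, 1, 0, 0]]
H_users = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the toy example readable
A_uu = second_order_adjacency(R)
H_next = aggregate(A_uu, H_users, W)
```

In this toy case user1 and user2 share two items, so each contributes most of the other's aggregated feature, while user3 receives an equal mix of both.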

Dynamical Embedding Generation
In this section, we introduce how to generate the dynamic embeddings of users and items based on the static embeddings learned above.
Recursive RNN. As illustrated in the green box of Figure 2, there are two recursive RNN models. One is the UserRNN and the other is ItemRNN. The two components are designed to generate user and item dynamic embeddings according to their historical interaction, and they are shared by users and items to mutually learn dynamic embeddings of users and items. A user/item RNN is composed of RNN layers and the hidden states of the user/item RNN are used to represent user/item embeddings.
In the recursive RNN module, the user and item embeddings are updated by the user RNN module and the item RNN module, respectively, when an interaction between a user and an item occurs. To be specific, the user embedding $u_t$ is updated based on the user embedding $u_{t-1}$ and the item embedding $i_{t-1}$ at the previous timestamp $t - 1$, together with the auxiliary feature vector $o_u$. The item embedding is updated with the item RNN in a similar way. In this way, the user and item embeddings are able to absorb the information hidden in the interaction data. Note that the embeddings of users and items evolve with the dynamic user-item interactions. A key advantage of the designed recursive RNN model is that users and items are embedded into the same latent space, and thus the similarity between users and items can be easily obtained by measuring their distance in that space. For example, if users A and B interacted with items C and D, respectively, and item C is similar to item D, we consider users A and B to be similar as well. Based on this idea, the embeddings of users and items are updated iteratively by the following formula:
$$u_t = \sigma\left(W^u_1 u_{t-1} + W^u_2 i_{t-1} + W^u_3 o_u\right), \qquad i_t = \sigma\left(W^i_1 i_{t-1} + W^i_2 u_{t-1} + W^i_3 o_i\right) \tag{5}$$
In Equation (5), $u_t$ and $i_t$ are the dynamic embeddings of user $u$ and item $i$ at time $t$, $o_u$ and $o_i$ are the auxiliary features, and $\sigma$ is the sigmoid function. $W^u_1, \ldots, W^u_3$ are the parameter matrices of the user RNN, and $W^i_1, \ldots, W^i_3$ are the parameter matrices of the item RNN.
Shift embedding. In order to predict the embedding of the user in the future, we design a shift embedding module that works as an embedding projection operation. The predicted embedding can then be used for downstream tasks, such as link prediction or recommendation. Existing works ignore the importance of temporal information and are not capable of continuously updating the interaction embeddings over time; the embeddings are only updated discretely when new interactions occur. However, we argue that users and items can evolve continuously over time, for example in the interests of users and the attributes of items. We assume that the user and item embeddings can still change smoothly and continuously over time even if there is no new interaction between a user and an item.
The shift embedding module shown in Figure 2 captures the temporal dynamics of the user embedding by considering the elapsed time. If the time gap between two successive interactions of one user is large, the user embedding learned from the previous interaction is not appropriate for predicting the current one, so the module continuously adjusts the user embedding over the time interval between two successive interactions with items. In other words, recent interactions have a greater impact on the user embedding, while interactions from long ago have a weaker impact. As in previous work, the feedback loop of the RNN keeps the previous information of hidden states as internal memory. First, we use a linear layer to obtain the internal memory that needs to be adjusted, $u^S_t$. Then it is adjusted by the elapsed time to give $\hat{u}^S_t$. Finally, the adjusted internal memory is combined with the original user embedding to compose the shifted embedding, $u_{t+\Delta} = u_t + \hat{u}^S_t$. Here $\Delta$ denotes the time interval since the previous user-item interaction, $W_s$ is the parameter matrix of the linear layer, and $b$ is the bias. The function $g(\Delta) = W_p \cdot \log(e + \Delta)$ converts $\Delta$ to a time-context vector, where $W_p$ is a trainable parameter. $u_{t+\Delta}$ is the predicted user embedding at time $t + \Delta$.
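The mutual update and the shift step described above can be sketched together as follows. This is a minimal illustration under stated assumptions: the update is taken to be a sigmoid over the sum of three linear terms, the internal memory is taken to be W_s u_t + b, and the elapsed-time adjustment is taken to be an element-wise product with g(∆). The function names `rnn_step` and `shift` and all toy values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def rnn_step(w_self: Matrix, w_other: Matrix, w_aux: Matrix,
             own_prev: Vector, other_prev: Vector, aux: Vector) -> Vector:
    """sigmoid(W1 own_{t-1} + W2 other_{t-1} + W3 o); the same form serves
    the user RNN and the item RNN, each with its own weights."""
    parts = zip(matvec(w_self, own_prev), matvec(w_other, other_prev), matvec(w_aux, aux))
    return [sigmoid(a + b + c) for a, b, c in parts]


def shift(u_t: Vector, w_s: Matrix, b: Vector, w_p: Vector, delta: float) -> Vector:
    """u_{t+delta} = u_t + g(delta) * (W_s u_t + b), g(delta) = W_p * log(e + delta).
    The element-wise product is an assumption of this sketch."""
    memory = [m + bias for m, bias in zip(matvec(w_s, u_t), b)]
    gate = [p * math.log(math.e + delta) for p in w_p]
    return [u + g * m for u, g, m in zip(u_t, gate, memory)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
u_t = rnn_step(identity, identity, identity, zeros, zeros, zeros)  # user update
i_t = rnn_step(identity, identity, identity, zeros, zeros, zeros)  # item update
u_future = shift(u_t, identity, zeros, [0.1, 0.1], delta=0.0)
u_later = shift(u_t, identity, zeros, [0.1, 0.1], delta=10.0)
```

Because log(e + ∆) grows with ∆, a longer gap since the last interaction moves the projected embedding further from u_t, which matches the intuition that an old embedding is a weaker predictor of the current state.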

Overall Objective Function
In the final prediction step, we aim to predict the embedding $\tilde{j}_{t+\Delta}$ of the item that a user will interact with. Most existing models select the item with the highest interaction probability among all user-item pairs, which usually requires a neural-network forward pass for each item and is very time-consuming. Instead of searching for the highest interaction probability among all user-item pairs, we directly output the item embedding vector. Our model therefore needs only one forward pass to predict the item embedding, after which the item whose embedding vector is closest to the predicted one in the embedding space is selected. This makes our model much more efficient than such models. The item embedding prediction function is given as follows:
$$\tilde{j}_{t+\Delta} = W_1 u_{t+\Delta} + W_2 u + W_3 i_{t+\Delta-1} + W_4 i + b \tag{7}$$
where $W_1, \ldots, W_4$ are trainable parameters and $b$ is a bias vector; $u$ and $i$ represent the static embeddings of users and items, respectively; and $u_{t+\Delta}$ is the output of the shift embedding module. Note that $i_{t+\Delta-1}$ is the dynamic item embedding before $t + \Delta$.
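The retrieval step, picking the item whose embedding is closest to the predicted one, can be sketched as a nearest-neighbor lookup. The catalog contents, the two-dimensional embeddings, and the function names below are illustrative assumptions; L2 distance is used here, and a linear scan stands in for whatever index a production system would use.

```python
import math
from typing import Dict, List

Vector = List[float]


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_item(predicted: Vector, item_embeddings: Dict[str, Vector]) -> str:
    """Return the id of the item whose embedding is closest to the prediction."""
    return min(item_embeddings, key=lambda item: l2_distance(predicted, item_embeddings[item]))


catalog = {
    "earphone": [0.9, 0.1],
    "suit": [0.1, 0.9],
    "phone_case": [0.7, 0.3],
}
predicted_embedding = [0.85, 0.2]  # stands in for the output of the prediction layer
best = nearest_item(predicted_embedding, catalog)
```

The model is evaluated once to obtain `predicted_embedding`; only the cheap distance computation is repeated per item, which is the source of the efficiency gain described above.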
To train the parameters of the model, we minimize the $L_2$ difference between the predicted item embedding $\tilde{j}_t$ and the ground-truth item embedding $j_t$ at every interaction. We aim to minimize the following loss:
$$\mathcal{L} = \sum_{(u, i, t, f) \in \mathcal{S}} \left\| \tilde{j}_t - j_t \right\|_2 + \lambda_U \left\| u_t - u_{t-1} \right\|_2 + \lambda_I \left\| i_t - i_{t-1} \right\|_2 \tag{8}$$
The first loss term is the error of the predicted embedding vector. The last two terms are embedding smoothness regularizers that prevent the embeddings of users and items from changing sharply, and $\lambda_U$ and $\lambda_I$ are scaling parameters.
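For a single interaction, the three-term loss described above (prediction error plus two smoothness penalties) might be computed as in the sketch below. Using unsquared L2 norms is an assumption of this sketch, and the function and argument names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def l2(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def interaction_loss(pred_item: Vector, true_item: Vector,
                     u_t: Vector, u_prev: Vector,
                     i_t: Vector, i_prev: Vector,
                     lambda_u: float, lambda_i: float) -> float:
    """Prediction error plus smoothness penalties for one interaction."""
    prediction_error = l2(pred_item, true_item)
    user_smoothness = lambda_u * l2(u_t, u_prev)  # discourages abrupt user drift
    item_smoothness = lambda_i * l2(i_t, i_prev)  # discourages abrupt item drift
    return prediction_error + user_smoothness + item_smoothness


loss = interaction_loss(
    pred_item=[0.0, 0.0], true_item=[3.0, 4.0],
    u_t=[1.0, 0.0], u_prev=[0.0, 0.0],
    i_t=[0.0, 2.0], i_prev=[0.0, 0.0],
    lambda_u=0.5, lambda_i=0.25,
)
```

In the toy call the prediction error is 5, the user term is 0.5 × 1, and the item term is 0.25 × 2, so the scaling parameters directly control how much embedding drift is tolerated relative to prediction accuracy.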

Experiment
In this section, we evaluate our model on three real-world datasets: Wikipedia edits, Reddit posts, and JingDong online business. We will first introduce the datasets, the baselines, and the experiment setup, and then discuss the experiment results.

Dataset
The Wikipedia dataset, the Reddit dataset, and the JingDong dataset are used for evaluation, and Table 2 presents the details of the three datasets.
• Wikipedia editing dataset: This dataset contains one month of edits made on Wikipedia pages. We select the 1000 pages that received the most edits as items and the editors who made at least 5 edits as users. In total, we have 8227 users. There are 157,474 interactions between the selected users and pages, and the edited text is considered as features.
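The selection rule for the Wikipedia data (most-edited pages as items, sufficiently active editors as users) can be sketched as a two-stage filter. The order of the two stages, the counting of an editor's edits only on the retained pages, and the function name `filter_edits` are assumptions; the paper does not spell out these details.

```python
from collections import Counter
from typing import List, Tuple

Edit = Tuple[str, str]  # (editor, page); timestamps and text features omitted


def filter_edits(edits: List[Edit], top_pages: int, min_edits: int) -> List[Edit]:
    """Keep edits on the most-edited pages, then keep sufficiently active editors."""
    page_counts = Counter(page for _, page in edits)
    kept_pages = {page for page, _ in page_counts.most_common(top_pages)}
    on_kept_pages = [(editor, page) for editor, page in edits if page in kept_pages]
    editor_counts = Counter(editor for editor, _ in on_kept_pages)
    return [(e, p) for e, p in on_kept_pages if editor_counts[e] >= min_edits]


toy_edits = [("a", "p1"), ("a", "p1"), ("b", "p1"), ("a", "p2"), ("c", "p3")]
filtered = filter_edits(toy_edits, top_pages=1, min_edits=2)
```

For the setting described above, the call would use `top_pages=1000` and `min_edits=5`.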

Baselines and Evaluation Metrics
We compare the proposed method with the following baseline models.
• LSTM [45] is an important ingredient of RNN architectures. Here we simply record the sequence of items, dropping the time information.
• Time-LSTM [45] is an LSTM variant that equips LSTM with time gates to model the time intervals.
• Jodie [36] is a coupled recurrent neural network model that learns the dynamic embeddings of users and items. Here we ignore the one-hot embedding for items in Jodie, because it cannot be used with a large number of items.
• NGCF [28] is a recommendation framework based on a graph neural network, which explicitly encodes the collaborative signal in the user-item bipartite graph by performing embedding propagation.
• LightGCN [31] is a state-of-the-art collaborative filtering-based method. It simplifies the design of GCN to make it more concise and appropriate for recommendation.
• RRNN-S [1] is a recent state-of-the-art recursive RNN based shift embedding model for predicting dynamic user-item interactions.
We use 80% of the data for training, 10% for validation, and the remaining 10% for testing. We adopt mean reciprocal rank (MRR) and Recall@K as the evaluation metrics. MRR is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by the probability of correctness. It is the average of the reciprocal ranks of the results for a sample of queries $Q$:
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$
where $\mathrm{rank}_i$ refers to the rank position of the first relevant document for the $i$-th query. Recall@K measures the fraction of the total number of relevant instances that are actually retrieved within the top K results.
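Both metrics can be computed from the rank of the ground-truth item in each test query. The sketch below assumes one true item per interaction, so Recall@K reduces to the share of queries whose true item appears in the top K; the function names and the toy ranks are illustrative.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """ranks[i] is the 1-based position of the true item for the i-th query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """With one true item per query: share of queries ranked within the top K."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


ranks = [1, 2, 4, 20]
mrr = mean_reciprocal_rank(ranks)
recall10 = recall_at_k(ranks, k=10)
```

For these four toy queries the reciprocal ranks are 1, 0.5, 0.25, and 0.05, and three of the four true items fall inside the top 10.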
The embedding dimension is set to 128, the learning rate is 0.001, and the model is trained with the Adam optimizer with a weight decay of 0.00001. The loss curves of the training process for the two datasets are shown in Figure 5. The model converges quickly: within around 10 epochs, the training loss first drops quickly and then becomes stable.
Table 3 shows the results of our model and the baseline models. We observe that MRRNN-S outperforms the baselines in most cases on the three datasets. It is also worth noting that our model performs significantly better than the other models on the JingDong dataset, which is a relatively larger dataset. The best results are highlighted in bold, and the best results achieved by baselines are underlined. Among the baselines, LSTM only uses the interaction order information in the item sequence and does not take the length of the time interval between two successive interactions into consideration. As a variant of LSTM, Time-LSTM incorporates the time interval information into sequence learning. Thus, Time-LSTM outperforms LSTM by 12.7% on the Wikipedia dataset and 12.2% on the JingDong dataset. In contrast to LSTM and Time-LSTM, NGCF and LightGCN are collaborative filtering-based methods, which are not able to capture the time information in the temporal user-item interaction network. These two methods are therefore not suitable for dynamic user-item interaction prediction, and their performance is the worst among all the methods. Jodie considers the dynamic and sequential dependence between user-item interactions, which means richer auxiliary information can be absorbed into the model for a more accurate prediction. RRNN-S is able to capture the user-user and item-item relationships, which helps to learn the similarity between nodes from high-order proximity. Compared with Jodie, it improves the performance by 4.2%, 2.1%, and 9.2% on the three datasets, respectively. MRRNN-S not only obtains sequential patterns from sequence data but also explores semantic and structural information from heterogeneous graph data. Thus MRRNN-S can acquire more detailed auxiliary information, which results in MRRNN-S outperforming RRNN-S. As the size of the dataset (the number of interactions) increases, the performance improvement of MRRNN-S becomes more pronounced, especially on the JingDong dataset.

Ablation Study
To investigate whether the components in our proposed model are all useful, we further compare MRRNN-S with the following four variants.
• MRRNN-1 drops the meta-path module, which acquires the heterogeneity and semantic information from the user-item interaction graph. Only the embedding learned by the word2vec module is fed to the prediction model.
• MRRNN-2 drops the word2vec module, which captures the sequential information from the interaction sequence. Only the feature vectors extracted by the meta-path module are fed to the remaining part of MRRNN-S.
• MRRNN-3 drops the GCN module, which is designed to capture the structural information from the user-item interaction graph. Only the feature vectors processed by the meta-path module and the word2vec module are input into the model.
• MRRNN-4 drops the meta-path, word2vec, and GCN modules at the same time, and randomly initializes the embedding from a normal distribution as the model input.
As shown in Table 4, the word2vec module, the meta-path module, and the GCN module are all useful, because the performance decreases when any one of them is removed. To show the influence of each module on MRRNN-S more intuitively, Figure 6 presents a histogram of each variant's performance as a percentage of the full model's performance on the two datasets. Figure 6 indicates that the sequential information hidden in the interaction sequence appears to be the most important on both datasets, because ignoring it results in a remarkable decline in MRR and Recall@10. One possible reason is that the datasets are mainly composed of sequence data, while the graph structure is relatively simple; in particular, there are only two types of nodes that can be utilized. In addition, the performance of MRRNN-2 differs considerably between the two datasets, with the performance on the Wikipedia dataset significantly better than on the JingDong dataset. This is probably because the JingDong dataset combines a large amount of data with very few node types, which makes it more difficult for the model to acquire useful auxiliary information from the sparse heterogeneity. Overall, MRRNN-S achieves the best performance on both the Wikipedia and JingDong datasets when the three modules are combined, which verifies that the three proposed modules are all helpful for improving the model performance.
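The normalization behind the histogram in Figure 6 can be sketched as follows. The function name `percent_of_full` and the scores are hypothetical; the numbers are placeholders chosen for readability and are not the values reported in Table 4.

```python
from typing import Dict


def percent_of_full(variant_scores: Dict[str, float], full_score: float) -> Dict[str, float]:
    """Express each variant's score as a percentage of the full model's score."""
    if full_score <= 0:
        raise ValueError("full_score must be positive")
    return {name: 100.0 * score / full_score for name, score in variant_scores.items()}


# Placeholder scores for illustration only; these are NOT results from the paper.
placeholder_scores = {"variant_a": 0.6, "variant_b": 0.4}
placeholder_full = 0.8

for name, pct in percent_of_full(placeholder_scores, placeholder_full).items():
    print(f"{name}: {pct:.1f}% of the full model")
```

Normalizing by the full model's score puts both datasets on a common scale, which is what makes the per-module drops comparable across datasets of different difficulty.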

Parameters Sensitivity Analysis
In this section, to analyze the impact of the parameters on model performance, we study the sensitivity of our model to the embedding dimension and to the parameters $\lambda_I$ and $\lambda_U$. Figure 7 depicts the MRR under different embedding dimensions on the JingDong dataset. The best performance is achieved when the dimension is set to 64. The performance of the model first improves as the embedding dimension increases, possibly because a higher-dimensional embedding is able to contain more information. However, when the dimension continues to increase beyond this point, the performance decreases, possibly because too high a dimension leads to overfitting.
Figure 8 shows the influence of the parameters $\lambda_I$ and $\lambda_U$ on Recall@K on the Wikipedia dataset. In Figure 8a, $\lambda_I$ varies from 0 to 1 and we fix $\lambda_U = 1$. MRRNN-S achieves the best performance across all Recall@K when $\lambda_I = 0.2$. Next, we study the effect of $\lambda_U$ on model performance when $\lambda_I = 0.2$. The result is presented in Figure 8b, which shows that MRRNN-S achieves the best performance when $\lambda_U = 1$.
Furthermore, to study the robustness of MRRNN-S, we vary the percentage of training data in the prediction task on the Wikipedia dataset. Specifically, with the other parameters fixed, the proportion of training data is varied from 10% to 80%. In each case, the 10% of interactions after the training data are used for validation, and the next 10% of interactions are used for testing. The purpose of this setting is to compare performance on test sets of the same size. Figure 9 shows how the MRR and Recall@10 on the Wikipedia dataset change as the training data proportion increases. The performance curves fluctuate around 0.8 throughout, which means that the performance of our model is stable even when the proportion of training data changes.
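The robustness setting can be illustrated with a short chronological split. This is a sketch under stated assumptions, not the authors' data pipeline: it assumes the interactions are already sorted by timestamp and that the validation and test blocks immediately follow the training block. The function name `chronological_split` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    interactions: Sequence[T],
    train_fraction: float,
    val_fraction: float = 0.10,
    test_fraction: float = 0.10,
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered interactions into consecutive train/validation/test blocks."""
    # The small tolerance guards against floating-point error at the 80% boundary.
    if not 0.0 < train_fraction <= 1.0 - val_fraction - test_fraction + 1e-12:
        raise ValueError("fractions must leave room for validation and test data")
    n = len(interactions)
    train_end = int(round(n * train_fraction))
    val_end = train_end + int(round(n * val_fraction))
    test_end = val_end + int(round(n * test_fraction))
    return (
        list(interactions[:train_end]),
        list(interactions[train_end:val_end]),
        list(interactions[val_end:test_end]),
    )


# Toy example: 100 interactions, already sorted by timestamp.
events = list(range(100))
for fraction in (0.1, 0.2, 0.4, 0.8):
    train, val, test = chronological_split(events, fraction)
    print(fraction, len(train), len(val), len(test))
```

Because the validation and test blocks keep a fixed size while the training block grows, differences in the measured metrics reflect the amount of training data rather than the size of the test set.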

Conclusions
In this paper, we proposed a novel Metapath-guided Recursive RNN based Shift embedding method named MRRNN-S for predicting which item a user will interact with in the future. To capture the sequential information from the user-item interaction sequence data, we project each user and item into a latent space with a word2vec module. The heterogeneity hidden in the user-item interaction graph is also extracted by a meta-path module to provide richer auxiliary information. A recursive RNN is utilized to learn the user and item embeddings by considering both dynamic and static features. Additionally, we designed a shift embedding module that incorporates the time interval information for predicting future user embeddings. Experimental results on two real-world datasets demonstrated the effectiveness of our model. Compared with the results from our previous paper [1], MRRNN-S achieves superior performance on a large dataset. This verifies the effectiveness of adding the two new modules to obtain more latent information from the sequence data and the graph data, respectively, so that better feature representations of users and items are obtained by MRRNN-S. In the future, it would be interesting to study whether the current framework for data modeling can be extended to other kinds of applications.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: http://snap.stanford.edu/jodie/, accessed on 22 January 2022. The JingDong dataset presented in this study is available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest: The authors declare no conflict of interest.