Collaborative Co-Attention Network for Session-Based Recommendation

Abstract: Session-based recommendation aims to model a user's intent and predict the item that the user may interact with next, based on an ongoing session. Existing session-based recommender systems mainly model either the sequential signals with Recurrent Neural Network (RNN) structures or the transition relations between items with Graph Neural Network (GNN) based frameworks to identify a user's intent for recommendation. However, in real scenarios, there may be strong sequential signals in users' adjacent behaviors or multi-step transition relations among different items. Thus, RNN-based or GNN-based methods alone can only capture limited information for modeling complex user behavior patterns. RNNs pay attention to the sequential relations among consecutive items, while GNNs focus on structural information, i.e., how to enrich an item embedding with its adjacent items. In this paper, we propose a Collaborative Co-attention Network for Session-based Recommendation (CCN-SR) to incorporate both sequential and structural information, as well as capture the co-relations between them, for obtaining an accurate session representation. To be specific, we first model the ongoing session with an RNN structure to capture the sequential information among items. Meanwhile, we also construct a session graph to learn the item representations with a GNN structure. Then, we design a co-attention network upon these two structures to capture the mutual information between them. The designed co-attention network can enrich the representation of each node in the session with both sequential and structural information, and thus generate a more comprehensive representation for each session. Extensive experiments are conducted on two public e-commerce datasets, and the results demonstrate that our proposed model outperforms state-of-the-art baseline models for session-based recommendation in terms of both Recall and MRR. We also investigate different combination strategies, and the experimental results verify the effectiveness of our proposed co-attention mechanism. Besides, our CCN-SR model achieves better performance than the baseline models on sessions of different lengths.


Introduction
With the information on the Internet increasing at a rapid speed, recommender systems have been proposed to provide users with the information they need in an efficient way [1][2][3][4]. Many general recommendation approaches rely on users' historical behaviors to make personalized recommendations. For example, collaborative filtering (CF) [5] builds the user-item interaction matrix and learns user as well as item representations so as to fill in the matrix and make recommendations. However, in some real scenarios, users' personal information, e.g., user IDs, is not available. For instance, users may not log into the recommender system when using an online shopping service. In those cases, it is challenging to recommend satisfying items to users based only on the limited behaviors in a session. Session-based recommendation (SBRS) has thus been proposed to deal with this task and make recommendations based on an ongoing session [3].
Since the items in a session may be connected by sequential relations, early methods model the transition and co-occurrence relations between items with ItemKNN [6] and Markov Chains [7]. However, those models make a strong assumption about the independence of past interactions and mainly rely on the last behavior to make recommendations, which limits the recommendation accuracy for SBRS. Recently, Recurrent Neural Networks (RNNs) have played an important role in session-based recommendation tasks due to their ability to model the sequential relations among items. Hidasi et al. [3] first use a Gated Recurrent Unit (GRU) to model user behavior sequences in a session and propose the GRU4Rec model. After that, the attention mechanism has been adopted and has helped to boost the performance of session-based recommendation. Li et al. [8] apply the attention mechanism to distinguish the importance of different items and then combine the weighted hidden states and the last hidden state to make final recommendations. However, when users click some unrelated items in a session, RNN-based approaches may be misled by those noisy interactions, which results in an inaccurate session representation and unsatisfactory recommendations. As RNNs mainly focus on the sequential transitions among items in a single direction, Graph Neural Network (GNN)-based approaches have been proposed to enrich the item representation with its neighbors by propagating information between adjacent items. For example, Wu et al. [2] propose to construct a session graph to represent a session and then learn the item embeddings with graph neural networks, which achieves satisfactory results. However, GNN-based approaches often ignore the sequential information among user behaviors and cannot capture long-term context information for the next item recommendation.
Thus, in our paper, we propose a Collaborative Co-attention Network for Session-based Recommendation (CCN-SR), which takes advantage of both RNN and GNN structures. More specifically, we first input the user behaviors in an ongoing session, i.e., the items the user interacted with, into a GRU network to model the sequential relations among those behaviors. Meanwhile, we also construct a session graph for the ongoing session and use a GNN network to model the structural information in those behaviors. After that, we obtain the hidden state for each item in the session from the GRU network, and the node embedding for each item in the session from the GNN network. Then, we propose a co-attention mechanism to incorporate the sequential as well as structural information to get an accurate representation of the session. The co-attention mechanism can capture the mutual relations between these two kinds of information and thus generate a more comprehensive representation of each session. Specifically, we design two strategies to realize our co-attention mechanism, i.e., parallel co-attention and alternating co-attention. We conduct experiments on two public e-commerce datasets to verify the effectiveness of our CCN-SR model and explore the differences in performance between our two proposed co-attention mechanisms and a simple concatenation strategy. The results demonstrate that our CCN-SR model achieves better performance than the state-of-the-art baseline model in terms of both Recall and MRR.
The main contributions of this paper can be summarized as follows:
• To the best of our knowledge, we are the first to incorporate both structural and sequential information and capture their co-dependent relations for session-based recommendation;
• We propose a Collaborative Co-attention Network for Session-based Recommendation (CCN-SR) model, which introduces the co-attention mechanism upon RNN and GNN networks to get the mutual information between them and enrich the session representations;
• We conduct comprehensive experiments on two publicly available datasets by comparing with state-of-the-art baselines to validate the effectiveness of our proposal. Experimental results show that CCN-SR outperforms the baselines in terms of both Recall and MRR.
We summarize related literature in Section 2. The details of our proposed model are described in Section 3. The experimental setups and datasets are introduced in Section 4. Finally, we give our analysis of the experimental results in Section 5 and the conclusion in Section 6.

Related Works
In this section, we summarize the literature related to our work, which we divide into three groups: general recommendation approaches, session-based recommendation approaches and attention-based approaches.

General Recommendation Approaches
General recommendation approaches, which predict users' general preferences based on their historical interactions, have been applied widely in recommender systems. Most general recommenders are based on Collaborative Filtering (CF), which aims to factorize the user-item interaction matrix into two low-rank matrices containing user latent vectors and item latent vectors [9][10][11][12][13]. Traditional methods such as Singular Value Decomposition (SVD) [14] generate a user's preference towards an item with a linear product of the user's latent vector and the item's latent vector. However, the linear kernel often cannot model the users' preferences accurately, and many researchers have shown with extensive experiments that nonlinearity has potential advantages for improving the performance of recommender systems [15][16][17]. Thus, deep learning-based recommendation approaches have been proposed and have boosted the performance of general recommendation.
Restricted Boltzmann Machines (RBM) [18][19][20] were proposed as an early neural recommender system. They apply a two-layer undirected graph to model tabular data, such as users' explicit ratings of movies. For top-N recommendation, He et al. [5] propose to use multi-layer perceptrons to model the two-way interaction between users and items, which captures the non-linear relationship between users and items and achieves satisfactory results. Some other recommendation models [21][22][23] use a convolutional neural network (CNN) to integrate external information, e.g., the review text or contextual information, which helps to improve the recommendation performance.
However, those general recommendation models often ignore the changes in users' preferences and always generate the same recommendations for a user. Thus, they are not suitable for session-based recommendation, where the recommended items should be adapted to a user's current interest.

Session-Based Recommendation Approaches
Session-based recommendation aims to capture users' dynamic preferences in an ongoing session. In early stages, Markov Chains were adopted to model the transition relations between adjacent items [7,24,25,26]. Recently, neural network based approaches, e.g., RNN-based models, have been widely adopted for session-based recommendation. Hidasi et al. [3] first introduce the Gated Recurrent Unit (GRU) to model the current session and propose the GRU4Rec model as well as a session-parallel mini-batch training process. Following [3], an improved RNN-based approach has been proposed in [27], which applies a data augmentation strategy and addresses the distribution shifts in the input data. Hidasi and Karatzoglou [28] propose an improved loss function to optimize the training process of the GRU4Rec model and achieve good performance. Bogina and Kuflik [29] incorporate the dwell time into the RNN structure and boost the performance of session-based recommendation. This also indicates that the sequential relation among items in a session is important when making recommendations. There are also some memory-based approaches for session-based recommendation. For example, Chen et al. [30] propose a Recommendation with User Memory Network (RUM) model, which uses external memory to store and distinguish users' interactions in a session.
RNNs and memory networks cannot capture some complex relations among items in a session, e.g., the structural information between those items. Thus, GNN-based approaches have been proposed [2,31]. Wu et al. [2] are the first to introduce graph neural networks into session-based recommendation and propose the SR-GNN model. They construct a session graph for each session and then apply the gated graph neural network (GGNN) to generate node representations in the graph, which finally help to make recommendations. Based on this work, some studies take the long-term dependencies among items into consideration and generate more accurate session representations [32,33]. Moreover, Yu et al. [34] propose a target-attention mechanism within a graph neural network, which also improves the recommendation performance over the SR-GNN model. In addition, Qiu et al. [35] adopt a weighted graph neural network to distinguish the importance of the propagated information. However, those GNN-based approaches cannot model the sequential relations between user behaviors accurately and only propagate information between adjacent items, which limits their ability to capture context information in a session.

Attention-Based Recommendation Approaches
The attention mechanism helps to distinguish the importance of different items in a session, which can boost the performance of session-based recommendation [36,37]. Li et al. [8] propose a Neural Attentive Recommendation Machine (NARM) model, which regards the last hidden state modeled by a session-based RNN as the global encoder, and uses the other hidden states for calculating attention weights of different items to capture users' current intents. As for memory-based models, Liu et al. [36] propose the Short-Term Attention/Memory Priority (STAMP) model, where the attention weights for different interactions are calculated based on the session context and the final records in the current session. The SR-GNN model [2] also applies the attention mechanism to distinguish the importance of different items in the current session, in the same way as NARM [8].
The aforementioned methods often adopt an attention mechanism that is enhanced by the last hidden state, which is not suitable for capturing the mutual information between different structures, i.e., RNNs and GNNs. In contrast, we propose to use a co-attention mechanism to generate co-dependent representations of each item in a session and can thus make more accurate recommendations.

Methods
The Collaborative Co-attention Network for Session-based Recommendation (CCN-SR) model we propose in this paper contains four main components: an RNN-based session encoder, a GNN-based session encoder, a co-attention network and a prediction layer. We show the main framework of our model in Figure 1, in which these components can be trained and optimized in an end-to-end way. In the following sections, we first describe the problem formulation and notation, and then we give detailed descriptions of each component in CCN-SR.

Problem Formulation and Notation
Given a user and their sequential interactions in a session, we aim to recommend their next interaction based on their short-term preferences learned from previous behaviors in the session.
We denote the current session as $S = \{v_1, v_2, \ldots, v_T\}$, where $v_i$ is the $i$-th item interacted with by a user in the session and $T$ denotes the number of events in the current session. In Figure 1, an embedding layer is built at the bottom of the network, which generates the item embeddings shared by both the RNN network and the GNN network. We use $\mathbf{v}_i$ to indicate the embedding of $v_i$. We summarize the notations used in our paper in Table 1.

Table 1. Summary of the main notations used in the paper.

Notation | Description
$S$ | the current session
$\mathbf{v}_i$ | the embedding of item $v_i$
$h_t$ | hidden state at timestep $t$ in the RNN-based session encoder
$z_t$ | the update gate in GRU
$r_t$ | the reset gate in GRU
$\mathbf{v}^g_t$ | the node vector generated by the GNN-based session encoder for item $v_t$
$S_g$ | the output of the GNN-based session encoder
$C$ | the affinity matrix in the co-attention network
$W_{sr}, W_{sg}, W^r_t, W^g_t, w_{hr}, w_{hg}$ | the weight parameters in the parallel co-attention strategy
$B_r, B_g, B_p$ | trainable parameters in the parallel co-attention strategy
$B_{cg}, m, A_1, A_2, B_a$ | trainable parameters in the alternating co-attention strategy
$U_S$ | the final session representation of $S$ generated by the co-attention network
$y_i$ | the prediction score of item $v_i$

RNN-Based Session Encoder
Recurrent Neural Networks (RNNs) have been widely used to model sequential data. Given a sequence $\{v_1, v_2, \ldots, v_T\}$, an RNN calculates a hidden state $h_t$ for each step $t$ in the sequence, which summarizes the sequence up to step $t$. $h_t$ is computed based on the hidden state of the former step $h_{t-1}$ and the current input $\mathbf{v}_t$:

$$h_t = f(h_{t-1}, \mathbf{v}_t), \tag{1}$$

where $f$ is the main function in the RNN. Different RNN architectures, e.g., Long Short-Term Memory (LSTM) [38] and Gated Recurrent Unit (GRU), have different functions. In our paper, we use GRU as the RNN-based session encoder since it shows better performance than simple RNNs and LSTM [3]. GRU contains a reset gate and an update gate, which control the information propagated from former steps to the current step. The hidden state $h_t$ is calculated as a linear combination of the former hidden state $h_{t-1}$ and the candidate hidden state $\hat{h}_t$:

$$h_t = (1 - z_t)\, h_{t-1} + z_t\, \hat{h}_t, \tag{2}$$

where the update gate $z_t$ is given by:

$$z_t = \sigma(W_z \mathbf{v}_t + U_z h_{t-1}), \tag{3}$$

where $\sigma$ is the sigmoid function, and $W_z$ and $U_z$ are update parameters for $\mathbf{v}_t$ and $h_{t-1}$, respectively. The candidate hidden state can be computed as:

$$\hat{h}_t = \tanh\left(W \mathbf{v}_t + U (r_t \odot h_{t-1})\right), \tag{4}$$

where $\odot$ denotes the Hadamard product, i.e., the element-wise product, and $W$ and $U$ are the parameters of the candidate state. The reset gate $r_t$ can be calculated by:

$$r_t = \sigma(W_r \mathbf{v}_t + U_r h_{t-1}), \tag{5}$$

where $W_r$ and $U_r$ are reset parameters for $\mathbf{v}_t$ and $h_{t-1}$, respectively. As the hidden state of each step contains the sequential information among the user's previous behaviors up to this step as well as the user's current intent, we collect the hidden states of all steps in the ongoing session modeled by the RNN structure as $S_r = \{h^r_1, h^r_2, \ldots, h^r_T\}$ with $S_r \in \mathbb{R}^{D \times T}$, where $D$ is the dimension of each hidden state in $S_r$. We then explore the structural information contained in the current session with a GNN structure.
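To make the encoder concrete, the following is a minimal NumPy sketch of a GRU pass over one session that collects every hidden state into a D×T matrix. It assumes the standard GRU formulation without bias terms; the sign convention of the update gate, the zero initial state, the parameter names in `params`, and the helper `init_params` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_encode(V, params):
    """Run a GRU over item embeddings V of shape (T, D); return S_r of shape (D, T)."""
    T, D = V.shape
    h = np.zeros(D)  # assumed zero initial hidden state
    states = []
    for t in range(T):
        v = V[t]
        z = sigmoid(params["W_z"] @ v + params["U_z"] @ h)  # update gate
        r = sigmoid(params["W_r"] @ v + params["U_r"] @ h)  # reset gate
        h_cand = np.tanh(params["W"] @ v + params["U"] @ (r * h))  # candidate state
        h = (1.0 - z) * h + z * h_cand  # blend previous and candidate state
        states.append(h)
    return np.stack(states, axis=1)  # one column per step


def init_params(D, rng):
    """Hypothetical helper: Gaussian init (mean 0, std 0.1) for all GRU matrices."""
    names = ["W_z", "U_z", "W_r", "U_r", "W", "U"]
    return {name: rng.normal(0.0, 0.1, size=(D, D)) for name in names}


rng = np.random.default_rng(0)
D, T = 4, 5
S_r = gru_encode(rng.normal(size=(T, D)), init_params(D, rng))
print(S_r.shape)
```

In practice a framework implementation of GRU would normally be used; the loop is written out here only to show which quantities end up in each column of the encoder output.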

GNN-Based Session Encoder
In this section, we model the transition relations between items and generate accurate item embeddings in the current session with a graph neural network. Let $V_S = \{v_1, v_2, \ldots, v_m\}$ denote the unique items in $S$. Note that $m$ may be smaller than $T$ since a session usually contains repeated interactions with the same item [35,39].
We first construct a directed session graph $G_s = (V_s, E_s)$ for each session, where $V_s$ and $E_s$ denote the nodes and edges, respectively. Each node $v_{s,i}$ represents an item $v_i$ in the current session, where $v_{s,i} \in V_S$. Each edge $(v_{s,i-1}, v_{s,i})$ indicates that the user interacts with item $v_i$ after $v_{i-1}$ in the session. As some edges may appear several times in a session, we assign weights to the edges to distinguish their importance. Specifically, the weight of an edge is calculated as the number of occurrences of the edge divided by the outdegree of its start node. We then build the adjacency matrices $A^{out}_s$ and $A^{in}_s$, which represent the connections between nodes in the session graph through outgoing and incoming edges, respectively. By concatenating these two matrices, we obtain the matrix $A_s$, which is then used in the learning process of the graph neural network.
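The graph construction described above can be sketched in plain Python as follows. The outgoing weights follow the stated rule (edge occurrences divided by the outdegree of the start node). Normalizing the incoming matrix by the indegree of the end node, counting degrees with edge multiplicity, and concatenating the two matrices side by side are assumptions made for this illustration; the function name `build_session_graph` is hypothetical.

```python
def build_session_graph(session):
    """Return (nodes, a_out, a_in, a_s) for one session given as a list of item IDs."""
    nodes = list(dict.fromkeys(session))  # unique items in first-seen order
    index = {item: i for i, item in enumerate(nodes)}
    m = len(nodes)

    # counts[i][j] = number of times the transition nodes[i] -> nodes[j] occurs
    counts = [[0.0] * m for _ in range(m)]
    for src, dst in zip(session, session[1:]):
        counts[index[src]][index[dst]] += 1.0

    # Outgoing matrix: row i holds the edges leaving node i, normalized by outdegree.
    a_out = []
    for i in range(m):
        total = sum(counts[i])
        a_out.append([c / total if total else 0.0 for c in counts[i]])

    # Incoming matrix (assumed symmetric rule): row i holds the edges entering node i,
    # normalized by indegree.
    a_in = []
    for i in range(m):
        column = [counts[j][i] for j in range(m)]
        total = sum(column)
        a_in.append([c / total if total else 0.0 for c in column])

    # Assumed layout of A_s: [A_out | A_in], giving an m x 2m matrix.
    a_s = [a_out[i] + a_in[i] for i in range(m)]
    return nodes, a_out, a_in, a_s


nodes, a_out, a_in, a_s = build_session_graph(["a", "b", "c", "b", "d"])
for item, row in zip(nodes, a_s):
    print(item, row)
```

In this toy session, item "b" is visited twice, so the graph has four nodes for five events, and the two edges leaving "b" each receive weight 0.5.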
The graph neural network can incorporate a node's neighbor features and update the node representation of v s,i as follows: where H ∈ R D×2D indicates the weights, A s,i: is the column in A s corresponding to the node v s,i . a t s,i denotes the information propagated from the neighbors of the node v s,i . z t s,i and r t s,i are the update gate and reset gate. W

Co-Attention Network
As we can see from Equation (6), the GNN-based encoder mainly captures the transition relations between adjacent items and models the structural information in the session. Meanwhile, the output of the RNN-based encoder contains the sequential information in the session. It is beneficial to incorporate these two kinds of information so as to generate a comprehensive session representation for making recommendations. Intuitively, concatenating the outputs of the RNN-based and GNN-based session encoders, i.e., $S_r$ and $S_g$, would be one choice. However, we argue that the two kinds of information can provide context for each other and thus help to distinguish the importance of different items, while simple concatenation cannot capture this mutual relation.
Thus, in this section, we design a co-attention mechanism to explore the relations between $S_r$ and $S_g$ and build an accurate session representation. Specifically, we design two strategies to realize the co-attention mechanism, i.e., parallel co-attention and alternating co-attention. We describe them in detail in the following sections.

Parallel Co-Attention
After modeling the sequential as well as structural information in the session, we use $S_r$ and $S_g$ as inputs of our parallel co-attention mechanism. We show the detailed calculation process of the parallel co-attention mechanism in Figure 2. We first calculate the affinity matrix $C$: where $W_c \in \mathbb{R}^{D \times D}$ is a transformation matrix. $C$ can be regarded as a co-relation matrix between $S_r$ and $S_g$. We then use it as context information and calculate the attention scores for the hidden state of each step in $S_r$: and the attention scores for each node vector in $S_g$ with: where $W_{sr}, W_{sg} \in \mathbb{R}^{K \times D}$ and $w_{hr}, w_{hg} \in \mathbb{R}^{K}$ are weight parameters for $S_r$ and $S_g$, respectively.
Here, $\alpha_r \in \mathbb{R}^{T}$ and $\alpha_g \in \mathbb{R}^{T}$ are the co-attention scores of items in $S_r$ and $S_g$.
Figure 2. Details of the parallel co-attention mechanism.
In both Equations (8) and (10), we emphasize the importance of the user's last behavior, i.e., $h^r_T$ and $\mathbf{v}^g_T$, which is marked with red lines in Figure 2. This is because the final interaction often plays an important role in predicting the user's next behavior in session-based recommendation, especially in some e-commerce scenarios, as has been shown in many studies, e.g., NARM [8] and STAMP [36]. Here, we use $W^r_t, W^g_t \in \mathbb{R}^{K \times D}$ as the weight parameters for $h^r_T$ and $\mathbf{v}^g_T$ so as to emphasize the last behavior adaptively. After calculating the co-attention weights, we can generate co-dependent session representations modeled by the RNN-based session encoder and the GNN-based session encoder as follows: Combining $U_{co-r}$ with $h^r_T$ and $U_{co-g}$ with $\mathbf{v}^g_T$, respectively, we can get: where $B_r \in \mathbb{R}^{D \times 2D}$ and $B_g \in \mathbb{R}^{D \times 2D}$ are used to compress the two vectors to get the hybrid representations.
Finally, we use a concatenation of $U_r$ and $U_g$ to generate the final session representation of $S$ as: where $B_p \in \mathbb{R}^{D \times 2D}$.
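As an illustration of how the pieces of the parallel strategy could fit together, the sketch below wires the stated parameter shapes into one plausible computation. The exact functional forms are assumptions for this example: the tanh affinity, the way the last-step terms are added to every position, and the softmax over positions follow a common parallel co-attention formulation, and the GNN node vectors are assumed to be arranged per session position so that both inputs have T columns. Only the parameter names and dimensions are taken from the text.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def parallel_co_attention(S_r, S_g, p):
    """S_r, S_g: (D, T) encoder outputs; p: dict of parameters. Returns (U_S, alpha_r, alpha_g)."""
    h_T, v_T = S_r[:, -1], S_g[:, -1]  # last-step representations
    C = np.tanh(S_r.T @ p["W_c"] @ S_g)  # assumed affinity form, shape (T, T)

    # Attention features for each encoder, using the other encoder as context
    # and adding the transformed last-step vector to every position.
    H_r = np.tanh(p["W_sr"] @ S_r + (p["W_sg"] @ S_g) @ C.T + (p["W_rt"] @ h_T)[:, None])
    H_g = np.tanh(p["W_sg"] @ S_g + (p["W_sr"] @ S_r) @ C + (p["W_gt"] @ v_T)[:, None])

    alpha_r = softmax(p["w_hr"] @ H_r)  # (T,) weights over RNN hidden states
    alpha_g = softmax(p["w_hg"] @ H_g)  # (T,) weights over GNN node vectors

    U_co_r = S_r @ alpha_r  # co-dependent sequential summary, (D,)
    U_co_g = S_g @ alpha_g  # co-dependent structural summary, (D,)

    U_r = p["B_r"] @ np.concatenate([U_co_r, h_T])  # compress 2D -> D
    U_g = p["B_g"] @ np.concatenate([U_co_g, v_T])
    U_S = p["B_p"] @ np.concatenate([U_r, U_g])  # final session representation
    return U_S, alpha_r, alpha_g


rng = np.random.default_rng(0)
D, K, T = 4, 3, 5
shapes = {
    "W_c": (D, D), "W_sr": (K, D), "W_sg": (K, D), "W_rt": (K, D), "W_gt": (K, D),
    "w_hr": (K,), "w_hg": (K,), "B_r": (D, 2 * D), "B_g": (D, 2 * D), "B_p": (D, 2 * D),
}
params = {name: rng.normal(0.0, 0.1, size=shape) for name, shape in shapes.items()}
S_r, S_g = rng.normal(size=(D, T)), rng.normal(size=(D, T))
U_S, alpha_r, alpha_g = parallel_co_attention(S_r, S_g, params)
print(U_S.shape, alpha_r.round(3), alpha_g.round(3))
```

The point of the sketch is the data flow: both attention distributions are computed from the same affinity matrix in one pass, and each summary is fused with its own last-step vector before the final projection.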

Alternating Co-Attention
In the aforementioned parallel co-attention strategy, we calculate the co-dependent representations $U_{co-r}$ and $U_{co-g}$ in parallel at the same time. In this section, we introduce another co-attention strategy, i.e., alternating co-attention, which can also capture the mutual information between $S_r$ and $S_g$, as well as integrate the sequential and structural information for session-based recommendation. We show the details of the alternating co-attention mechanism in Figure 3. As shown in Figure 3, we sequentially alternate between the initial outputs of the RNN-based session encoder and the GNN-based session encoder, as well as their attentive representations, and can thus take more information into consideration. First, we calculate the affinity matrix $C$ using Equation (7). Then we normalize it row-wise to produce the attention weights $A_g$ across the outputs of the RNN-based session encoder, i.e., $S_r$, for each node vector in the outputs of the GNN-based session encoder, i.e., $S_g$. At the same time, $C$ is also normalized column-wise to produce the attention scores $A_r$ across $S_g$ for each hidden state in the outputs of the RNN-based session encoder, i.e., $S_r$: Next, we generate the attentive representation of $S_r$ as: We can then incorporate $C_r$ with the initial representation $S_r$ to generate the candidate attentive representation of $S_g$ as: where $\hat{C}_g \in \mathbb{R}^{2D \times T}$. In this way, we can preserve the initial sequential information contained in $S_r$. We also keep the initial structural information by integrating $S_g$ into $\hat{C}_g$ and generate the final attentive representation of $S_g$ as: where $B_{cg} \in \mathbb{R}^{D \times 3D}$.
We finally concatenate $C_g$ and $C_r$ to generate the reformulated representations of behaviors in the session and adopt the same attention mechanism as NARM [8]: The global and local representations of the session $S$ can be denoted as $U^S_{global}$ and $U^S_{local}$: where $\alpha$ is the weighting factor calculated by: where $\sigma$ is an activation function, and $m$, $A_1$ and $A_2$ are learnable parameters. Then, by concatenating the global as well as local representations, we can generate the final session representation as: where $B_a \in \mathbb{R}^{D \times 2D}$.
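The first half of the alternating strategy can be sketched as below. This is one reading of the description, not a confirmed reproduction: the tanh affinity, the orientation of $C$ (rows index positions of $S_r$, columns index positions of $S_g$, which also decides which axis counts as row-wise), and the specific matrix products used for $C_r$, $\hat{C}_g$ and $C_g$ are assumptions chosen so that the stated shapes ($2D \times T$ for the candidate and $D \times 3D$ for $B_{cg}$) work out.

```python
import numpy as np


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def alternating_co_attention(S_r, S_g, W_c, B_cg):
    """S_r, S_g: (D, T); W_c: (D, D); B_cg: (D, 3D). Returns (C_r, C_g, A_r, A_g)."""
    C = np.tanh(S_r.T @ W_c @ S_g)  # assumed affinity, rows: S_r steps, cols: S_g nodes

    # Column j of A_g: weights over the steps of S_r for node j of S_g.
    A_g = softmax(C, axis=0)
    # Column i of A_r: weights over the nodes of S_g for step i of S_r.
    A_r = softmax(C, axis=1).T

    C_r = S_g @ A_r  # attentive representation aligned with S_r, (D, T)
    C_g_hat = np.vstack([S_r, C_r]) @ A_g  # candidate representation, (2D, T)
    C_g = B_cg @ np.vstack([S_g, C_g_hat])  # keep S_g and compress 3D -> D
    return C_r, C_g, A_r, A_g


rng = np.random.default_rng(0)
D, T = 4, 5
S_r, S_g = rng.normal(size=(D, T)), rng.normal(size=(D, T))
W_c = rng.normal(0.0, 0.1, size=(D, D))
B_cg = rng.normal(0.0, 0.1, size=(D, 3 * D))
C_r, C_g, A_r, A_g = alternating_co_attention(S_r, S_g, W_c, B_cg)
print(C_r.shape, C_g.shape)
```

Stacking the original encoder output with its attentive counterpart before each projection is what lets the initial sequential and structural information survive the attention steps.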

Prediction and Optimization
With the co-attention mechanism, we can generate the final session representation U S , which integrates the sequential as well as structural information in the ongoing session. Then, we make predictions by conducting the dot product of the session representation and the embedding of each candidate item: As y i is an unnormalized value, we then do softmax across all candidate items so as to get the prediction probability of item i. For training, we apply the widely used cross-entropy loss as our loss function: where p is the ground truth distribution while q is the prediction probability distribution generated based on Equation (23). Then, we can optimize our model with Equation (24).
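A small worked example of the scoring and training objective, written in plain Python: scores are dot products between the session representation and each candidate embedding, probabilities come from a softmax, and the loss is the cross-entropy against the ground truth. Treating the ground-truth distribution as one-hot (so the loss reduces to the negative log-probability of the target item) is an assumption of this sketch, as are the function names and the toy vectors.

```python
import math


def predict(session_repr, item_embeddings):
    """Return softmax probabilities over candidate items from dot-product scores."""
    scores = [sum(u * v for u, v in zip(session_repr, emb)) for emb in item_embeddings]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Cross-entropy with a one-hot ground truth: -log q[target]."""
    return -math.log(probs[target_index])


u_s = [1.0, 0.0]  # toy session representation
items = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # toy candidate item embeddings
q = predict(u_s, items)
loss = cross_entropy(q, 0)
print([round(x, 4) for x in q], round(loss, 4))
```

With these toy vectors the unnormalized scores are 2, 0 and 1, so the first item receives the highest probability and the loss is smallest when it is the ground truth.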

Experimental Setup
In order to investigate the effectiveness of our proposal, we compare the recommendation performance of CCN-SR and several baseline methods on two public e-commerce datasets. In this section, we introduce the baselines, datasets and experimental setups in detail.

Model Summary
We compare our model with several baselines, including traditional methods, i.e., Item-pop and BPR-MF, and neural network based approaches, i.e., RNN-based and GNN-based models. Our baselines are as follows:

Item-pop
It recommends items with high popularity, i.e., items with a large number of interactions [40].

BPR-MF
Bayesian personalized ranking (BPR-MF) is a factorization-based method that uses a pairwise ranking loss to optimize the matrix factorization model and make recommendations [41].

GRU4Rec
It proposes an RNN model for session-based recommendation, which utilizes session-parallel mini-batches as well as a pair-wise loss function for training [3].

NARM
It is an RNN-based model, which also applies the attention mechanism to emphasize the importance of the last item in the ongoing session [8].

SR-GNN
It proposes to construct a session graph for each session and then adopt the Gated Graph Neural Network (GGNN) model as well as the attention mechanism to capture transition relations among items [2].

We also investigate the performance of different variants of the model proposed in this paper:

CCN-SR_pa
Our proposed CCN-SR model, which adopts the parallel co-attention strategy upon the RNN and GNN networks to get the mutual information between them.

CCN-SR_al
Our proposed CCN-SR model, which adopts the alternating co-attention strategy upon the RNN and GNN networks to get the mutual information between them.

Datasets
We evaluate our model as well as the baselines on two public e-commerce datasets:

Tmall
Tmall is a dataset released by Taobao. Both datasets consist of lines that each record a user ID, an item ID, and the timestamp at which the user interacts with the item. Following [42], we preprocess the datasets as follows. For the Tmall dataset, we filter out users with fewer than 5 interactions and items that appear fewer than 5 times. For the Tianchi dataset, we filter out users with fewer than 20 interactions and items with fewer than 50 interactions. Sessions with fewer than 3 items or more than 200 items are also filtered out. The characteristics of the datasets after preprocessing are summarized in Table 2.

Settings and Parameters
We divide each of the two datasets into a training set and a test set for training and evaluation, respectively. For the Tmall dataset, we use the last 30 days of interactions as the test set, and the remaining days of interactions are regarded as the training set. For Tianchi, the training set consists of all but the last 7 days of interactions; the test set contains the last 7 days of interactions. We also remove items in the test sets that do not appear in the training set [2]. Following [2], we apply the data augmentation strategy for our model as well as all baseline models in our paper.
We adopt the widely used metrics, i.e., Recall and MRR, to evaluate the performance of all models [8,36]. Recall considers whether the ground truth item is contained in the recommendation list; MRR evaluates the ranking accuracy of the model, i.e., whether the ground truth item is at the top position of the recommendation list.
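For reference, the two metrics can be computed as in the following plain-Python sketch. It assumes each test case has exactly one ground truth item and that a case contributes zero to MRR when the item falls outside the top K; the function name and the toy rankings are illustrative.

```python
def recall_and_mrr_at_k(ranked_lists, targets, k=20):
    """Return (Recall@k, MRR@k) averaged over all test cases.

    ranked_lists: one ranked list of item IDs per test case, best first.
    targets: the ground truth next item for each test case.
    """
    hits, reciprocal_ranks = 0, 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            hits += 1
            reciprocal_ranks += 1.0 / (top_k.index(target) + 1)  # ranks start at 1
    n = len(targets)
    return hits / n, reciprocal_ranks / n


ranked_lists = [[3, 1, 2], [5, 4, 6], [7, 8, 9]]
targets = [1, 6, 0]
print(recall_and_mrr_at_k(ranked_lists, targets, k=2))
print(recall_and_mrr_at_k(ranked_lists, targets, k=3))
```

In the toy data, the second target sits at rank 3, so it counts as a miss for K = 2 and as a hit with reciprocal rank 1/3 for K = 3, which shows how both metrics can only grow as K increases.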
We use Adam [43] as our optimizer when training the model; the learning rate is initialized as 0.001, the batch size is set to 100, and the dimension of the item embeddings is set to 100. We initialize all trainable parameters using a Gaussian distribution with a mean of 0 and a standard deviation of 0.1. Unless specified otherwise, we set the recommendation number N to 20.

Results
In this section, we conduct several experiments to evaluate the performance of our model as well as the baselines. We first analyze the overall performance of all models in Section 5.1. We then investigate the impact of different combination strategies of the sequential and structural information on our model, i.e., the parallel co-attention strategy, the alternating co-attention strategy, and the concatenation strategy, in Section 5.2. Finally, we analyze the performance of our model as well as the baseline models on sessions with different lengths in Section 5.3.

Overall Performance
As shown in Table 3, among the baselines, the neural approaches, i.e., GRU4Rec, NARM and SR-GNN, outperform the traditional methods, i.e., Item-pop and BPR-MF. Among the RNN-based models, i.e., GRU4Rec and NARM, we can see that NARM generally outperforms GRU4Rec in terms of Recall@20 and MRR@20 on both datasets. This is because the attention mechanism in NARM can distinguish the importance of different items, which helps to filter out some noisy interactions and capture users' main purposes in the current session. The GNN-based model, i.e., SR-GNN, achieves the best performance among all of the baseline models, which indicates that modeling the transition relations between adjacent items can help to improve the recommendation accuracy.

Table 3. Performance of recommendation models. The results produced by the best baseline and the best performer in each column are underlined and boldfaced, respectively. Statistical significance of pairwise differences of the best model vs. the best baseline is determined by a paired t-test (p-value ≤ 0.05). Columns: Model; Recall@20 and MRR@20 on Tmall; Recall@20 and MRR@20 on Tianchi.

As to our proposed CCN-SR models, we can see that both CCN-SR_pa and CCN-SR_al outperform the best baseline model in terms of Recall@20 as well as MRR@20 on Tmall and Tianchi. This demonstrates the effectiveness of considering structural as well as sequential information simultaneously and capturing the co-attentive relations between them for session-based recommendation. Besides, the improvements of our model over the best baseline, i.e., SR-GNN, are significant in terms of Recall@20 and MRR@20 on Tianchi. This may be because Tianchi is sparser than Tmall and the recommendation task is more difficult on Tianchi. However, CCN-SR takes both sequential and structural information into consideration, and thus can generate accurate session representations even on a sparse dataset.
Comparing the two variants of CCN-SR, we find that CCN-SR_al shows better performance than CCN-SR_pa in terms of all metrics on both datasets except Recall@20 on the Tmall dataset. This may be because the alternating co-attention strategy incorporates both the initial and attentive information with Equations (17) and (18), which helps to preserve enough information for recommendation.

Impact of Different Combination Strategies
In this section, we explore how different strategies for combining the structural and sequential information affect the performance of CCN-SR. In Section 3.4, we propose two co-attention strategies to capture the relation between these two kinds of information and combine them for session-based recommendation, i.e., parallel co-attention and alternating co-attention. Besides, simply concatenating S r and S g is also an intuitive way to integrate the two kinds of information. Thus, we test the performance of CCN-SR with three different combination strategies, i.e., parallel co-attention, alternating co-attention and concatenation, on the top-N recommendation task, with N = 10, 20, 30, 40, 50. We denote the three models as CCN-SR pa, CCN-SR al and CCN-SR concat, respectively. Their results on the Tmall and Tianchi datasets are presented in Figure 4.
On the Tmall dataset, as shown in Figure 4a,b, the performance of all models improves in terms of Recall and MRR as the recommendation number N increases from 10 to 50. Our proposed CCN-SR pa and CCN-SR al always show better performance than CCN-SR concat, which indicates that simple concatenation cannot take full advantage of the structural and sequential information. Specifically, our CCN-SR al model shows larger improvements over CCN-SR concat in terms of MRR when there are fewer recommendations. For example, when the recommendation number is N = 10 and N = 50, CCN-SR al outperforms CCN-SR concat by 2.4% and 1.3% in terms of MRR, respectively. This may be because our CCN-SR al model tends to rank the ground truth item near the top of the list, so that its ranking accuracy is not influenced much by the recommendation number. This also suggests that our model can be applied in scenarios where the number of recommendations is limited, e.g., mobile recommendations.
For the Tianchi dataset shown in Figure 4c,d, the results are similar to those on the Tmall dataset, except that CCN-SR al outperforms CCN-SR pa in terms of both Recall and MRR on Tianchi, while CCN-SR al beats CCN-SR pa only in terms of MRR on Tmall. This indicates that CCN-SR al is better at improving the ranking accuracy than CCN-SR pa, since CCN-SR al takes more information (both initial and attentive information) into consideration through the alternating co-attention calculation process.
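To make the top-N evaluation above concrete, the following is a minimal sketch of how Recall@N and MRR@N could be computed for several cutoffs. It assumes the common convention that each test session has a single ground truth item and that the reciprocal rank is set to zero when that item falls outside the top N; the function names and the toy ranked lists are hypothetical and are not taken from the evaluation code behind Figure 4.

```python
from typing import List, Sequence, Tuple


def rank_of_target(ranked_items: Sequence[int], target: int) -> int:
    """Return the 1-based rank of target in ranked_items, or 0 if it is absent."""
    for position, item in enumerate(ranked_items, start=1):
        if item == target:
            return position
    return 0


def recall_and_mrr_at_n(
    ranked_lists: Sequence[Sequence[int]], targets: Sequence[int], n: int
) -> Tuple[float, float]:
    """Compute Recall@n and MRR@n over a set of test sessions."""
    if not targets or len(ranked_lists) != len(targets):
        raise ValueError("need one ranked list per non-empty target list")
    hits = 0
    reciprocal_sum = 0.0
    for ranked, target in zip(ranked_lists, targets):
        rank = rank_of_target(ranked[:n], target)  # only the top n items count
        if rank > 0:
            hits += 1
            reciprocal_sum += 1.0 / rank
    return hits / len(targets), reciprocal_sum / len(targets)


# Toy data: three sessions with hypothetical item ids.
ranked_lists: List[List[int]] = [[5, 3, 9, 1], [2, 7, 4, 8], [6, 1, 3, 2]]
targets: List[int] = [3, 8, 0]

for cutoff in (2, 4):
    recall, mrr = recall_and_mrr_at_n(ranked_lists, targets, cutoff)
    print(f"N={cutoff}: Recall={recall:.4f}, MRR={mrr:.4f}")
```

Under this convention, both metrics are non-decreasing in N, and an item first admitted at rank N + 1 adds at most 1/(N + 1) to the reciprocal-rank sum, which is consistent with the MRR curves flattening as N grows.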

Impact of the Current Session Length
In this section, we evaluate the performance of our models as well as the baselines on sessions of different lengths. We group the sessions in the test sets into short, medium and long sessions. For Tmall, we regard sessions with no more than five interactions as short sessions, sessions with more than 10 interactions as long sessions, and the remaining ones as medium sessions. For Tianchi, short sessions contain no more than 25 interactions, long sessions contain more than 50 interactions, and the remaining ones are medium sessions. We report the results for short, medium and long sessions on the test sets of Tmall and Tianchi in Figure 5.
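The grouping rule described above can be sketched as follows. The thresholds come from the text; the names `LENGTH_THRESHOLDS`, `length_group` and `group_sessions` are hypothetical, and the sketch assumes that the session length is simply the number of recorded interactions in the session.

```python
from typing import Dict, List, Sequence, Tuple

# (largest length counted as short, largest length counted as medium), per the text.
LENGTH_THRESHOLDS: Dict[str, Tuple[int, int]] = {
    "Tmall": (5, 10),
    "Tianchi": (25, 50),
}


def length_group(session_length: int, dataset: str) -> str:
    """Map a session length to 'short', 'medium' or 'long' for a dataset."""
    short_max, medium_max = LENGTH_THRESHOLDS[dataset]
    if session_length <= short_max:
        return "short"
    if session_length > medium_max:
        return "long"
    return "medium"


def group_sessions(
    sessions: Sequence[Sequence[int]], dataset: str
) -> Dict[str, List[Sequence[int]]]:
    """Split test sessions into the three length groups."""
    groups: Dict[str, List[Sequence[int]]] = {"short": [], "medium": [], "long": []}
    for session in sessions:
        groups[length_group(len(session), dataset)].append(session)
    return groups


# Toy sessions with hypothetical item ids: lengths 3, 8 and 12.
toy_sessions = [list(range(3)), list(range(8)), list(range(12))]
toy_groups = group_sessions(toy_sessions, "Tmall")
print({name: len(members) for name, members in toy_groups.items()})
```

Metrics such as Recall@20 and MRR@20 would then be computed separately within each group.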
As shown in Figure 5a,b, on the Tmall dataset the performance of all models in terms of Recall@20 and MRR@20 improves as the session length increases, and our CCN-SR models always achieve better performance than the baseline models. For the baselines, SR-GNN and NARM show better performance than GRU4Rec. However, the improvements of SR-GNN over GRU4Rec decrease as the session length increases. For instance, SR-GNN beats GRU4Rec by 10.5% and 17.8% in terms of Recall@20 and MRR@20 on short sessions on Tmall, but only by 5.7% and 9.0% on long sessions. This indicates that the RNN model is better at dealing with long sessions than with short sessions. As for our CCN-SR models, their improvements over SR-GNN are larger in terms of MRR@20 than in terms of Recall@20. For instance, CCN-SR al beats SR-GNN by 1.6% and 3.2% in terms of Recall@20 and MRR@20 on long sessions on Tmall. This demonstrates that our model can achieve better recommendation ranking accuracy.

On the Tianchi dataset, reported in Figure 5c,d, the performance of all models decreases as the session length increases, and our proposal still achieves the best performance compared with the baseline models. Specifically, for Recall@20, SR-GNN outperforms GRU4Rec on short sessions. However, the gap between the two models becomes smaller when the sessions change from short to medium, and on long sessions GRU4Rec even shows better performance than SR-GNN. This phenomenon is more obvious for MRR@20. SR-GNN beats GRU4Rec by 9.5% and 3.7% in terms of Recall@20 and MRR@20 on short sessions, while the improvements of GRU4Rec over SR-GNN are 2.8% and 19.3% for Recall@20 and MRR@20 on long sessions. This demonstrates that the RNN and GNN models are good at dealing with sessions of different lengths, and thus integrating the information learned from them can help to boost the performance of session-based recommendation.

Conclusions
In this paper, we propose a collaborative co-attention network for session-based recommendation, i.e., CCN-SR. CCN-SR incorporates both sequential and structural information learned from an RNN-based session encoder and a GNN-based session encoder, and captures their co-dependent relations for session-based recommendation. Specifically, we propose two co-attention strategies, i.e., parallel co-attention and alternating co-attention, to generate co-dependent representations of the two kinds of information and then combine them for making recommendations. We conduct comprehensive experiments to verify the effectiveness of our model and explore the impact of different combination strategies. The results show that our proposed co-attention mechanism achieves more competitive performance than the simple concatenation strategy across different recommendation numbers. As for sessions with different lengths, the experimental results demonstrate that our CCN-SR model outperforms the state-of-the-art baseline models across all session lengths. Our proposal is especially competitive when the recommendation number is small. Thus, it can be applied in mobile recommendation scenarios, where the length of the recommendation list is limited.
As for the limitations of this work, on the one hand, our proposal has a higher computational complexity than the baseline models due to the co-attention network; on the other hand, the improvements of our model over the best baseline model are not significant in some cases, which may be due to the different sparsity of the datasets. The improvements of our model over the baselines are more obvious on the sparser dataset. Thus, we plan to evaluate our model on more datasets in the future.
As for future work, on the one hand, we plan to incorporate external knowledge, e.g., category information and content information, to capture the item relations more accurately [44][45][46][47][48]. For example, some items are complementary products and should be recommended together, especially in e-commerce scenarios. On the other hand, different user behaviors, e.g., "click", "collect" and "buy", can also provide important information for capturing the user's preference. For instance, "collect" and "buy" indicate a stronger consumption motivation of a user for an item than "click". Thus, we plan to extend our model with the behavior information to generate more informative representations of sessions [49][50][51].