An Attention-Based Recommender System to Predict Contextual Intent Based on Choice Histories across and within Sessions

Recent years have witnessed the growth of recommender systems, aided by deep learning techniques. Recurrent Neural Networks (RNNs) play an increasingly vital role in session-based recommender systems, since they use the user's sequential history to build a comprehensive user profile, which helps improve the recommendation. However, a problem arises regarding how to track the variation in the user's contextual preference, especially the short-term intent in the near future, and make the best use of it to produce a precise recommendation at the start of a session. We propose a novel approach named Attention-based Short-term and Long-term Model (ASLM) to improve next-item recommendation, using an attention-based RNN that integrates both the user's short-term intent and long-term preference in a two-layer network. The experimental study on three real-world datasets and two sub-datasets demonstrates that, compared with other state-of-the-art methods, the proposed approach can significantly improve next-item recommendation, especially at the start of sessions. As a result, our proposed approach is capable of coping with the cold-start problem at the beginning of each session.


Introduction
In recent years, we have witnessed the rapid growth of neural networks for a variety of applications in computer vision, natural language processing, speech recognition, etc. Neural networks provide four key benefits: (1) they work with unlabeled data; (2) they learn low-level features from minimally processed raw data; (3) they detect complex interactions among features; and (4) they work with large class memberships [1,2]. Recent achievements have shown that it is also of great use to combine different neural networks into recommender systems. Many works have employed neural networks to improve recommendation in the domain of recommender systems [3][4][5][6][7] and in the context of web-based decisions [8][9][10][11][12]. The success of these neural recommendation approaches demonstrates their capability.
Session-based recommender systems are good examples. Although not a novel research topic, session-based recommender systems are largely under-investigated [3,4,13]. Compared with traditional recommendation methods, a session-based recommender system is more appropriate for capturing dynamic and sequential user behavior. The Recurrent Neural Network (RNN) plays an important role in session modeling, and it seems to have promise for further improvement. An RNN, compared to many other recommendation models, takes advantage of the ordered sequence in a very natural way [3,5,7].
Automated and personalized recommendations of the type "You may also be interested in" are common on e-commerce sites nowadays, and there is sufficient evidence that such item-related recommendations can measurably impact businesses [14][15][16][17][18][19]. Undoubtedly, one of the representative business scenarios is e-commerce. Modern online shops are no longer static catalogs of items but tend to fulfil the user's interests by providing personalized product recommendations. In the best case, these recommendations should match both the users' long-term preference and their current shopping goals. On Amazon.com, for example, each product page contains multiple personalized recommendation lists with different purposes. Though the Amazon recommender system employs a collaborative filtering approach [20], it also uses solo-user recommendations. It provides alternatives to the product currently viewed, reminds the user of items recently viewed, or renders recommendations that should appeal to the general taste of the user, as shown in Figure 1. As a result, the content displayed on the page not only depends on the features of the currently viewed article (i.e., the product category), but can also be influenced by an integration of the user's historical shopping behavior (long-term preference) and his most recent interactions of viewing different products (short-term shopping goals). In this case, we assume that the user's preference is composed of two components: (1) the long-term preference, which reflects the considerably stable interests of the users based on their online activities, and (2) the short-term intent, which represents the users' current interests. There are many existing techniques that investigate the combination of the user's long-term preference and the short-term intent [13][21][22][23][24][25][26][27].
However, the user's preference for items constantly evolves over time [28], while these works implicitly assume the user preference to be stationary, which seems to be an unrealistic assumption in many scenarios, particularly in news or shopping related scenarios [5][6][7][13][29]. Therefore, such methods perform poorly when preferences are, in fact, context-sensitive and transient. There are session-based approaches [7][29][30][31][32] that focus on interest drift to solve the problem of static user preference and intent, but these methods assume that each item in one session has the same influence, which is usually an invalid assumption, especially when considering a user's interactions with items of diverse characteristics. One of the main attractions of search-engine advertising is that search engines take into account the "current interest" of the user and thereby can deliver the right message, ad, or product just at the right time. The enormous growth in spending going to search advertising (about $100b in the US in 2018 [33]) is a testament to the power of contextual "interest-based" advertising. To deal with the problem of static weights being assigned to each item, researchers proposed several approaches [34][35][36][37]. Our proposed model goes a step further, not only taking the user's interest drift into consideration, but also regarding each item in one session as "the one and only" item, which means each one has a unique influence on the user and therefore the user may pay different attention to each one.
In this work, we propose a novel approach named Attention-based Short-term and Long-term Model (ASLM) to solve the next-item recommendation problem. It has been shown that the attention mechanism is able to assign different weights according to the importance of the items, and there have been works that employed the attention mechanism to form the characteristics of the user [38][39][40]. In order to capture the variation in user preference and generate the precise contextual user profile, we build two different RNN-based layers: one attentive layer to model the user's short-term intent, and the other one to model his long-term preference. The main contributions of this paper include:

• We introduce the attention mechanism to investigate the variation in a user's short-term intent for session-based recommendations.

• Taking the user's previous sessions into account, we model the user's long-term preference and concatenate that with short-term intent to account for more comprehensive contextual user outcomes.

• We conduct empirical evaluations and validate the effectiveness of the model on three real-world datasets and two sub-datasets.

• We demonstrate the effectiveness of the approach in resolving the cold-start problem.

Related Work
Collaborative Filtering (CF) is the most commonly used and studied technology [20,41]. CF allows users to give ratings about a set of elements (e.g., videos, songs, films, etc. in a CF-based website) in such a way that when enough information is stored on the system, we can make recommendations to each user based on information provided by other users we consider to have the most in common with them [42]. The most commonly used algorithm for CF is the k Nearest Neighbors (kNN) [41,43,44], which offers several mechanisms to identify similar users and pool recommendations. The other main approach for CF is matrix factorization (MF) [42,45], which regards recommendation as a dimensionality reduction problem. However, these methods still fall short of researchers' expectations because of several major challenges:

• Resulting from the high level of sparsity [46,47] in recommender system databases, user similarity measures often encounter processing and cold-start problems from inadequate mutual ratings for a comparison of users and items [48][49][50].

• The CF-based recommendation lacks diversity, because it only recommends popular items from similar users and cannot cater to the unique tastes of a user.

• When the numbers of existing users and items grow tremendously, traditional CF algorithms suffer serious scalability problems, with computational resources going beyond practical or acceptable levels [45].

• Most importantly, they are not able to capture the temporal context and sequential patterns of user behavior (e.g., the evolving user long-term preference and short-term intent) [51]. In other words, when the data contains no ratings but only pure user-item interaction (or choice) logs, CF-based methods may not work.
To deal with the sequential patterns of user behavior and temporal aspects of recommendations, a variety of works focus on session-based recommendation, which features sequence modeling. In academia, sequential recommendation problems are typically regarded as the task of predicting the next user action. Experimental evaluations are usually based on larger, time-ordered logs of user actions (e.g., the users' item viewing and purchasing activities on an e-commerce shop), which contain both the implicit and explicit feedback from users and give the recommender system the capability of capturing the sequential patterns of users.
Zimdars et al. [25] framed CF as a sequence prediction problem and used a simple Markov model for web-page recommendation. Mobasher et al. [26] proposed a similar approach, using sequential pattern mining. Both showed the superiority of methods based on sequence over nearest-neighbor approaches. Park et al. [27] presented a modified User-based CF method, called Session-Based CF (SSCF), that uses information from similar sessions to capture sequence and repetitiveness in the listening process. Most recently, researchers have successfully adopted deep learning techniques in recommender systems. In particular, RNN, which is capable of learning models from sequentially ordered data, is a "natural choice" for sequence modeling [4]. Hidasi et al. [13] first introduced the idea of using RNN for session-based recommendation to deal with sparsity problems and proposed the GRU4REC architecture. Tan et al. [21] improved the performance of the model proposed in [13] with data augmentation and a method to account for shifts in the input data distribution. Jannach et al. [22] also improved the model proposed in [13] by combining the kNN approach with GRU4REC. Dong et al. [23] improved the RNNs [52] and the MF model [53] by combining them in a multi-task learning framework, where they performed joint optimization with shared model parameters, enforcing the two parts to regularize each other. Liang et al. [24] employed RNN to extract the item and user profiles from user-defined items' tags and their tagging behaviors. However, these session-based methods regard user preference for items as a stationary property and therefore cannot capture the changing contextual interactions between user and items.
To solve the problem of static user preference and intent, Jing et al. [30] focused on the variation in user intent at different times and proposed an LSTM-based "Just-In-Time" approach to recommend the right item at the right time. Wu et al. [31] proposed an LSTM-based approach that implicitly captures various known temporal patterns in movie ratings data without explicit inclusion in the model, to learn dynamic embeddings of users and movies. The next-basket recommendation model proposed in [32] learned the varying representation of a user and captured global sequential features among baskets to reflect the user's dynamic intent at different times and interactions of all baskets of the user over time. Ruocco et al. [7] improved the next-item recommendation and cold-start problem by implementing two RNNs to extract user preference from recent sessions and the current session, respectively. Hu et al. [29] designed an efficient personalized session-based recommender system with shallow wide-in-wide-out networks over relaxed ordered user-session contexts. These methods assume that each item in one session has the same influence. However, this assumption is usually invalid, especially when considering a user's interactions with items of diverse characteristics. Moreover, not only does the user's preference for items change across sessions, but so does his intent on items within one session.
To deal with the problem of static weights being assigned to each item, Cheng et al. [34] developed an attention-based model to capture the changing attention that a user pays to different text reviews and ratings. Wang et al. [35] designed an attention-based transaction embedding model to weight each item in a transaction without assuming order. To solve these two issues at the same time in a unified way, we follow the idea in [7] but make the following contributions: (1) We propose an Attention-based Short-term Intent Layer to assign different weights to items in the current session to capture the contextual property of user intent. (2) We propose a Long-term Preference Layer to take the user's previous sessions into account, to provide information to the Attention-based Short-term Intent Layer and enable it to combine both the user's short-term intent and long-term preference in the attention process.

The ASLM Architecture
In this section, we first formalize the problem explained above and present the main idea of the Attention-based Short-term and Long-term Model (ASLM) proposed in this paper. Afterwards, we describe the architecture of ASLM with the aid of a diagram.

Problem Formulation
Let $U$ denote a set of users and $V$ denote a set of items, where $|U|$ and $|V|$ are the total numbers of users and items, respectively. In this work, we concentrate on extracting the user contextual intent and preference from sequential user-item interaction events (e.g., commenting on a thread or listening to music). For each user $u \in U$, the sequential sessions are denoted as $S^u = \{S^u_1, S^u_2, \ldots, S^u_T\}$, where $T$ is the total number of time steps and $S^u_t \subseteq S^u$ ($t \in [1, T]$) represents the item set related to the interaction of user $u$ at time step $t$. In each session, there is a set of events $\{e^u_{t,i} \in \mathbb{R}^m \mid i = 1, 2, \ldots, |S^u_t|\}$, where $e^u_{t,i}$ represents the event $i$ in the session $S^u_t$. For each event, the user $u$ interacts with an item $v_i \in V$. Both $S^u_t$ and $e^u_{t,i} \in S^u_t$ are ordered by time sequence. For a time step $t$, the session $S^u_t$ implies $u$'s short-term intent at time $t$, while the sessions before time step $t$, defined as
As in [7], our task is to predict each successive item in a session $S^u_t$. That is, for a sub-session $\{e^u_{t,1}, e^u_{t,2}, \ldots, e^u_{t,i}, \ldots, e^u_{t,n}\}$ of $S^u_t$, the system is to predict $e^u_{t,i+1}$. This is repeated
The goal of the system is to make the next item, $e^u_{t,i+1}$, be located as close to the top of the ranked list as possible.
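The next-item task described above can be illustrated with a small sketch that enumerates the (prefix, next item) pairs of each session. This is only an illustration under stated assumptions: the dictionary layout, the integer item ids, and the name `next_item_examples` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy layout: user id -> time-ordered sessions -> time-ordered item ids.
Sessions = Dict[str, List[List[int]]]


def next_item_examples(sessions: Sessions) -> List[Tuple[str, int, List[int], int]]:
    """Enumerate (user, session index t, prefix of session t, next item)."""
    examples = []
    for user, user_sessions in sessions.items():
        for t, session in enumerate(user_sessions):
            # Each non-empty proper prefix e_{t,1..i} is paired with e_{t,i+1}.
            for i in range(1, len(session)):
                examples.append((user, t, session[:i], session[i]))
    return examples


toy = {"u1": [[3, 7, 7, 2], [5, 1]]}
examples = next_item_examples(toy)
for example in examples:
    print(example)
```

Each tuple keeps the session index `t`, so that a model could look up the sessions before `t` as the user's history when scoring the prefix.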

Main Idea
Consider the following example of a user shopping scenario: On May 15th, John purchased a printer and a smartphone. On July 10th, when the ink cartridge ran out, he wanted another to replace it. At that point in time, he would have been happy to see an ad for any ink cartridge and would have bought another brand if the system had recommended it. Also, since John was in the purchasing process, he might also have been attracted to a cross-sell recommendation of a phone-protector case. However, he would have been much less interested, in July, in recommendations for another brand-new printer or smartphone.
In the ASLM model, there are two layers: the Long-term Preference Layer and the Attention-based Short-term Intent Layer. Our model is illustrated in Figure 2, where two mutually exclusive methods are shown in one figure to conserve space: (a) ASLM-AP, and (b) ASLM-LHS. In other words, only one method is used at a time, and the two are not connected to each other. The two ASLM models differ only in how they feed information to the Attention-based Short-term Intent Layer, either using average pooling (ASLM-AP) or the last hidden state (ASLM-LHS). Based on the experiment results discussed in Section 5, either of these two models can be used, since their performances are at the same level.
The task of the Long-term Preference Layer is to supply knowledge of the user's long-term preference to the Attention-based Short-term Intent Layer before the short-term intent layer starts to investigate the current session. The task of the Attention-based Short-term Intent Layer is to produce recommendations by taking the user's long-term preference into consideration while it processes the sequence of items in the current session to extract his short-term intent.
We discuss the workflow of these two layers in detail in Section 3.3. In particular, the Attention-based Short-term Intent Layer's procedure of capturing the characteristics of each step in the user behavior is thoroughly discussed in the decoder part of Section 3.3.2. Our model is proposed according to the following characteristics of user preference. (1) User preference changes at different time steps. (2) Within one time step, the user may have another intent on a different category of items, which has a different impact on the future interaction of items. (3) Since there are different preferences across users, the importance of the same item may differ, which results in different influences on the prediction of the next item for different individuals. The main idea of our approach is to:
(1) use the GRU (Gated Recurrent Unit) [13] in the Long-term Preference Layer to process a user's recent sessions, i.e., a user's long-term preference, which is similar to the Inter-session RNN in [7].
(2) enhance the recommendations by employing an augmented attention-based encoder-decoder in a structured framework named Attention-based Short-term Intent Layer. We use the encoder to process the events within a session, i.e., a user's short-term intent, and make the decoder receive the user's long-term preference representation $P^u_L$ from the user's history session representations as its initial hidden state. As a result, the model enables the Attention-based Short-term Intent Layer to combine both the user's long-term preference and short-term intent in the attention process. Eventually, for ASLM-LHS, the decoder stores its output state in session representations for the Long-term Preference Layer to update the states for future recommendation. For ASLM-AP, we put the average of the current session's embedded vector representations of the items into session representations.
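The two ways of producing a session representation can be sketched as follows. This is a minimal illustration, assuming item embeddings and hidden states are plain lists of floats; the function names `session_rep_ap` and `session_rep_lhs` are hypothetical, and the hidden states passed to the second function stand in for whatever states the trained decoder would actually produce.

```python
from typing import List

Vector = List[float]


def session_rep_ap(item_embeddings: List[Vector]) -> Vector:
    """ASLM-AP style: element-wise average of the session's item embeddings."""
    if not item_embeddings:
        raise ValueError("a session needs at least one item")
    n = len(item_embeddings)
    dim = len(item_embeddings[0])
    return [sum(vec[d] for vec in item_embeddings) / n for d in range(dim)]


def session_rep_lhs(hidden_states: List[Vector]) -> Vector:
    """ASLM-LHS style: keep only the last hidden state of the session."""
    if not hidden_states:
        raise ValueError("a session needs at least one hidden state")
    return list(hidden_states[-1])


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
ap_rep = session_rep_ap(toy_embeddings)
lhs_rep = session_rep_lhs([[0.1, 0.2], [0.3, 0.4]])
print(ap_rep, lhs_rep)
```

Average pooling depends only on the item embeddings, whereas the last hidden state depends on the recurrent weights, which is one plausible reason the two variants can behave differently on short sessions.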

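[Figure 2: architecture of ASLM, with method (a) ASLM-AP and method (b) ASLM-LHS. Labels recovered from the figure: Session Representations, Long-term Preference Layer.]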

Model Description
In this section, we discuss the ASLM model's entire recommendation process and its main system components, together with the technologies we adopted.
As illustrated in Figure 2, ASLM integrates the Long-term Preference Layer and the augmented Attention-based Short-term Intent Layer into one single architecture. The Long-term Preference Layer is evolved from the Inter-session RNN used in [7], and the model proposed in [7] is regarded as a baseline to compare with the ASLM model.

Long-Term Preference Layer
The short-term intent layer learns about the user's intent throughout the session, but discards all the information at the end of the session. Similar to the model proposed in [7], we employ a GRU to take the user's $k$ most recent history sessions into account. At the beginning of each session, the GRU models the user's long-term preference representation $P^u_L$ from user $u$'s history session representations and provides it to the GRU-based decoder in the Attention-based Short-term Intent Layer as the initial hidden state.
The graph inside the dark green box in Figure 2 illustrates the workflow of the Long-term Preference Layer. For each session $S^u_t$ in $u$'s interaction history $S^u$, let $s^u_t$ be an embedded vector representation of that session. The Long-term Preference Layer takes a series of vector representations of $u$'s $k$ most recent history sessions $\{s^u_{t-k}, s^u_{t-k+1}, \ldots, s^u_{t-1}\}$ as input, where $s^u_{t-1}$ is the most recent history session. Afterwards, the layer produces the initial hidden state, $H_0$, of the Attention-based Short-term Intent Layer before the short-term intent layer starts making predictions. As a result, it enables the Attention-based Short-term Intent Layer to combine both the user's short-term intent and long-term preference in the attention process.
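As a rough sketch of this workflow, the snippet below runs a plain GRU cell over the k most recent session representations and returns the final state as the initial hidden state. It is not the authors' implementation: the class `GRUCell`, the function `long_term_preference`, the omission of bias terms, the zero initial state, and the random untrained weights are all assumptions made for illustration, and the update convention used is h' = (1 - z) * h + z * candidate.

```python
import math
import random
from typing import List

Vector = List[float]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Bias-free GRU cell with small random weights (untrained, illustration only)."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def mat(rows: int, cols: int) -> List[Vector]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.W = {g: mat(hidden_dim, input_dim) for g in "zrh"}   # input weights
        self.U = {g: mat(hidden_dim, hidden_dim) for g in "zrh"}  # recurrent weights

    def step(self, x: Vector, h: Vector) -> Vector:
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], gated_h))]
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def long_term_preference(cell: GRUCell, session_reps: List[Vector], k: int) -> Vector:
    """Run the GRU over the k most recent session representations; return H_0."""
    if k < 1:
        raise ValueError("k must be at least 1")
    h = [0.0] * cell.hidden_dim  # assumed zero state when there is no history
    for s in session_reps[-k:]:  # oldest of the k sessions first, most recent last
        h = cell.step(s, h)
    return h


cell = GRUCell(input_dim=2, hidden_dim=3, seed=0)
history = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
h0 = long_term_preference(cell, history, k=2)
print(h0)
```

With an empty history the function returns the zero vector, which corresponds to a user whose first session starts without any long-term signal.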

Attention-Based Short-Term Intent Layer
For predicting next items, a user's short-term preference is important. A user's short-term preference consists of different kinds of short-term intent on different categories of items. Previously, Rendle et al. [54] have done some research on combining long- and short-term preference for sequential recommendation. However, in the earlier works, researchers regard the user's short-term preference as a stationary property, and therefore items are given the same weights and the variation in short-term intent is not evaluated properly. To resolve this problem, we employ an attention-based encoder-decoder architecture to assign weights to both long- and short-term session representations, to form the characteristics of user $u$ comprehensively. In addition, we employ bi-directional LSTM and GRU as the encoder and decoder, respectively. One advantage of bi-directional LSTM over uni-directional LSTM is that it uses each event $e^u_{t,i} \in S^u_t$ to predict recommendations based on each event's past and future contexts, rather than just its past ones, which can be sensitive to the variation in the user's short-term intent from both sides. The attention mechanism has been widely and successfully applied in many tasks, e.g., information retrieval [38], computer vision [39] and recommendation [40].
The key idea of attention is to learn to assign attentive weights (normalized to sum to 1) to a set of features: higher weights indicate that the corresponding features are more informative for the task. It computes the importance of each item both in the short-term and long-term item sets of a given user and integrates these items' embeddings to build the representations of the user's intent and preference.
The graph inside the purple box in Figure 2 illustrates the workflow of the Attention-based Short-term Intent Layer. In the following paragraphs, we state the functionality and the working procedure of each component in the Attention-based Short-term Intent Layer.
Encoder. A bi-directional LSTM is employed as the encoder. We use item embeddings to represent each item. At the beginning of each session, each event $e^u_{t,i}$, which consists of the embedded item representation, is sent through a bi-directional LSTM to be processed together with its past and future contexts, to investigate the sequence patterns of the current session. Afterwards, the bi-directional LSTM produces its output vector, namely the source hidden states vector $[h_1\ h_2\ \ldots\ h_i]$, where $h_i$ is a source hidden state regarding the item $v_i$ and includes the latent information of $v_i$. Finally, the source hidden states are sent to the attention mechanism.
Decoder. A GRU is employed as the decoder. The procedure of capturing the characteristics of each step in the user behavior is stated below. At the beginning of each session, the GRU receives the user's long-term preference representation $P^u_L$ from the Long-term Preference Layer as the initial hidden state. In the meantime, the GRU takes the context vector $c_j$ produced by the attention procedure into consideration. In other words, $c_j$ provides the distribution of every event $e^u_{t,i}$'s influence upon the current target item prediction in the current session, which enables the GRU to find out which events are more important to the current choice. The more important events are highly related to user $u$'s current intent. Therefore, when the time step $t$ increases and the previously important events become less important, $u$'s intent has probably changed. The GRU is thus able to capture the variation of user $u$'s current intent. In other words, our proposed model is aware of the variation of user $u$'s short-term intent. Moreover, it makes the recommendation with the help of not only the short-term intent, but also the long-term preference. Finally, the GRU produces the output vector, namely the target hidden states vector $[h_1\ h_2\ \ldots\ h_j]$, where $h_j$ is a target hidden state regarding the current target item prediction $y_j$. The target hidden states vector, together with the context vector $c_j$ stated below, is used for the calculation of the target item prediction.
Architecture of Attention Mechanism. The main part of the attention mechanism is illustrated in Figure 3. The context vector $c_j$ is used to capture relevant source-side information to help provide the current target item prediction $y_j$. Given the target hidden state $h_j$ and the context vector $c_j$ calculated in Equation (2), we use a concatenation to integrate both vectors to produce an attentional hidden state $a_j$ with model parameter $W_c$. The context vector $c_j$ is computed by the sum of the product of each attentive weight $\alpha_{ji}$ and source hidden state $h_i$:

$$c_j = \sum_i \alpha_{ji} h_i \qquad (2)$$

A dense layer is used to fully connect the neurons in the encoder and the decoder. $\alpha_{ji}$ is the amount of attention that the output prediction of $h_j$ should pay to $h_i$. It is calculated by the SoftMax function over scores that compare the target hidden state $h_j$ with each source hidden state $h_i$.
Denote $s^u_t$ as the embedded vector representation of the session $S^u_t$. After the calculation of the attentional vector $a_j$, it is fed through another SoftMax layer to produce the predictive distribution. Given the predictions $y_{<j} = \{y_1, y_2, \ldots, y_{j-1}\}$, $s^u_t$ and the weight $W_i$ of $a_j$, the predictive distribution of items is formulated as:

$$p(y_j \mid y_{<j}, s^u_t) = \mathrm{softmax}(W_i a_j)$$

We apply two methods to generate the session representations $s^u_t$, similar to that in [7]. One is to generate $s^u_t$ with the average of the embedded vector representations of the items in the session, as illustrated in Part Method (a): ASLM-AP of Figure 2. The other is to use the last hidden state of
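One attention step of this kind can be sketched in a few lines of plain Python. Two choices below are assumptions that the text does not confirm: the comparison score is taken to be a dot product, and the attentional state is taken to be tanh(W_c [c_j; h_j]). The toy matrices, the equal source and target dimensions, and the name `attention_step` are likewise hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    return [dot(row, vec) for row in matrix]


def attention_step(
    target_h: Vector, source_hs: List[Vector], W_c: List[Vector], W_i: List[Vector]
) -> Tuple[Vector, Vector, Vector]:
    # Assumed dot-product score between h_j and every source state h_i.
    scores = [dot(target_h, h_i) for h_i in source_hs]
    alphas = softmax(scores)  # attentive weights, sum to 1
    dim = len(source_hs[0])
    # Context vector: weighted sum of the source hidden states.
    context = [sum(a * h[d] for a, h in zip(alphas, source_hs)) for d in range(dim)]
    # Assumed tanh over W_c applied to the concatenation [c_j; h_j].
    attentional = [math.tanh(x) for x in matvec(W_c, context + target_h)]
    probs = softmax(matvec(W_i, attentional))  # distribution over items
    return alphas, context, probs


source_hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states
target_h = [1.0, 0.0]                             # toy decoder state
W_c = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
W_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # three candidate items
alphas, context, probs = attention_step(target_h, source_hs, W_c, W_i)
print(alphas, context, probs)
```

In the toy input, the first and third source states score equally against the target state and higher than the second, so they receive equal and larger weights.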

Baselines
We compare our model with the following baseline algorithms.
Most Popular. This baseline always recommends the absolute most popular items in the training set. All items are sorted by their number of occurrences in the training set, and the top K items are recommended at each time step. In other words, the same items, which are the most popular ones among all the users, are recommended to each user. Although simple, it is usually a strong baseline in recommendation domains.
II-RNN-AP [7]. This baseline uses two RNNs to extract information from the user's current session and history sessions, respectively. II-RNN-AP is the II-RNN Model with average pooling to create session representations from items. The model's configurations are the same as those in [7].
II-RNN-LHS [7]. II-RNN-LHS is the II-RNN Model where the last hidden state of the Intra-session RNN is stored as the session representation. We use the same configurations as in [7].
SWIWO [29]. This full model is a three-layer network that predicts the relevant item based on the user-session context. We use the same configurations as in [29].
SWIWO-I [29]. This baseline is the simplified SWIWO model which only models item-session contexts without considering users. We use the same configurations as in [29].
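The Most Popular baseline is simple enough to sketch directly. The function name `most_popular` and the toy training sessions are hypothetical; how ties between equally frequent items are broken is not specified in the text, so the toy data avoids ties.

```python
from collections import Counter
from typing import Iterable, List


def most_popular(training_sessions: Iterable[List[int]], k: int) -> List[int]:
    """Return the k items with the most occurrences in the training set."""
    counts = Counter(item for session in training_sessions for item in session)
    return [item for item, _ in counts.most_common(k)]


train = [[1, 2, 2], [2, 3], [3, 2, 3]]
top2 = most_popular(train, 2)
print(top2)
```

The returned list is the same for every user and every time step, which is what makes this baseline non-personalized.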

Evaluation Metrics and Hyperparameters
To evaluate the performance of each method, we employ two commonly used metrics, Recall@K and MRR@K, with K = 5, 10, 20. The first metric evaluates the fraction of ground truth items that have been ranked within the top-K items over all testing sessions, while the second metric, MRR, is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. For hyperparameters, we experimented with different values to find the best configuration of our method for each dataset. The configurations of our model are summarized in Table 4.
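The two metrics can be sketched as follows. This assumes one ground-truth item per prediction and assumes that a target ranked outside the top K contributes zero to MRR@K, a common convention that the text does not spell out; the function names and the toy rankings are hypothetical.

```python
from typing import List


def recall_at_k(ranked_lists: List[List[int]], targets: List[int], k: int) -> float:
    """Fraction of predictions whose target appears in the top-k ranked items."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)


def mrr_at_k(ranked_lists: List[List[int]], targets: List[int], k: int) -> float:
    """Mean reciprocal rank of the target, counting 0 when it is outside the top k."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)  # ranks are 1-based
    return total / len(targets)


ranked = [[5, 3, 9], [1, 2, 3], [7, 8, 4]]  # toy ranked lists, best item first
targets = [3, 3, 6]                         # toy ground-truth next items
print(recall_at_k(ranked, targets, 2), mrr_at_k(ranked, targets, 2))
print(recall_at_k(ranked, targets, 3), mrr_at_k(ranked, targets, 3))
```

Recall@K only checks whether the target is in the top K, while MRR@K also rewards placing it nearer the top, so the two metrics can rank models differently.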

Results
In this section, the model performances of the proposed models are shown, according to the experimental settings we stated in the previous section.

Model Performance
Tables 5-7 show the test set performances of all methods, in two groups, under the metrics of Recall@K and MRR@K: (1) For Reddit and Last.fm, we set K = 5, 10 and 20. (2) For 1/10 Reddit and 1/80 Last.fm, we set the same K as for the full versions of Reddit and Last.fm. For Tmall, to keep consistent with the evaluation in [29], we use Recall@10, Recall@20 and MRR@20 to evaluate methods, and cite the evaluation results of SWIWO and SWIWO-I from [29]. Relative scores are given compared to the strongest baselines. For all datasets, ASLM outperformed all baselines by large margins. Table 8 shows the validation set performances of three different methods under the metrics of Recall@K and MRR@K on the Reddit and Last.fm datasets. Please note that for the Reddit dataset, the validation set performances of ASLM-AP and ASLM-LHS are almost the same, while for both the Reddit and Last.fm datasets, the validation set performances of ASLM-AP and ASLM-LHS are all better than the test set performances by moderate margins. Table 9 shows the validation set performances of three different methods on the 1/10 Reddit and 1/80 Last.fm datasets. The test set performances of the two ASLM models on both the 1/10 Reddit and 1/80 Last.fm datasets are better than the validation set performances, with slightly larger margins than those for the full versions of the Reddit and Last.fm datasets, respectively.

Discussion
In this section, we discuss the superiority of the proposed models through comparison of performance and demonstrate their effectiveness on the session cold-start problem.

Comparison of Performance
For group A, from Table 5 we observe that ASLM-AP and ASLM-LHS consistently outperform all given methods under all measurements for testing cases on both the Reddit and Last.fm datasets by large margins. Specifically, ASLM-AP improves by 131.97% and 624.88% at Recall@5, and by 212.79% and 1033.00% at MRR@5, compared with the II-RNN-AP method for test cases on the Reddit and Last.fm datasets, respectively. The good performance of our model is attributed to the attention mechanism and the bi-directional LSTM we employed in the attention-based layer. One of the main characteristics of the attention mechanism is that when predicting the next item, it takes every event within the current session into consideration, which provides a sufficient basis for an accurate recommendation. As discussed in Section 3.3.2, one advantage of bi-directional LSTM over uni-directional LSTM is that it uses each event $e^u_{t,i} \in S^u_t$ to predict recommendations based on the event's past and future contexts, rather than just its past ones. (We conducted an experiment in which the bi-directional LSTM in the ASLM model was replaced with a uni-directional LSTM. The two performed almost the same, but the model with the bi-directional LSTM converged faster.) Therefore, compared to II-RNN without attention-based architecture, our model can take full advantage of each user's abundant long- and short-term behavior (especially short-term), to build a more comprehensive user profile and therefore provide a more precise recommendation. This indicates that ASLM is sensitive to the variation in the user's short-term intent and long-term preference through the attention network when enough user history behavior is given. The reason for the huge improvements of ASLM on the Last.fm dataset is that, as shown in Table 1, each user in Last.fm has an overwhelming number of sessions (namely 645.62 sessions per user), which is approximately 10.4 times that in Reddit (62.15 sessions per user). Furthermore, the average session length in Last.fm is 2.7 times that in Reddit. Table 8 similarly shows that the performances of ASLM for validation cases and for testing cases on the Reddit and Last.fm datasets are at the same level, which supports the validity of our model.
As expected for group B, Tables 6 and 7 show the same pattern: ASLM also outperforms all baseline methods under all measurements on the 1/10 Reddit, 1/80 Last.fm and Tmall datasets. For the Tmall dataset, the evaluation results in Table 7 show that ASLM-AP improves R@10, R@20 and MRR@20 by 59.51%, 36.27% and 132.17%, respectively, compared with the SWIWO-I method. Please note that, as shown in Table 1, the Tmall dataset has very few sessions per user (3.99 on average) and a short average session length (2.99) compared with Reddit and Last.fm. For 1/10 Reddit and 1/80 Last.fm, the user history information is reduced dramatically compared with the full Reddit and Last.fm datasets, so the performance of most models decreases significantly. However, both ASLM models still outperform the strongest baselines, as shown in Table 6. This demonstrates that ASLM can reach better performance even when it is severely short of both long-term and short-term user behavior.
The relative score decreases when K changes from 5 to 20 because the performance of the ASLM model is already good at K = 5, which leaves limited room for further improvement as K increases. As a result, the absolute Recall and MRR scores increase as K increases, while their relative improvements over the strongest baselines decrease throughout these tables.
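The effect described above can be illustrated with a minimal sketch of Recall@K, MRR@K and the relative improvement over a baseline. The function names and all numbers below are hypothetical toy values chosen for illustration, not results from the tables; the definitions assume one target item per prediction.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of predictions whose target item appears in the top-k list."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank, counting 0 when the target is outside the top k."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)
    return total / len(targets)


def relative_improvement(model_score, baseline_score):
    """Improvement of the model over the baseline, in percent."""
    return 100.0 * (model_score - baseline_score) / baseline_score


# Hypothetical toy rankings: the target "a" sits at rank 1, 2 and 3.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
recalls = [recall_at_k(ranked_lists, targets, k) for k in (1, 2, 3)]

# Hypothetical scores: both methods gain in absolute terms as K grows,
# but the model starts near its ceiling, so its relative lead shrinks.
gain_small_k = relative_improvement(0.80, 0.20)
gain_large_k = relative_improvement(0.90, 0.45)
```

In this toy setting the absolute recall never decreases as K grows, while the relative improvement drops because the baseline has more room to catch up than a model that is already close to its ceiling.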
Regarding the two versions of ASLM (ASLM-AP, which uses average pooling, and ASLM-LHS, which uses the last hidden state), Tables 5 and 8 show that, for the Reddit and Last.fm datasets, the performance of ASLM-AP on both test cases and validation cases is similar to that of ASLM-LHS. The same result is shown in Tables 6 and 9 for the 1/10 Reddit and 1/80 Last.fm datasets. For the Tmall dataset, as shown in Table 7, the advantage of ASLM-AP becomes obvious.
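The difference between the two summarization strategies can be sketched as follows. This is only an illustration of average pooling versus taking the last hidden state over a sequence of vectors; the input values are hypothetical, and the sketch makes no claim about which layer's states ASLM pools or about the rest of the architecture.

```python
def average_pooling(hidden_states):
    """Element-wise mean over all time steps (the idea behind the AP variant)."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def last_hidden_state(hidden_states):
    """Keep only the final time step (the idea behind the LHS variant)."""
    return list(hidden_states[-1])


# Hypothetical sequence of three 2-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
ap_summary = average_pooling(states)     # every step contributes equally
lhs_summary = last_hidden_state(states)  # only the most recent step is kept
```

Average pooling weights every time step equally, whereas the last hidden state relies on the recurrent network to have carried earlier information forward.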

Effectiveness on Session Cold-Start Problem
In addition to the overall recommendations reported above, we evaluate the effectiveness of ASLM-AP on recommendations for the first n time steps, for n = 1, ..., 5, L, where L is the maximum session length (L = 19 in our experiment). Figure 8 compares Recall@5 of ASLM-AP and II-RNN-AP at the first n recommendations of each session for the Reddit and Last.fm datasets. Notice that for Reddit and Last.fm, where abundant user history is given, our attention-based ASLM-AP model achieves R@5 > 0.9 when n = 1, and in most cases the R@5 of ASLM-AP increases throughout the session. ASLM-AP has an enormous advantage over II-RNN-AP at the start of a new session in the Reddit and Last.fm datasets, with improvements of 164.03% and 726.01%, respectively. Although II-RNN-AP narrows the gap as more events are observed in the current session, ASLM-AP holds the advantage to the end of the session, with improvements of 131.97% and 624.88% in the Reddit and Last.fm datasets, respectively. Similar results are shown in Figure 9 for the Tmall dataset. Although the Tmall dataset is severely short of both long-term and short-term user behavior, ASLM still achieves a Recall@5 of 0.4964, only slightly below the overall Recall@5 of 0.5050. As a result, the Attention-based Short-term Intent Layer can thoroughly reveal the variation in the user's short-term intent between events within the current session $S^{u}_{t}$. To sum up, ASLM significantly alleviates the cold-start problem.
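A minimal sketch of the per-position evaluation is given below. It assumes that "the first n time steps" means pooling all predictions made at positions 1 to n of every session; the function name and the toy hit flags are hypothetical and do not come from the experiments.

```python
def recall_first_n(session_hits, n):
    """Recall over the first n predictions of each session.

    session_hits: one list per session; each entry is True when the
    target item was inside the top-k list at that time step.
    """
    hits = 0
    total = 0
    for flags in session_hits:
        prefix = flags[:n]        # shorter sessions contribute what they have
        hits += sum(prefix)
        total += len(prefix)
    return hits / total if total else 0.0


# Hypothetical hit flags for three sessions of different lengths.
session_hits = [[True, False, True], [False, True], [True]]
cold_start_recall = recall_first_n(session_hits, 1)   # first step only
overall_recall = recall_first_n(session_hits, 19)     # n = L covers everything
```

With n = 1 only the first prediction of each session is scored, which isolates the session cold-start case; with n equal to the maximum session length the measure coincides with the overall recall.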

Implications of the Work
This system is designed to address situations where only the past decisions of the users are available (e.g., the users have not rated the items). This is a very common situation in which recommendations are still needed. They can be produced by estimating the drift in user preferences from past choices across items, and from the pattern and sequence of past choices across sessions, particularly in the current session.
ASLM can be applied by recommender systems in news, music, and e-commerce scenarios (e.g., Reddit, Last.fm and Tmall, respectively), where the user's interest drift is a central concern because it occurs frequently. As a data-driven method, the model depends on data availability and continuously requires the user's session data at run time. During the experiment, ASLM took about 28 min per epoch on the Reddit dataset, which contains 1,135,488 user sessions, using an NVIDIA GTX 1080 Ti graphics card. In other words, it took about 1.5 ms per session, which is appropriate for real-time recommendation. Therefore, the system can be deployed online and executed in real time, although we conducted the experiment offline.
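The per-session figure follows directly from the epoch time and the number of sessions reported above. The short check below reproduces the arithmetic; the function name is hypothetical, and the result is only as precise as the rounded "about 28 min" measurement.

```python
def per_session_ms(epoch_minutes, num_sessions):
    """Average processing time per session, in milliseconds."""
    return epoch_minutes * 60 * 1000 / num_sessions


# Values reported in the text: about 28 min per epoch, 1,135,488 sessions.
ms_per_session = per_session_ms(28, 1_135_488)
print(f"{ms_per_session:.2f} ms per session")
```

The result is slightly below 1.5 ms, consistent with the rounded value quoted in the text.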

Conclusions
In this work, we proposed an attention-based RNN model named ASLM for next-item recommendation. By integrating the user's short-term intent and long-term preference at the same time with a two-layer network, our model exploits the variation in the user's contextual preference, especially the short-term intent, to cope with the cold-start problem within sessions. We conducted experiments on three real-world datasets and two sub-datasets and demonstrated considerable improvements over strong baselines. Moreover, our proposed model provides accurate recommendations at the start of sessions, which substantially mitigates the cold-start problem.

Figure 1. An example of Amazon recommender system.

Figure 2. ASLM architecture. Two mutually exclusive methods are shown: (a) ASLM-AP, and (b) ASLM-LHS. Only one method is used at a time.

Figure 3. The main part of attention mechanism.

Figure 4. Distribution histograms of the number of sessions by the frequency of users for Reddit and Last.fm datasets.

Figure 5. Distribution histogram of the number of sessions by the frequency of users for Tmall dataset.

Figure 6. Distribution histograms of the number of events by the frequency of sessions for Reddit and Last.fm datasets.

Figure 7. Distribution histogram of the number of events by the frequency of sessions for Tmall dataset.

Figure 8. Comparison of cold-start effects at the start of a new session for Reddit and Last.fm datasets between ASLM-AP and II-RNN-AP.

Figure 9. Cold-start effect at the start of a new session for Tmall dataset. Please note that there is no data provided for cold-start problem by SWIWO method.

Table 1. Descriptive statistics for the datasets.

Table 2. Standard deviation and skewness of the number of sessions by the frequency of users for the datasets.

Table 3. Standard deviation and skewness of the number of events by the frequency of sessions for the datasets.

Table 5. Group A: Recall and MRR scores for the ASLM models and the baselines for the test sets in Reddit and Last.fm datasets.

Table 6. Group B: Recall and MRR scores for the ASLM models and the baselines for the test sets in 1/10 Reddit and 1/80 Last.fm datasets.

Table 7. Group B: Recall and MRR scores for the ASLM models and the baselines in Tmall dataset.

Table 8. Group A: Recall and MRR scores for the ASLM models and the baselines for the validation sets in Reddit and Last.fm datasets.

Table 9. Group B: Recall and MRR scores for the ASLM models and the baselines for the validation sets in 1/10 Reddit and 1/80 Last.fm datasets.