Comprehensive Empirical Evaluation of Deep Learning Approaches for Session-Based Recommendation in E-Commerce

E-commerce services sell more when users can quickly find items matching their interests. Consequently, recommendation systems have become a crucial part of any successful e-commerce service. Although various recommendation techniques can be used in e-commerce, considerable attention has been drawn to session-based recommendation systems in recent years. This growing interest is due to security concerns over collecting personalized user behavior data, especially in light of recent general data protection regulations. In this work, we present a comprehensive evaluation of the state-of-the-art deep learning approaches used in session-based recommendation. In session-based recommendation, a recommendation system relies on the sequence of events made by a user within the same session to predict and recommend other items that are likely to match their preferences. Our extensive experiments investigate baseline techniques (e.g., nearest neighbors and pattern mining algorithms) and deep learning approaches (e.g., recurrent neural networks, graph neural networks, and attention-based networks). Our evaluations show that advanced neural-based models and session-based nearest neighbor algorithms outperform the baseline techniques in most scenarios. However, we found that these models suffer more with long sessions, when there is drift in user interests, and when there are not enough data to correctly model different items during training. Our study suggests that hybrid models combining different approaches with baseline algorithms, chosen according to dataset characteristics, could yield substantial improvements in session-based recommendation. We also discuss the drawbacks of current session-based recommendation algorithms and further open research directions in this field.


Introduction
Most e-commerce services use recommendation systems to help their customers find their items of interest based on their navigation behavior through these services. Recommendation systems are considered a category of information-filtering systems that aim to predict user preferences based on their behavior. They have become a crucial part of any successful business that helps satisfy user needs and boost the business sales volume [1]. Recommendation systems have been used in a wide range of domains including images [2], music [3], videos [4], and even news [5] recommendations.
Various types of recommendation systems have been proposed in the literature, categorized as time-aware and session-based recommendation systems. The former can adapt to temporal dynamics and the drift of user preferences over time [6,7], as can recommendation systems based on social information datasets [8,9]. The latter relies on the user navigation behavior within the current session.
Deep neural networks are a subset of machine learning technologies that have attracted significant attention in the past decade. Such techniques have achieved outstanding performance in a wide range of domains, including natural language processing, medical diagnosis, speech recognition, and computer vision. In practice, the main advantage of deep learning over traditional machine learning techniques is its automatic feature extraction ability, which allows complex functions mapping the input space to the output space to be learned without human intervention [11]. Recently, different approaches have been proposed to use deep neural networks in recommendation systems [12,13]. In particular, different deep learning models have been used for modeling the sequence of user navigation behavior in online services for next-item recommendation [14][15][16][17]. These works showed competitive performance compared to traditional approaches such as sequential pattern mining techniques, nearest neighbor algorithms, and traditional Markov models [18].
Few studies have been conducted to evaluate session-based algorithms. Jannach et al. compared the heuristics-based nearest neighbor baseline algorithm with a basic recurrent neural network (RNN) [19]. The results of this study showed that deep learning methods fall behind basic algorithms such as neighborhood methods. However, during the last couple of years, many advancements have been made in applying deep learning to session-based recommendation, leading to the rise of several neural-based architectures. Ludewig et al. [18], for instance, conducted a study comparing many baseline algorithms in session-based recommendation using four datasets in the e-commerce field and four others in music and playlist recommendation. Although this study included only a single neural-based model and lacked evaluations of the different deep learning approaches in the literature, it was recently extended to include state-of-the-art deep learning models [20]. However, the empirical evaluations conducted in [18,20], which trained models on full datasets, make it difficult to understand the drawbacks of each model and why exactly a specific model outperforms others on a particular dataset.
In this paper, we extend previous studies [18,20,21] that compared the overall performance of 11 simple algorithms and 6 deep-learning models on four e-commerce datasets. In this work, we focus on studying the effect of varying the characteristics of a dataset on the performance of each model. In particular, our main contributions are as follows:
• We carry out an extensive evaluation and benchmarking of the state-of-the-art neural-based approaches in session-based recommendation, including recurrent neural networks, convolutional neural networks, and attention-based networks, along with a group of the most popular baseline techniques in the recommendation field, such as nearest neighbors, frequent pattern mining, and matrix factorization. Our experiments elaborate on the evaluation process based on various dataset characteristics. Hence, we divide the datasets according to the values of various characteristics such as session length, item frequency, and data size. These experiments revealed insights that could help understand when some models perform poorly and open new research horizons for the improvements needed for each model, which were difficult to deduce from the previous studies.
• An interpretable decision tree model is used to accurately recommend the best-performing model according to the dataset characteristics.
• Current drawbacks of session-based recommendation systems are discussed with proposed solutions to overcome these issues, which could yield better results in some domains.
We divide our benchmarking study into separate sets of experiments, each aiming to answer a different research question. First, we evaluated the performance of different models against different session lengths and item frequencies. Second, we investigated the effect of the recency of the collected data on the models' performance, which could help avoid data leakage problems and deceptive accuracy during training. Third, the effect of the training data size is evaluated for different models. Finally, we present a comparison of different approaches in terms of time and memory consumption during both training and inference. The aim of this study was mainly to determine and understand the main dataset characteristics that profoundly affect different models' performance by carrying out a micro-analysis evaluation of session-based algorithms on real-world e-commerce datasets. This study could help improve the selection of the recommendation algorithm according to the target dataset and highlight the weaknesses of different models for further improvement.
The paper is organized as follows. In Section 2, a short survey of session-based recommendation systems is discussed. Section 3 presents a detailed description of different algorithms and models evaluated in our experiments. Section 4 describes the experiment setup and the research questions to be answered, and Section 5 shows the results and discussion of the evaluation experiments. Finally, in Section 6, the main insights of our study are summarized in addition to the thoughts of future research directions.

Review of Deep Learning Approaches in Session-Based Recommendation
Session-based recommendation is a particular type of sequence-aware recommendation, which is a general class of recommendation systems. The decisions made by these systems are mainly based on the short-term user intention defined by a session. This session is represented by a set of user-item interaction pairs in a short period of time. Furthermore, various types of attributes can characterize these interactions, such as user attributes (e.g., gender and age), item attributes (e.g., color and size), and action types (e.g., add-to-cart and add-to-wish-list). The input of these recommendation systems is a chronologically ordered set of user-item actions, and the output is a score list ranking items based on the likelihood that user preferences match these items [22]. Even though e-commerce is the most prominent application for session-based recommendation, there are many other applications, such as recommendations for music playlists, films, and online courses [23].
Early research works tackled session-based recommendation problems with the nearest neighbors and frequent pattern mining techniques [24]. However, these works are instance-based algorithms that take a significant amount of time to make predictions. Therefore, they are not suitable for real-time use cases, such as e-commerce. Later on, other research works proposed using more advanced techniques, such as Markov chain models in sequence modeling [25,26]. The problem of the state-space explosion in Markov models was treated using the attributes of some items to limit the space of the next items to be recommended [27]. Additionally, classical matrix factorization techniques were combined with Markov chains in different variations and applied in a wide range of domains as in [28,29].
In recent years, different deep learning approaches have been adopted in session-based recommendation. The main advantage of deep learning approaches is their ability to automatically extract features, which allows complex functions mapping the input space to the output space to be learned without human intervention [11]. For example, a neural-based model named GRU4Rec used gated recurrent units (GRUs) in RNNs to predict the next item to be clicked by the user [14]. The model was trained by minimizing loss functions that include pairwise losses comparing the target item's score with those of negative samples, where high-scoring negative samples contribute most to the loss. The proposed losses showed excellent performance by correctly ranking the predicted items and overcoming the vanishing gradient problem in RNNs [30].
The GRU4Rec architecture was further extended by using a modified version of the original negative sampling approach, where the likelihood score of the next recommended item is calculated for a subset of items as it would be impractical to do it for the whole list of items [30]. The new sampling method uses additional negative samples shared by all the session sequences within the same mini-batch. Additionally, it updates a small percentage of the network weights for each mini-batch to make the training process faster. These samples were chosen based on the items' popularity, which gives more chances to include most of the high scoring negative examples. This approach leads to excellent improvement in the performance of the model. Furthermore, the same architecture was adopted to support multiple item features instead of unique identifiers only in a parallel training scheme. It was evaluated against item K-nearest neighbors showing a good improvement [31].
Furthermore, Quadrana et al. proposed a method for adapting RNNs to personalized session-based recommendation with cross-session information transfer among user sessions, using a hierarchical RNN model in which the output hidden state of the network for a particular session is passed as input to a higher-level RNN for the next session of the same user [32]. A hybrid architecture of two RNNs was proposed for personalized session-based recommendation that mainly aims to target the session cold-start problem by learning from the user's recent personal sessions [33]. Convolutional neural networks (CNNs) have also been used in session-based recommendation. In particular, Tuan et al. used a 3D-CNN with character-level encoding to combine session clicks with the textual descriptions of the items to generate recommendations [34]. Similarly, a generative CNN was proposed in which the clicked items are embedded into a two-dimensional matrix and treated as input images to the CNN [35]. Graph neural networks were recently used to capture complex transitions among items by modeling the sequence of events of a session as graph-structured data, even without adequate user behavior in a session [8]. Wang et al. proposed a novel framework using two parallel memory encoders to make use of the information of collaborative neighborhood sessions in addition to the current session information, followed by a selective fusion of both encoders' outputs [36].
After the introduction of the attention concept in neural networks, which led to great improvements in neural machine translation tasks [37,38], attention networks were widely adopted in session-based recommendation [16,39,40]. For instance, a hybrid encoder with attention was used to model user sequential behavior [39], outperforming long-term memory models such as GRU4Rec. Furthermore, a short-term attention priority model was introduced in which attention weights are computed from the total session context and enhanced by the current user's interest, represented by the last clicked item [16]. Additionally, Sun et al. adopted the current state-of-the-art BERT transformer network [41], widely used in the natural language processing domain, for personalized session-based recommendation [42]. Most neural-based solutions in session-based recommendation generate a static representation of users' long-term interests. Such a representation can be an issue, as its importance in predicting the next recommended item is dynamic and also related to short-term preferences. Hence, a co-attention network was proposed to recognize the dynamic interaction between the user's long- and short-term interests and generate a co-dependent representation of the user's interests [40]. However, the usage of transformer networks in generalized session-based recommendation with the incorporation of item features is still an open research area. Table 1 summarizes current state-of-the-art neural network architectures for personalized/non-personalized session-based recommendation.

Detailed Evaluated Approaches
In this section, all the algorithms covered in our evaluation study, as shown in Figure 2, are explained in detail. For the sake of simplicity, the notation in Table 2 is used throughout.


Baseline Approaches
We selected a set of five baseline algorithms to be included in this study based on the previous study in [18]. In particular, our selection was based on two different criteria. First, we selected at least one method from each family of algorithms, which showed excellent performance at different session-based recommendation tasks. Second, we chose the method with the best overall performance compared to other methods within the same family. Therefore, the selected algorithms are as follows: session-based popular products (S-POP) as a simple heuristic algorithm [45]; and simple association rules (AR) and simple sequential rules (SR) as representatives of frequent pattern mining algorithms [46]. Vector session-based K-nearest neighbors (VSKNN) [18] and session-based matrix factorization (SMF) [18] were selected from the nearest neighbors and factorization-based methods, respectively.

Session-Based Popular Products
S-POP is one of the most widely used baseline recommendation algorithms [45,47]. These algorithms make a recommendation based on the most frequent item viewed by the user in the current session. In short, if a user clicked on an item I_n multiple times during the same session, this reflects a clear sign of the user's interest in that item. Hence, recommending the same item to the user again is a reasonable decision. In some cases, the S-POP recommendation process is limited to the top K popular items while ignoring the rest of the items. This constraint ensures that the recommended items belong to the most popular ones among all users.
The score of a specific item I_n in a session S_t is computed by counting its occurrences in the session:

score(I_n, S_t) = Σ_{x ∈ S_t} 1(x = I_n)

where 1(·) is the indicator function from Table 2.
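As a concrete illustration, the in-session counting behind S-POP can be sketched in a few lines (a toy sketch; breaking ties by global popularity is one common variant, not a fixed part of the algorithm):

```python
from collections import Counter

def spop_scores(session_items, global_counts):
    """S-POP sketch: score items by their click count in the current session;
    global popularity (a common variant) breaks ties between equal counts."""
    in_session = Counter(session_items)
    return {item: (in_session[item], global_counts.get(item, 0))
            for item in in_session}

scores = spop_scores(["a", "b", "a", "c"], {"a": 10, "b": 50, "c": 5})
ranked = sorted(scores, key=scores.get, reverse=True)  # ["a", "b", "c"]
```

Item "a" wins on its in-session count; "b" beats "c" only through the global tie-break.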

Simplified Association Rules
Association rules (ARs) are a frequently used pattern mining approach that captures frequent patterns of events of size N and recommends the most frequent ones [46]. In the case of session-based recommendation, Ludewig et al. [18] used a simplified version of association rules of size N = 2 to keep the computational complexity reasonable. In their work, the occurrence of any two subsequent items (I_i, I_j) in the same session S was stored. During prediction, the last item viewed by the user, x_L, was used to find all candidate similar items by choosing the most frequent item pairs (x_L, I_n). An arbitrary item I_n was therefore recommended if its score, the stored frequency of the pair (x_L, I_n), was among the top predicted ones.

Simplified Sequential Rules

Sequential rules (SR) are also a frequent pattern mining approach. Here, the order of the session events is taken into account, in contrast with AR, which depends on the support of the items only. A simplified form of sequential rules is used such that a rule is created between two items (I_i, I_j) when they appear in sequential events [21]. Each rule in SR is assigned a weight that is a function of the linear distance between the items (I_i, I_j), as in Equation (3); rules between proximate events are assigned larger weights than rules between distant events. The scores of different items to be recommended are accumulated using the distance weighting

dis(j, k) = 1 − 0.1(j − k) if j − k < 10, and dis(j, k) = 0 otherwise.
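The SR rule-mining step can be sketched directly from the distance weighting above (a toy sketch; the `max_dist` cutoff and the session data layout are illustrative assumptions):

```python
from collections import defaultdict

def fit_sr(sessions, max_dist=10):
    """Simplified sequential rules: for items at positions k < j in a session,
    add weight dis(j, k) = 1 - 0.1*(j - k) to the rule (s[k] -> s[j]);
    pairs more than max_dist apart contribute nothing."""
    rules = defaultdict(float)
    for s in sessions:
        for j in range(1, len(s)):
            for k in range(max(0, j - max_dist + 1), j):
                rules[(s[k], s[j])] += 1 - 0.1 * (j - k)
    return rules

rules = fit_sr([["a", "b", "c"], ["a", "c"]])
# (a -> c) accumulates 0.8 (gap 2) from the first session and 0.9 from the second
```

At prediction time, the candidate scores for the next item are simply the accumulated rule weights for rules starting at the user's last clicked item.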

Vector Multiplication Session-Based K-Nearest Neighbors
Nearest neighbor algorithms show excellent performance in session-based recommendation [21]. However, they have many variant schemes that can be applied according to the domain type, such as item-based nearest neighbors [48], which predicts items similar to the last one viewed by the user. On the other hand, session-based nearest neighbors consider the viewed items in the whole session and try to find neighboring sessions with similar items to be used in predicting the next recommended items [24]. Ludewig et al. [18] evaluated multiple variants of nearest neighbor algorithms. In their work, it was shown that vector multiplication session-based K-nearest neighbors (VSKNN) outperformed pattern mining and matrix factorization methods on most evaluated datasets. Additionally, it has a competitive performance rivaling RNNs and even outperforms them on multiple datasets. VSKNN is a session-based nearest neighbor algorithm in which recent items clicked by the user take larger weights than older items, so more emphasis is given to the user's recent events. The score of an item I_n as the next item is computed by summing the similarity distances sim(S_t, S_i) over the neighboring sessions S_i that contain I_n, where sim(S_t, S_i) can be set to the cosine distance, and W_t(S_t) is a weighting function of the items according to their positions in the session S_t. This weighting function usually gives higher weights to the recently clicked items [18].
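The scoring idea can be sketched as follows (a simplified sketch: the linear position weighting and the cosine-like similarity here are illustrative choices, not the exact functions from [18]):

```python
import math
from collections import defaultdict

def vsknn_scores(current, neighbor_sessions):
    """VSKNN-style scoring sketch: weight the current session's items by
    recency, compute a cosine-like similarity to each neighbor session,
    and accumulate that similarity on the neighbor's other items."""
    w = {item: (idx + 1) / len(current) for idx, item in enumerate(current)}
    norm = math.sqrt(sum(v * v for v in w.values()))
    scores = defaultdict(float)
    for sess in neighbor_sessions:
        common = set(w) & set(sess)
        if not common:
            continue
        sim = sum(w[i] for i in common) / (norm * math.sqrt(len(set(sess))))
        for item in set(sess) - set(w):  # candidates the user has not clicked yet
            scores[item] += sim
    return dict(scores)

scores = vsknn_scores(["a", "b"], [["a", "c"], ["d", "e"]])
# only "c" gets a score: it comes from the one neighbor that overlaps with the session
```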

Session-Based Matrix Factorization
SMF is a matrix factorization-based approach designed for the task of session-based recommendation [18]. This approach was inspired by factorized personalized Markov chains [29,49] for sequential recommendation tasks. In SMF, classical matrix factorization and factorized Markov chains are combined in a hybrid approach. In particular, the latent user vector is replaced by an embedding vector that represents the current session. During prediction, the score of a candidate item is computed as the weighted sum of the whole session's preferences and the sequential dynamics representing the transition probability from the user's last clicked item to the candidate item. We used the model implementation by Ludewig et al. [18]. SMF showed better performance than other factorization-based methods over multiple datasets.

Deep Learning Approaches
Many deep learning architectures were proposed in the literature for session-based recommendation. These architectures vary in the types of their layers. For instance, Hidasi et al. [14] presented the first study using RNNs with GRUs in session-based recommendation. Tuan et al. [34] and Yuan et al. [35] used convolutional networks in modeling the session context. Li et al. and Liu et al. [16,39] proposed different attention mechanisms to enhance the performance of RNNs. Recently, Wu et al. [17] exploited graph neural networks in session-based recommendation.
In our study, we limited the selection to the current state-of-the-art and well-cited architectures proposed within the last four years, published in top-tier venues, and implemented as open source. Additionally, we refined our list to select the models that can make generalized (non-personal) predictions without the need to collect a personal user profile, thereby easily complying with the GDPR requirements (Section 1). The final list of chosen architectures includes the neural item embedding algorithm (Item2Vec) proposed by Barkan et al. [43], the extended version of the GRU-based network (GRU4Rec+) by Hidasi et al. [50], the neural attentive network (NARM) by Li et al. [39], the graph neural network (SRGNN) proposed by Wu et al. [17], the short-term attention priority network (STAMP) by Liu et al. [16], the convolutional generative network for session-based recommendation (NextItNet) [35], and the collaborative neural network with parallel memory modules (CSRM) proposed by Wang et al. [36].

Neural Item to Vector Embedding
Barkan et al. introduced Item2Vec, which converts items into embedding vectors in a latent space based on the session context of clicked items. This idea is an adaptation of the Word2Vec algorithm, which efficiently converts words into a vector space in a way that enhances neural machine translation performance by assigning close vectors to similar words used in the same context [51]. Similarly, Item2Vec uses skip-gram with negative sampling neural word embedding to determine vector representations of different items that infer the relationship between an item and its surrounding items in a session. During the prediction phase, candidate items obtain scores according to the similarity distance between their embedding vectors and the average of the embedding vectors of the session items [43].
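The prediction step described above can be sketched with plain cosine similarity (the embeddings here are toy vectors; in practice they would come from skip-gram training on session sequences):

```python
import numpy as np

def item2vec_recommend(session, embeddings, k=3):
    """Rank unseen items by cosine similarity between their embedding and
    the average embedding of the current session's items."""
    query = np.mean([embeddings[i] for i in session], axis=0)
    qn = np.linalg.norm(query)
    scores = {item: float(v @ query / (np.linalg.norm(v) * qn))
              for item, v in embeddings.items() if item not in session}
    return sorted(scores, key=scores.get, reverse=True)[:k]

emb = {"a": np.array([1.0, 0.0]), "b": np.array([0.9, 0.1]),
       "c": np.array([0.0, 1.0]), "d": np.array([-1.0, 0.0])}
print(item2vec_recommend(["a"], emb, k=2))  # ['b', 'c']
```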

Gated Recurrent Neural Networks for Session-Based Recommendation
One of the first successful approaches for using RNNs in the recommendation domain is the GRU4Rec network [14], in which an RNN with GRUs is used for session-based recommendation. A novel training mechanism called session-parallel mini-batches is used in GRU4Rec, as shown in Figure 3. Each position in a mini-batch belongs to a particular session in the training data. The network finds a hidden state for each position in the batch separately, but this hidden state is kept and used in the next iteration at the positions where the same session continues in the next batch. It is erased at the positions where new sessions start in the next batch. Thus, the network is always updated from the beginning of each session and used to predict its subsequent events.
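The mechanism can be sketched as a generator (simplified: real implementations also mask the tail of the batch when sessions run out, which is omitted here):

```python
def session_parallel_batches(sessions, batch_size=2):
    """GRU4Rec-style session-parallel mini-batches (sketch): each batch slot
    walks its own session; when that session ends, the slot is refilled with
    the next session and flagged so its hidden state can be reset."""
    slot = list(range(batch_size))   # which session occupies each slot
    pos = [0] * batch_size           # event cursor inside each slot's session
    nxt = batch_size                 # index of the next unused session
    reset = [True] * batch_size
    while True:
        inputs = [sessions[slot[b]][pos[b]] for b in range(batch_size)]
        targets = [sessions[slot[b]][pos[b] + 1] for b in range(batch_size)]
        yield inputs, targets, list(reset)
        reset = [False] * batch_size
        for b in range(batch_size):
            pos[b] += 1
            if pos[b] + 1 >= len(sessions[slot[b]]):   # session exhausted
                if nxt >= len(sessions):               # no sessions left
                    return
                slot[b], pos[b], reset[b] = nxt, 0, True
                nxt += 1

batches = list(session_parallel_batches([["a", "b", "c"], ["d", "e"], ["f", "g"]]))
# step 1: inputs ["a", "d"], targets ["b", "e"] (both slots fresh)
# step 2: inputs ["b", "f"], targets ["c", "g"] (slot 2 reset for session 3)
```

Each yielded step feeds one input item per slot and predicts the next item of the same session, so the hidden state at a slot only ever mixes events from a single session.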
The GRU4Rec architecture is composed of an embedding layer followed by multiple optional GRU layers, a feed-forward network, and a softmax layer for output score predictions for candidate items. The session items are one-hot-encoded in a vector representing all items' space to be fed into the network as input. On the other hand, a similar output vector is obtained from the softmax layer to represent the predicted ranking of items. Additionally, the authors designed two new loss functions, namely Bayesian personalized ranking (BPR) loss and regularized approximation of the relative rank of the relevant item (TOP1) loss.
BPR uses a pairwise ranking loss function, averaging over several sampled negative items the comparison between the target item's score and each negative item's score. TOP1 is a regularized approximation of the relative rank of the relevant item. Later, Hidasi et al. [30] extended their work by modifying the two loss functions, solving the vanishing gradient issue faced by TOP1 and BPR when the negative samples have a very low predicted likelihood that approaches zero. The newly proposed losses merge knowledge from deep learning with the learning-to-rank literature. The evaluation of the new extended version shows clear superiority over the older version of the network. Thus, we included the extended version of the GRU4Rec network, denoted GRU4Rec+, in our evaluation study.
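The BPR-max idea behind the extended losses can be sketched numerically (a sketch of the published formulation; the regularization constant `lam` and the numerical-stability epsilon are implementation choices):

```python
import numpy as np

def bpr_max_loss(r_target, r_negs, lam=1.0):
    """BPR-max sketch: weight each pairwise sigmoid(r_target - r_neg) by a
    softmax over the negative scores, so high-scoring negatives dominate
    and gradients do not vanish; regularize the negatives' scores."""
    a = np.exp(r_negs - r_negs.max())
    s = a / a.sum()                                  # softmax weights over negatives
    sig = 1.0 / (1.0 + np.exp(-(r_target - r_negs)))
    return -np.log(np.sum(s * sig) + 1e-24) + lam * np.sum(s * r_negs ** 2)
```

Raising the target score lowers the loss, which is exactly the behavior the ranking objective requires.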

Neural Attentive Session-Based Recommendation
NARM is a session-based recommendation system based on sequence modeling with an attention mechanism [39]. The main advantage of this model is that it addresses the limitations of long-term memory models such as GRU4Rec (Section 3.2.2). The model is characterized by hybrid encoders with an attention network that model the user's sequential behavior and capture the main purpose of the session, combined into a unified session representation.
The NARM architecture has two types of encoders:
1. The global encoder, a GRU network that takes the entire previous user interactions during the session as input and produces the user's sequential behavior as output.
2. The local encoder, a GRU network similar to the global encoder; its role, however, is to involve an item-level attention mechanism that allows the decoder to dynamically select a linear combination of different items from the input sequence and to focus more on the important items that capture the user's main purpose within a particular session.
Finally, both encoders' outputs are concatenated with each other to form an extended representation of the session. They are fed again into a bi-linear decoder along with item embedding vectors to compute the similarity score between the current session representation and candidate items to be used in ranking the items to be predicted next.
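The concatenation-plus-bilinear-decoding step can be sketched with toy vectors (dimensions and random values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_items = 4, 10
c_global = rng.normal(size=d)              # sequential-behavior summary (global encoder)
c_local = rng.normal(size=d)               # attention-weighted purpose (local encoder)
c_t = np.concatenate([c_global, c_local])  # unified session representation

item_emb = rng.normal(size=(n_items, d))   # candidate item embeddings
B = rng.normal(size=(d, 2 * d)) * 0.1      # bi-linear decoding matrix
scores = item_emb @ (B @ c_t)              # similarity of each candidate to the session
ranking = np.argsort(-scores)              # items ordered for recommendation
```

The bi-linear matrix B projects the doubled-width session code back into the item embedding space, so scoring reduces to a dot product per candidate.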

Short-Term Attention/Memory Priority Model
STAMP is one of the approaches that replace complex recurrent computations in RNNs with self-attention layers [16]. The model presents a novel attention mechanism in which the attention scores are computed from the user's current session context and enhanced by the session's history. Thus, the model can capture user interest drifts, especially during long sessions, and outperform approaches like GRU4Rec [14] that use long-term memory but are still not efficient in capturing such drifts. Figure 4 shows the model architecture, where the input is two embedding vectors (E_L, E_St). The former denotes the embedding of the last item x_L clicked by the user in the current session, which represents the short-term memory of the user's interest. The latter represents their overall interest through the full session's clicked items. The E_St vector is computed by averaging the item embedding vectors throughout the whole session memory (x_1, x_2, . . . , x_L). An attention layer is used to produce a real-valued vector E_a; this layer is responsible for computing the attention weights corresponding to each item in the current session. As such, we avoid treating each item in the session as equally important and pay more attention only to related items, which improves the capture of drifts in the user's interest. Both E_a and E_L flow into two multi-layer perceptron networks that are identical in shape but have separate, independent parameters for feature abstraction. Finally, a trilinear composition function, followed by a softmax function, is used to calculate the likelihood of the available items being clicked next by the user, to be used in the recommendation process.
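The flow just described can be sketched end to end in numpy (toy dimensions and random weights; the exact attention parameterization follows the description only loosely):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 8, 20
items = rng.normal(size=(n_items, d)) * 0.1      # toy item embeddings
session = [3, 7, 3, 12, 5]                       # clicked item ids
X = items[session]
e_last, e_sess = X[-1], X.mean(axis=0)           # E_L and E_St

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# attention weights from each item, the last click, and the session average
W1, W2, W3 = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
w0 = rng.normal(size=d)
alpha = sigmoid(X @ W1.T + e_last @ W2.T + e_sess @ W3.T) @ w0
e_att = alpha @ X                                # attended session vector E_a

# identically shaped MLPs with independent parameters, then trilinear scoring
A, B = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
h_s, h_t = np.tanh(e_att @ A), np.tanh(e_last @ B)
logits = items @ (h_s * h_t)                     # trilinear <h_s, h_t, item>
probs = np.exp(logits) / np.exp(logits).sum()    # next-click distribution
```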

Simple Generative Convolutional Network
NextItNet was proposed to use convolutional neural networks in session-based recommendation. The session made by a user is converted into a two-dimensional latent matrix and fed into a convolutional neural network like an image [35].
NextItNet is considered an extension of the recent convolutional sequence embedding recommendation model (Caser) by Tang et al. [44]. However, NextItNet addresses the two main limitations of applying CNNs to sequence modeling in Caser, which are most obvious in long sessions. First, the item sequences in a session can have variable lengths, which means that a large number of different-sized images would be needed to represent a session, and fixed-size convolutional filters may fail in such cases. Caser handles this with large filters, whose width matches the image width of an item inside the session sequence, followed by max-pooling layers, to ensure that the produced feature maps have the same length. Second, these filters are not able to find well-representing embedding vectors for the session items. In NextItNet, the large number of inefficient convolutional filters is replaced with a series of one-dimensional dilated convolution layers. The dilated layers are responsible for increasing the receptive field and dealing with different session lengths instead of the standard 2D convolution layers. The max-pooling layers are thus omitted, as they cannot distinguish whether an important feature in the map occurs once or multiple times, and they ignore the position of these features. Additionally, NextItNet makes effective use of residual blocks, which ease the optimization of much deeper networks than the shallow convolutional network in Caser, which cannot model complex relations between items in a user session.
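The receptive-field argument for the dilated stack can be made concrete (kernel size 3 and doubling dilations are typical choices, assumed here for illustration):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal 1-D dilated convolutions:
    each layer adds (kernel_size - 1) * dilation positions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# four layers with doubling dilation already see 31 past events,
# while four undilated layers of the same width see only 9
print(receptive_field(3, [1, 2, 4, 8]), receptive_field(3, [1, 1, 1, 1]))
```

This is why a short stack of dilated layers can cover long sessions that would otherwise require either very deep plain convolutions or the pooling layers NextItNet removes.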

Session-Based Recommendation with Graph Neural Networks
SRGNN was introduced recently by Wu et al. [17]. The session sequences are modeled as graph-structured data, and the graph neural network (GNN) task is to capture the complex transitions among items. This architecture was proposed mainly to solve two problems with other approaches. First, most other models cannot estimate the user's interest without adequate interactions in a session. Second, most models focus on single-way transitions between items and neglect the transitions among the context.
Each session is modeled as a separate sub-graph, in which a node represents an item and an edge represents a user interaction with that item. Session S_t in Figure 5 is shown as an example of a session sub-graph. Each edge is assigned a normalized weight calculated by dividing the edge's occurrence count by the out-degree of that edge's starting node. Then, each session sub-graph is processed one by one through a gated GNN to produce an embedding vector for each node. The role of SRGNN is to capture the complex transitions in the session context and generate accurate corresponding item embedding vectors. This method can be adapted when item nodes have multiple features, such as price, color, size, and brand, by concatenating them with the node embedding vector. Furthermore, the session embedding vector adds information about the session's local embedding vector, defined by the last clicked item's vector (E_7 in Figure 5), and the global embedding vector E_St, defined by an attention-based aggregation of all the previous items' vectors. This hybrid embedding approach performs a linear transformation over the concatenation of both the local and global embedding vectors, followed by a softmax layer to predict the next-item probabilities.
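The edge-weight normalization for a session sub-graph follows directly from the definition above and can be sketched in a few lines (item ids are illustrative):

```python
from collections import Counter

def session_graph_weights(session):
    """Build the session sub-graph's outgoing edges and normalize each edge's
    count by the out-degree of its start node, as described for SRGNN."""
    edges = Counter(zip(session, session[1:]))
    out_degree = Counter()
    for (u, _v), c in edges.items():
        out_degree[u] += c
    return {(u, v): c / out_degree[u] for (u, v), c in edges.items()}

g = session_graph_weights(["v1", "v2", "v3", "v2", "v4"])
# v2 has two outgoing edges (to v3 and v4), so each gets weight 0.5
```

The resulting weighted adjacency is what the gated GNN propagates over when producing the node embeddings.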

Collaborative Session-Based Recommendation Machine
A hybrid framework applying collaborative neighborhood information to session-based recommendation was proposed by Wang et al. [36], who hypothesized that sessions neighboring the current session can contain useful information for improving the recommendation system's predictions, even when those sessions were made by different users.
The architecture, shown in Figure 6, includes two main encoders. First, the inner memory encoder models the user's behavior during the current session using an RNN with an attention mechanism, fed with the hidden state of the network from the previous step h_{t-1} and the current session S_t items. This encoder outputs C_Inner, a concatenation of two vector embeddings of the current session behavior: one representing the whole session's items and one representing the key items clicked during the session. Second, the outer memory encoder looks for neighborhood sessions that contain patterns similar to the current session within a subset of the recently stored sessions (S_{t-1}, S_{t-2}, ...), which are used to enhance the recommendation process. The final output of the outer memory encoder, C_Outer, represents the influence of the other sessions' representations in the neighborhood memory network M on the current session. The final current session representation C_t is formed by a selective fusion of both encoders' outputs. Finally, the output scores for all items are predicted using a bi-linear decoding scheme between the embedding of item I_i and the final representation vector of the current session C_t, followed by a softmax layer. The two other main advantages of CSRM are:

1. Storing recent sessions and looking for neighborhoods within these sessions can be beneficial, especially in e-commerce, where temporal drifts in user interests occur frequently.
2. The ease of including different item features in the item embedding vector, which can enhance the recommendations' accuracy.
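The selective fusion between the two encoders' outputs can be sketched as a learned sigmoid gate (a hypothetical simplification with made-up weight matrices, not CSRM's exact fusion layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_fusion(c_inner, c_outer, W_i, W_o, b):
    """Gated fusion of the inner (current-session) and outer (neighborhood)
    representations: a sigmoid gate decides, per dimension, how much of each
    encoder's output to keep in the final session representation."""
    gate = sigmoid(c_inner @ W_i + c_outer @ W_o + b)
    return gate * c_inner + (1.0 - gate) * c_outer

rng = np.random.default_rng(0)
d = 8  # illustrative embedding size
c_t = selective_fusion(rng.normal(size=d), rng.normal(size=d),
                       rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                       np.zeros(d))
print(c_t.shape)
```

Because the gate lies in (0, 1), the fused vector is always a convex, per-dimension blend of the two encoder outputs.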

Datasets
All experiments were based on benchmark datasets in the e-commerce domain.

RecSys
The first dataset was collected by the YOOCHOOSE corporation (https://www.yoochoose.com/ accessed on 1 August 2022) and published in the RecSys Challenge 2015 (http://2015.recsyschallenge.com/ accessed on 1 August 2022). The dataset contains a collection of sessions from a retailer, where each session includes the click events that the user performed within it. The data were collected during ≈6 months in 2014, reflecting the clicks and purchases performed by the users of an online retailer in Europe. The main characteristics that distinguish this dataset from the others are having the largest number of clicks and the smallest number of items, which leads to a high presence of most of the items in the dataset. Following the previous literature, we used the last day's sessions as the testing set and the remaining sessions as the training set. This dataset is referred to as RECSYS [14,50].

Diginetica
The Diginetica dataset was used in CIKM Cup 2016 (https://cikm2016.cs.iupui.edu/cikmcup/ accessed on 1 August 2022) for the personalized e-commerce search challenge. The dataset was provided by the DIGINETICA corporation (http://diginetica.com/ accessed on 1 August 2022) and contains anonymized search and browsing logs, product data, and anonymized transactions collected over five months from e-commerce websites. We used only the transaction data in our experiments. Similar to the RECSYS dataset, we used the last-day sessions as the testing set and the remaining sessions as the training set. We use the name CIKMCUP to refer to this dataset in the rest of this paper.

TMall
TMall is a large dataset that consists of interaction logs from the TMall e-commerce website (https://www.tmall.com/ accessed on 1 August 2022). The dataset was collected over six months and includes user-item view logs; however, the time recorded for each event was at the granularity of days. Thus, we treated the transactions made by the same user in one day as one session, which leads to much longer sessions than in the other datasets. Due to computational resource constraints, we used only the portion of the dataset from the beginning of September to the end of October (two months) as the training set, and the subsequent day (the first of November) as the testing set. We refer to this dataset as TMALL.
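The day-granularity sessionization described above can be sketched as follows (a minimal illustration; field names and record layout are our own):

```python
from collections import defaultdict

def sessionize_by_day(events):
    """Group (user_id, day, item_id) event logs into sessions: all events
    by the same user on the same day form one session. Events are assumed
    to be already ordered within each (user, day) pair."""
    sessions = defaultdict(list)
    for user, day, item in events:
        sessions[(user, day)].append(item)
    return dict(sessions)

logs = [("u1", "2014-09-01", 10), ("u1", "2014-09-01", 11),
        ("u2", "2014-09-01", 10), ("u1", "2014-09-02", 12)]
print(sessionize_by_day(logs))
```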

Retail Rocket
Finally, the Retail Rocket dataset was collected and published by the Retail Rocket e-commerce personalization company (https://retailrocket.net/ accessed on 1 August 2022) to motivate studies in the field of recommendation systems. The dataset includes user behavioral data from a real-world e-commerce website over ≈4.5 months, such as views, add-to-carts, and transactions, in addition to item identifiers and their properties in a hashed format. Only the view and add-to-cart events were considered in our experiments, while transaction events were discarded. This dataset and the CIKMCUP dataset are characterized by a small number of clicks compared to the number of existing unique items. Additionally, they have fewer sessions than both the TMALL and RECSYS datasets. We used the last two days' sessions as the testing set and the rest of the sessions as the training set. We refer to this dataset as ROCKET.
During the preprocessing of all datasets, we filtered out sessions of length one, as they do not include enough items for evaluation. Additionally, in all experiments, we filtered out clicked items in the test sets that do not exist in the corresponding training sets. Multiple consecutive clicks on the same item in one session were replaced by a single click on that item; this step was performed because it does not make sense to recommend the item currently viewed by the user, and it is always preferable to recommend new related items. For example, a session with the click sequence (1, 1, 1, 2, 2, 3, 4, 4, 1) is preprocessed to (1, 2, 3, 4, 1). Ignoring this step, as in previous studies [18,20], falls in favor of baseline methods such as nearest neighbors and frequent pattern mining over neural-based methods. We kept all the items in the training set and did not remove low-frequency items. During the evaluation, we computed the accuracy of recommendations on all the possible splits starting from the first click of every single session. For instance, for a session represented by the vector (1, 2, 3, 4), we evaluated the recommendations on the session of a single click (1) with target item 2, the (1, 2) session with target item 3, and the (1, 2, 3) session with target item 4. Finally, the average performance measurements are reported. The statistics of the datasets after preprocessing are summarized in Table 3.
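The preprocessing and evaluation-split procedure described above can be sketched as follows (a minimal illustration, not the exact pipeline code used in the experiments):

```python
def collapse_repeats(session):
    """Replace runs of consecutive clicks on the same item with one click."""
    out = []
    for item in session:
        if not out or out[-1] != item:
            out.append(item)
    return out

def preprocess(sessions):
    """Collapse repeated clicks, then drop sessions of length one."""
    cleaned = [collapse_repeats(s) for s in sessions]
    return [s for s in cleaned if len(s) > 1]

def evaluation_splits(session):
    """All (prefix, target) pairs, starting from the first click."""
    return [(session[:i], session[i]) for i in range(1, len(session))]

print(collapse_repeats([1, 1, 1, 2, 2, 3, 4, 4, 1]))  # [1, 2, 3, 4, 1]
print(evaluation_splits([1, 2, 3, 4]))
```

The last call reproduces the paper's example: the session (1, 2, 3, 4) yields the evaluation cases ((1) → 2), ((1, 2) → 3), and ((1, 2, 3) → 4).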

Experiments Description
Our study included eight different sets of experiments repeated for each model on all the evaluated datasets. The aim of these experiments was to answer the following research questions (RQs):

• RQ1: Different training session lengths: We aimed to evaluate which models can learn from short sessions during the training process and which ones can make better use of lengthy sessions to accurately identify the user's interest. To answer this RQ, we divided each training dataset into three splits according to session length. We kept only sessions of length <5 in the first split, ≥5 and <10 in the second split, and ≥10 in the third. We chose these thresholds because it is challenging to determine a correct session context with fewer than 5 clicked items; sessions of length ≥5 and <10 have an adequate number of items to determine the user's preferences; and sessions with ≥10 items have more than enough items to model the user's preferences, but are also more likely to contain a drift in the session context that may or may not be captured by the model. This selection also follows Liu et al. [16], who chose a threshold of 5 to distinguish between short and long sessions, as all datasets have an average session length close to 5, as shown in Table 3. All the models were trained on each split of the training sessions. However, the evaluation was performed on the testing splits extracted from the original datasets without further pruning.
• RQ2: Different testing session lengths: In this experiment, we measured the models' performances with different session lengths during inference. The main target was to observe the model performance at the start of the session and after an adequate number of interactions from the user. This experiment helps determine which models cannot perform well on long sessions, which usually include drifts in user preferences.
In contrast to the previous RQ, we fixed the training dataset and divided the test sets into three splits of sessions of maximum length 5, maximum length 10, and length >10, respectively. All models were trained on the same training set and evaluated on each test split.
• RQ3: Prediction of items with different popularity in the training set: In this set of experiments, we investigated how the models' performance changes with respect to the items' frequency in the training set. Answering this RQ can help determine which models can learn well from less frequent items in the training set and accurately predict them during evaluation. Additionally, this experiment shows how biased the models are towards predicting the more popular items. We divided the test sets of each dataset to keep only items whose frequencies do not exceed a specific threshold in the training set. The frequency thresholds used for the different splits were (50, 100, 200, 300, >300) for the RECSYS and TMALL datasets, and (10, 30, 60, 100, >100) for the CIKMCUP and ROCKET datasets. This categorization was chosen based on the distribution of the items' frequency in the training sets, such that each frequency range contains an adequate number of items (>1000) covering the whole range of frequencies, as shown in Figure 7. All models were trained on the same training set, and the evaluation metrics were computed on each set of items in the testing set satisfying the above frequency threshold conditions.
• RQ4: Effect of data recency: In this experiment, we divided our training sets into three portions. Each portion was collected during a period equal in length to the others but different in terms of creation date (recency). The first portion contains the most recently collected sessions, the second the oldest sessions, and the third a mixture of the more recent half of the first portion and the older half of the second portion.
All models were evaluated on the same test set. We aimed to show whether it is crucial to take dynamic time-series modeling into account while fitting the different models, or whether seasonal changes cannot cause a significant drift in user preferences in the e-commerce domain. For example, in fashion e-commerce, users tend to look for light clothes during summer, which makes it unlikely that a model will learn useful preferences from data collected during winter, when users usually prefer heavy clothes. Additionally, it is essential to determine how data recency could lead to data leakage problems that result in deceptive model accuracy during training.
• RQ5: Effect of training data size: With this experiment, we aimed to observe the models' performance on different dataset sizes. Answering this RQ can help understand the suitable dataset size, relative to the number of available items in a dataset, such that model performance is not profoundly affected; consequently, computational resources can be saved without impairing the models' performance. We randomly selected different splits of the original training datasets such that the sizes of these splits are equal to 1/P of the original training set size, where P ∈ {2, 8, 16, 64, 256}.
• RQ6: Effect of training data time span: To check whether the results obtained from RQ4 and RQ5 generalize, we ran an experiment similar to RQ5; however, instead of selecting random portions of the training sets, we divided them according to the time of data collection. For example, we used only the most recent sessions, collected during the last m days before the period used as the testing set, to train the model. We used m ∈ {2, 7, 14, 30}, aiming to find the time span required to train the different models and achieve the best performance according to dataset properties such as the number of items and the average session length.
• RQ7: Items popularity and coverage: What are the coverage and popularity of the items predicted by each model on the fixed dataset splits? Given the models' predictions, we computed the coverage and popularity of these predictions out of the total number of unique items in the dataset. These measurements provide a good indication of whether a model tends to predict only the most frequent items or covers the space of items to a large extent. The coverage of the model predictions is a measure of what is called aggregate diversity and of how the model adapts to different sessions' contexts [52]. A small coverage value shows that the model always recommends a small set of items for all users, such as the most popular or frequent items in the training set. A high coverage shows that it recommends a wide range of items across different session contexts [20]. Coverage can also be confirmed by the popularity metric, which computes the average frequency of occurrence of the predicted items in the training set, normalized by the frequency of the most popular item. We used the full original training and testing sets of the CIKMCUP and ROCKET datasets. Additionally, we used a random split of [...].
• RQ8: Computational resources: In this experiment, we reported the different computational resources required by each model, to observe the trade-off between a model's performance and its complexity and whether more computationally expensive models are worth using over simpler ones. Additionally, we aimed to assess the suitability of the different models for making real-time predictions in practice.
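The length-based training splits used in RQ1 (and, analogously, the length-based test splits in RQ2) can be sketched as follows (a minimal illustration with our own function name):

```python
def split_by_length(sessions, short_max=5, medium_max=10):
    """Partition sessions into short (<5), medium (>=5 and <10), and
    long (>=10) splits, following the thresholds used in RQ1."""
    splits = {"short": [], "medium": [], "long": []}
    for s in sessions:
        if len(s) < short_max:
            splits["short"].append(s)
        elif len(s) < medium_max:
            splits["medium"].append(s)
        else:
            splits["long"].append(s)
    return splits

data = [[1, 2], [1, 2, 3, 4, 5, 6], list(range(12))]
buckets = split_by_length(data)
print({k: len(v) for k, v in buckets.items()})
```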
The properties of all training and testing splits used in each experiment are summarized in Table A1. We used an early stopping approach during training, with a validation split of 10% of the training split, for the deep learning models. Additionally, as hyper-parameter optimization is an essential part of determining model performance, we ran a random search of 20 iterations for all models on each dataset to tune the most effective hyper-parameters, selected based on the authors' suggestions or on our own experiments. We kept the rest of the networks' hyper-parameters at their default values, as mentioned in the corresponding papers. We selected the hyper-parameter settings achieving the highest HR@20 for each dataset. The list of tuned hyper-parameters for each model, along with their ranges, can be found in Table A9.
Our work was carried out in part in the High Performance Computing Center of the University of Tartu (https://hpc.ut.ee/ accessed on 1 August 2022). When a graphical processing unit (GPU) was used for the neural network models, it was an NVIDIA Tesla P100. The memory size was limited to 20 GB of RAM, and Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20 GHz processors with up to 30 cores were allocated from the computing center to run the models that do not support GPUs. When reporting training and testing times and memory consumption in our results, we did not use any GPUs, to keep the comparison fair among all models. However, all neural-based models support the use of GPUs, which is a significant advantage over the other algorithms. The source code used in this study and the logs of the results are publicly available (https://github.com/mmaher22/iCV-SBR accessed on 1 August 2022).

Evaluation Metrics of Models Performance
We measured the performance of all models in our experiments using several evaluation metrics:

1. Hit Rate (HR@K) is the rate of matching the correct item clicked by the user with any item in the list of top K predictions made by the model. Its value is 1 for a session if the target item is present among the top K predictions and 0 otherwise. For a dataset D, it can be written as

HR@K = (1/|D|) Σ_{s ∈ D} 1(rank_s ≤ K),

where rank_s is the rank of the target item of session s among the model's predictions.
2. Mean reciprocal rank (MRR@K) is the average of the reciprocal ranks of the target items whose scores are among the top K predictions made by the model; otherwise, the reciprocal rank is set to zero [14]:

MRR@K = (1/|D|) Σ_{s ∈ D} (1/rank_s if rank_s ≤ K, else 0).

3. Item coverage (COV@K) measures how well the model predicts a variety of items rather than being biased towards a small subset of frequent items. With K the number of top predictions considered for each session, it is the ratio of the number of unique items predicted by the model to the total number of items in the training set:

COV@K = |unique items among the top K predictions| / |items in the training set|.

4. Item popularity (POP@K) represents how much the model tends to predict popular items. This metric can reveal models that achieve good performance by exploiting the popularity of certain items in the training set instead of recommending items that match the session context and the user's preferences. With K the number of top predictions considered for each session, it is the ratio between the average training-set frequency of the predicted items and the frequency of the most popular item in the training set.
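Assuming each evaluation case is a (ranked predictions, target item) pair, the four metrics can be sketched as follows (a minimal illustration, not the exact evaluation code):

```python
def hr_at_k(results, k):
    """Hit rate: fraction of cases whose target is in the top-k list."""
    return sum(t in preds[:k] for preds, t in results) / len(results)

def mrr_at_k(results, k):
    """Mean reciprocal rank of the target within the top-k, else 0."""
    total = 0.0
    for preds, t in results:
        if t in preds[:k]:
            total += 1.0 / (preds[:k].index(t) + 1)
    return total / len(results)

def cov_at_k(results, k, n_train_items):
    """Fraction of training items appearing in any top-k prediction."""
    predicted = {i for preds, _ in results for i in preds[:k]}
    return len(predicted) / n_train_items

def pop_at_k(results, k, freq):
    """Average training frequency of predicted items, normalized by
    the frequency of the most popular training item."""
    flat = [freq.get(i, 0) for preds, _ in results for i in preds[:k]]
    return (sum(flat) / len(flat)) / max(freq.values())

results = [([3, 1, 2], 1), ([5, 4, 2], 9)]
print(hr_at_k(results, 3))   # 0.5: target 1 is ranked, target 9 is not
print(mrr_at_k(results, 3))  # 0.25: reciprocal rank 1/2 in the first case
```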

Results
In this section, we report and discuss the results obtained from our extensive evaluation of the different models, attempting to answer the research questions proposed in Section 4.

RQ1: Different Training Session Lengths
In the presented diagrams, we report the HR and MRR of each model on all the evaluated datasets. In contrast to some previous work, such as that by Ludewig et al., who used a prediction cut-off threshold of 20 recommendations [18], we set the prediction cut-off to five, as it is more reasonable to recommend around five items to the user in a real use case; 20 recommendations is a large number for real-life e-commerce scenarios. However, full results for different prediction cut-off thresholds (1, 3, 5, 10, 20), in this experiment and all the following ones, can be found in our online repository (https://github.com/mmaher22/iCV-SBR/tree/master/Results accessed on 1 August 2022).
As shown in Figure 8, most neural models outperform the non-neural baseline models, except for S-POP, which is the best model on the TMALL dataset due to the nature of that dataset, where clicked items are repeated within the same session more frequently than in the other datasets. TMALL has an average item frequency per session of 1.204, compared with 1.097, 1.115, and 1.103 for RECSYS, CIKMCUP, and ROCKET, respectively. This difference means that the same item is more likely to appear multiple times in a single session in the TMALL dataset. Furthermore, VSKNN has a relatively good performance on the ROCKET dataset. However, the top three models in terms of either HR or MRR on the RECSYS and CIKMCUP datasets are always neural-based, and two neural models are always among the top three performing models on the TMALL and ROCKET datasets. NextItNet has the highest performance on the RECSYS dataset, which is characterized by the highest average item frequency among the used datasets. This property suits convolutional networks, which require a large number of sessions covering all items to model them correctly. However, there is a small decrease in the performance of CSRM, SRGNN, and STAMP when trained on short sessions, which is apparent on RECSYS and TMALL, whose testing sets mostly contain intermediate to long sessions. On the other hand, GRU4Rec+ has the highest performance when trained on intermediate and short sessions; in contrast, this performance degrades on long sessions, since a drift in the user preferences is more likely to occur. Although NARM uses a network similar to GRU4Rec+, it performs better thanks to the attention layers in its architecture. This improvement was significant compared with the baseline GRU4Rec network, which suffers from the vanishing gradient problem, especially in long sessions of 11-17 clicks [39].
Regarding baseline models such as S-POP, AR, and SR, there is a large and consistent change in their performance as the training sessions' length changes, especially on the RECSYS dataset. However, these models have comparable performance to neural-based models on datasets where the average item frequency is small and insufficient for learning good item representations, such as CIKMCUP, ROCKET, and TMALL. Overall, baseline methods including S-POP, VSKNN, and SMF are more suitable for short sessions, when there are not enough session events to represent the user's preferences. As a session includes a larger number of events, more complex models become better than the baselines. This improvement is only apparent for intermediate sessions compared with short sessions; long sessions yield similar or slightly worse performance, since user preferences are more likely to change in long sessions, and the number of these sessions is not sufficiently large to train the neural models well, especially in the CIKMCUP and ROCKET datasets. Nevertheless, the total number of events in all training splits of these datasets is close, since long sessions have a higher average session length than short ones, as shown in Table A1. In contrast, the TMALL dataset has many more events in the long-session split than in the intermediate and short ones. This difference in split size could contribute to the noticeable improvement in the performance of neural models trained with long sessions, such as NARM, as also discussed in Section 5.5.

RQ2: Different Testing Session Lengths
While training with different session lengths gives useful insights into the performance of the models, this performance is still highly correlated with the length of the testing sessions. For example, if a model is trained with short sessions while most of the testing sessions are long, it will not make accurate predictions. Figure 9 shows the performance of the different models using the same training set while choosing subsets of testing sessions according to their length. S-POP's performance increases as sessions grow longer, since longer sessions have a higher probability that the user clicks again on an item they liked previously in the same session. However, in lengthy sessions, personalization still pays off, since the session contains adequate information to precisely model the user's preferences [22]. The performance of SR and AR degrades consistently and by a large degree on all datasets as the session length increases. This impairment is due to the small window of interest that both algorithms use while computing frequent item patterns, which means that they cannot make use of longer sessions. Although the window size for the computation of frequent patterns could be increased, the computational complexity grows exponentially, making long window sizes infeasible. The same effect holds for VSKNN when the selected weighting function for the clicked items in the session gives much higher weights to recent items, such as the quadratic, multiplicative inverse, and log weighting functions; the weights given to the remaining items outside a specific window are then almost negligible. Additionally, the accuracy of almost all the neural models decreases on long sessions, which shows that modeling the user's drift of interests during the same session is still an open challenge.
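The recency-biased weighting schemes mentioned for VSKNN can be illustrated as follows (the exact functional forms below are plausible examples of quadratic, inverse, and log weighting, not necessarily those used in the evaluated implementation):

```python
import math

def position_weights(session_len, scheme="quadratic"):
    """Weight for each click position (1-based; the last click is the most
    recent). Recency-biased schemes give near-zero weight to early clicks,
    which effectively restricts neighbor matching to a recent window."""
    weights = []
    for pos in range(1, session_len + 1):
        if scheme == "quadratic":
            w = (pos / session_len) ** 2
        elif scheme == "inverse":
            w = 1.0 / (session_len - pos + 1)
        elif scheme == "log":
            w = 1.0 / math.log2(session_len - pos + 2)
        else:
            w = 1.0  # uniform weighting
        weights.append(w)
    return weights

# In a 10-click session, the first click gets a tiny weight under the
# quadratic scheme, while the last (most recent) click gets full weight.
w = position_weights(10, "quadratic")
print(round(w[0], 3), w[-1])
```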
This decrease in performance is very clear for NextItNet, Item2Vec, and GRU4Rec+ on all datasets, while it is slightly less observable for CSRM, STAMP, and NARM, which could be due to the memory and attention mechanisms applied in these models. NextItNet no longer leads on the RECSYS dataset, in contrast to the results obtained in RQ1. This drop is due to the smaller training split used in the experiments of RQ2 compared with RQ1. This claim is also confirmed in the experiments of RQ5 and RQ6, where the deep architecture of convolutional layers in NextItNet requires many instances per item to model it adequately, compared with attention networks.
Overall, achieving excellent performance on long sessions is still a challenging problem for most of the models. On the one hand, pattern mining models do not pay attention to long sessions: they only make use of a small window around the item of interest while ignoring the interactions made earlier by the user in the same session. On the other hand, neural models still cannot accurately detect user drifts and might suffer from vanishing gradient problems in RNNs, especially for very long sessions [30,39]. Interestingly, much research has focused on solving the cold-start problem that occurs when users have not made enough clicks to capture their preferences; however, it seems that more attention also needs to be paid to improving recommendation performance for long sessions.

RQ3: Prediction of Items with Different Popularity in the Training Set
Figure 10 shows the performances of the different models when predicting items below a specific threshold of occurrences in the training set. This experiment shows how model performance is affected by the number of items' occurrences during fitting. We used frequency thresholds of (<50, <100, <200, <300) for the RECSYS and TMALL datasets, which have a higher average item frequency, and (<10, <30, <60, <100) for the CIKMCUP and ROCKET datasets. The performances of AR, SR, SMF, and GRU4Rec+ always improve with an increasing frequency threshold across all datasets. To a lesser extent, NextItNet and SRGNN gain slightly in performance with an increasing frequency threshold, though this gain stops at very high frequencies, as in the TMALL and ROCKET datasets. On the other hand, the performance measurements of NARM, STAMP, CSRM, Item2Vec, and VSKNN do not follow a consistent trend while increasing the frequency threshold. In general, some models, such as AR and SR, need more item occurrences in the training set to be able to model these items accurately.
In contrast, other models, such as NARM, STAMP, and CSRM, which all include various attention mechanisms, do not need this high frequency of occurrences; it is enough for an item to be represented only a few times in the training set.

RQ4: Effect of Data Recency
Using session-based recommendation models in e-commerce always requires being up-to-date with sufficiently recent data to model current user trends. In this experiment, we tested this hypothesis by training the models using the sessions collected from the most recent five days, the five earliest days, and a mix of half of the recent split and half of the old split (for ROCKET, we used ten days instead of five, as the dataset is smaller than the rest). The test set was fixed across all these training splits. Figure 11 shows that it is always preferable to train the models using the most recent sessions. Consistently across all datasets, old sessions yield observably lower performance for almost all the models than the recent and mixed splits. Although there is no large difference between the models' performance on the recent and mixed splits, especially on the RECSYS and ROCKET datasets, there is still a small difference in favor of the recent splits for CSRM, SRGNN, and NARM on the CIKMCUP and TMALL datasets. Surprisingly, VSKNN is the only model with higher performance on the old splits in two out of the four datasets and comparable performance on the other two. This behavior could be interpreted as the algorithm caring only about neighboring sessions with exactly the same items as those clicked by the user, which indeed results in better recommendations if matching sessions are found. Additionally, VSKNN has a better overall performance than other models on the ROCKET and CIKMCUP datasets, as both are characterized by a lower average item frequency than the RECSYS and TMALL datasets, as discussed in RQ3. It is worth mentioning that, although a similar trend was observed for the models on each dataset, the differences are not equal across datasets because some datasets span different time periods.
For example, GRU4Rec+ has a higher HR@5 on the recent split than on the old split by ∼7% in the RECSYS dataset, which spans six months, whereas this difference is just ∼0.5% in the TMALL dataset, which spans two months. The time difference between the old and recent splits in RECSYS is thus much bigger than in TMALL, which could explain the differing performance gaps among the datasets.
These results confirm that we should account for time-series dynamic modeling in the session-based recommendation to model the trends in users' preferences. Additionally, in case the collected data are much older, there is a high chance that the nearest neighbor algorithms outperform other models.

RQ5: Effect of Training Data Size
In this experiment, we aimed to determine the suitable training data sizes for datasets with various characteristics, and we investigated how the evaluated models perform with these different training set sizes. We divided ROCKET and CIKMCUP into splits of (1/2, 1/8, 1/16, 1/64) of the original training set size. TMALL and RECSYS were divided into (1/8, 1/16, 1/64, 1/256) splits, as they are bigger. We refer to these splits as (large, medium, small, very small), respectively. Figure 12 shows a heat map of the HR@5 and MRR@5 of the different models as the training set size increases. S-POP and VSKNN are the only algorithms that do not benefit from larger data sizes: VSKNN achieved its highest performance on all four datasets when using the very small training portions, and S-POP behaves the same except on the RECSYS dataset. The performance of all neural models increases when using more training sessions, which agrees with the data-hungry nature of deep learning models. However, SRGNN, NextItNet, and GRU4Rec+ consistently improve with increasing training data sizes on all datasets, while NARM, STAMP, and CSRM improve less than the former models. Although there is a small improvement in the performance of AR, SR, and SMF on the RECSYS dataset, this improvement is not clear enough on the other datasets to generalize the observation.

RQ6: Effect of Training Data Time-Span
The insights from RQ4 and RQ5 about the importance of training data recency and size should reveal sufficient information about the length of the time span required to collect training sessions. Similarly to RQ5, Figure 13 illustrates a heat map of the HR@5 and MRR@5 metrics when training on splits of the most recent x days from the full training set for each dataset, where x ∈ {2, 7, 14, 30}. Similar to what was shown for RQ5 in Figure 12, VSKNN and S-POP still perform best when trained on a time span of just two days. Additionally, the performance of the neural models improves with an increasing dataset time span. However, on the RECSYS dataset, the improvement almost ceases, as this dataset has a small number of items, and the number of sessions in a two-day time span is sufficient to accurately model the context of the sessions.

RQ7: Items Coverage and Popularity
Item coverage and popularity are good indications of how far models cover the space of training set items when making recommendations. A model with small coverage and high popularity tends to predict the same items for all users, regardless of the session's context. Figure 14 shows the natural logarithm of item coverage (COV@5) and popularity (POP@5) using the same training and testing splits for each of the datasets. In general, a similar trend is observed across all the datasets when comparing the baselines and neural models. For instance, S-POP has the lowest coverage and highest popularity since it only predicts the most frequent items. On the other hand, Item2Vec has the highest coverage and lowest popularity. However, this does not match most real-life scenarios, where there are usually some popular items that users click on frequently, such as items with high discounts. Accordingly, Item2Vec still has the lowest performance in terms of HR and MRR, since its output vectors are usually dispersed in the vector space, and simple distance measurements are insufficient to capture the similarity between the session context and the item vectors [53]. Regarding the baselines, AR, SR, and VSKNN have quite similar coverage and popularity, except on CIKMCUP where VSKNN has higher item coverage; SMF has smaller coverage and popularity. Regarding the neural models, GRU4Rec+ has slightly higher coverage in its predictions, followed by NARM, STAMP, SRGNN, and CSRM; however, these differences are too small to be clearly observed. CSRM has a memory for storing the most recent sessions and predicts items based on neighborhoods within these sessions. As such, it is more biased towards a subset of recently clicked and popular items than the other models.
On the other hand, NextItNet has the lowest item coverage with popularity comparable to NARM and CSRM, which suggests that this model is more likely to overfit to a small subset of items and needs better regularization. NextItNet is characterized by the convolutional filters in its architecture, which require many more occurrences per item to generalize well compared with attention-based networks [54]. SRGNN and STAMP have smaller average popularity across their predictions than the other neural models over all datasets. Additionally, NARM has high item coverage and high popularity with relatively high accuracy according to the HR and MRR metrics. This suggests that NARM has the advantage of recommending a wide range of items according to the different sessions' contexts. However, SRGNN and STAMP could still be preferable when their accuracy is similar to that of other models, as they cover more unpopular items in their recommendations. Detailed results for other prediction cut-off thresholds in this experiment can be found in Table A8 in Appendix B.
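The coverage and popularity measures used in this experiment can be computed directly from the top-k recommendation lists. The sketch below follows the common COV@k / POP@k formulations; the exact normalization used in the paper may differ:

```python
from collections import Counter

def coverage_at_k(rec_lists, catalog_size):
    """COV@k: fraction of the item catalog appearing in any top-k list."""
    recommended = {item for recs in rec_lists for item in recs}
    return len(recommended) / catalog_size

def popularity_at_k(rec_lists, train_counts):
    """POP@k: average training-set popularity (normalized by the most
    frequent item's count) of all recommended items."""
    max_count = max(train_counts.values())
    pops = [train_counts.get(item, 0) / max_count
            for recs in rec_lists for item in recs]
    return sum(pops) / len(pops)

# Hypothetical top-5 lists for three test sessions, plus training click counts.
recs = [[1, 2, 3, 4, 5], [1, 2, 6, 7, 8], [1, 9, 10, 11, 12]]
counts = Counter({1: 100, 2: 50, 3: 10, 4: 10, 5: 5, 6: 5, 7: 5,
                  8: 5, 9: 1, 10: 1, 11: 1, 12: 1})
print(coverage_at_k(recs, catalog_size=20))   # 12 distinct items of 20 -> 0.6
```

A model like S-POP would produce identical lists for every session (low COV@k, high POP@k), while a dispersed model like Item2Vec scores the opposite way.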

RQ8: Computational Resources
A short testing time is quite important for making predictions quickly, as recommendations must be provided to the user in real time after a specific action. At the same time, training computational complexity matters for the scalability of the model and the ability to retrain it frequently. Figure 15 summarizes the computational complexity of the different models during both the training and testing phases on the RECSYS and CIKMCUP datasets. S-POP, AR, SR, and VSKNN are instance-based algorithms in which the learning process occurs during inference by iterating over the training set for each test instance. Consequently, the computational resources needed to train these models are negligible. On the other hand, they take a very long time during inference, which means they are not suitable for making real-time predictions. However, they do not consume much memory, as only the dataset and very few parameters need to be stored. Item2Vec and SMF have quite long training and testing times. Although Item2Vec is considered a neural model, during inference the similarity distance is computed between the predicted session embedding vector and the item vectors, which takes a long time. Additionally, SMF is a matrix factorization algorithm that performs heavy matrix multiplication during both training and inference; these operations are computationally expensive in terms of both time and memory. Regarding the neural models, all of them have relatively high training time and memory requirements; however, they are still characterized by short inference times, as they need only a single forward pass to make predictions for one batch of instances. This suggests that neural-based models are suitable for real-time predictions.
The differences in the training and inference times of the neural models are proportional to the size of each network, the number of layers, and the types of these layers. STAMP has the lowest training and testing time, as it is the smallest model, followed by GRU4Rec+, CSRM, and then NARM and SRGNN in ascending order. NextItNet is characterized by multiple convolutional layers and a relatively large model size, with high memory consumption due to the mapping of sessions into image-like representations. However, NextItNet has a shorter testing time than NARM, SRGNN, and CSRM due to the weight-sharing properties of the convolutional layers. Additionally, SRGNN, a graph neural network, has relatively large memory and time consumption due to the large graph created by mapping items and sessions into corresponding nodes and edges. Overall, all neural models are compatible with the requirements of real-time predictions; however, they need ample computational resources during training with back-propagation compared with the simple baseline algorithms.

Interpretable Meta-Model for Best Model Predictions
Based on our empirical study, we trained a decision tree with a maximum depth of 6 levels and a minimum impurity split of 0.3 to keep it simple and interpretable. This tree model predicts the best-performing model based on dataset characteristics. We used all the experiments carried out in our study to construct a new tabular dataset, whose features include the number of sessions, average session length, and average item frequency in both the training and testing sets. We set the target variable to the best-performing model among all the evaluated models according to the MRR@5 evaluation metric. These targets are distributed as 14, 36, 7, 4, 10, 5, and 10 instances for S-POP, VSKNN, NARM, STAMP, NextItNet, SRGNN, and CSRM, respectively; the remaining models did not perform best on any of the data splits used in our study. Our dataset was divided into ten cross-validation training and hold-out splits. The same decision tree was fitted to each training split, achieving an average accuracy of 87.17% and 87.5% on the training and hold-out splits, respectively. A visualization of one of these fitted trees and the class distribution of the dataset can be found in Figure A1 in Appendix B.
The most important features for determining the best-performing model turned out to be, in order: the average item frequency in the training set, the average session length in the testing set, the number of sessions in the training set, and the number of items in the training set. This simple tree model supports our previous findings on how different dataset characteristics affect the performance of different models and the choice of the best one. In practice, such interpretable models can help the user shorten the list of models that are likely to perform well given the characteristics of a dataset. Additionally, they can help assign weights to different models' predictions when an ensemble of multiple models is used for recommendation. This experiment shows the potential of finding similar interpretable models that help develop rules guiding the user to choose suitable models for a specific dataset.
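A meta-model of this kind can be sketched with scikit-learn as below. The feature matrix and labels here are synthetic placeholders (the paper's meta-dataset is not reproduced), and `min_impurity_decrease` is a modern stand-in for the deprecated impurity-split parameter, so this is an illustrative approximation rather than the study's exact configuration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical meta-dataset: one row per experiment; columns would be
# dataset statistics such as #sessions, avg session length, avg item
# frequency (train/test). Labels: best model by MRR@5 for that split.
X = rng.random((86, 6))
y = rng.choice(["VSKNN", "S-POP", "NARM"], size=86)

# max_depth=6 as in the study, to keep the tree interpretable.
meta_model = DecisionTreeClassifier(max_depth=6, min_impurity_decrease=0.0,
                                    random_state=0)
meta_model.fit(X, y)

# Feature importances expose which dataset statistics drive the choice,
# mirroring the ordering reported above for the real meta-dataset.
ranking = np.argsort(meta_model.feature_importances_)[::-1]
predicted_best = meta_model.predict(X[:1])[0]
```

In practice one would replace the random `X`/`y` with the per-experiment statistics and winners collected from the benchmark runs.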
Our study suggests that using different models according to the different datasets' characteristics could lead to a better performance in the session-based recommendation task. Similar approaches to our decision-tree meta-model can predict which models will perform well with different dataset properties. This information can help combine the predictions out of multiple candidate models, which will consequently improve the final set of recommended items. Our dataset can also be easily extended with more e-commerce datasets that can increase the meta-model accuracy and reliability of predicting the best models.

Overall Performance
To judge the overall performance of the different models, we use box-plots in Figure 16 to summarize the ranks of the examined models across the evaluated datasets. Each chart compares the ranks of the models in all the experiments related to one particular dataset. The model with the best performance (highest HR@5/MRR@5) receives a rank of one, and the one with the worst performance a rank of twelve. In general, NARM, SRGNN, and CSRM are the top three neural-based models in terms of both HR@5 and MRR@5 on all datasets. VSKNN performs well on the ROCKET dataset, which has the smallest average session length among all datasets. In contrast with previous studies [18,20], VSKNN performs worse than expected. When we investigated the reasons for this impairment, we found that the preprocessing steps carried out in our study, namely removing consecutive clicks on the same items and keeping low-frequency items, together with the different evaluation procedures, are the main reasons for the differences from these studies. S-POP performs best on the TMALL dataset, which has the largest number of items and average session length. NextItNet performs well only on the RECSYS dataset, which has the smallest number of items and the largest average item frequency in the training set, meaning that most items are represented many times in the training set. In Table A8 in Appendix B, the different metrics are evaluated at prediction cut-off thresholds of 1, 5, and 20. Overall, the performance of neural-based models has greatly improved with the new architectures that have emerged in session-based recommendation. This improvement can also be observed when comparing the performance of neural-based models in older studies [18,19] with more recent ones [20].
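The per-experiment ranks summarized in the box-plots can be derived directly from the metric tables. A minimal sketch, with illustrative model names and HR@5 values (not results from the paper):

```python
def rank_models(scores):
    """Rank models within one experiment: rank 1 = highest metric value,
    rank len(scores) = lowest. Ties are broken by dictionary order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: rank for rank, model in enumerate(ordered, start=1)}

# Illustrative HR@5 values for one experiment on one dataset.
hr_at_5 = {"NARM": 0.41, "SRGNN": 0.40, "CSRM": 0.39,
           "VSKNN": 0.37, "S-POP": 0.22}
ranks = rank_models(hr_at_5)
print(ranks["NARM"], ranks["S-POP"])  # 1 5
```

Collecting such rank dictionaries over all experiments for one dataset yields the per-model rank distributions that the box-plots visualize.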
Hence, neural-based models now have accuracy comparable to the most developed nearest neighbor algorithms, yet more research is needed to further extend these models. When comparing the results obtained in this study with previous benchmarking studies such as [18][19][20], it can be observed that the relative overall performance of the evaluated methods across datasets is not always the same. Furthermore, the performance of the models across these studies changes with slightly different preprocessing steps. For example, the performance of the STAMP model drops considerably in [17] compared with the performance reported in [16] on the same evaluated datasets (RECSYS and CIKMCUP). For this reason, we find it difficult to draw general conclusions about the relative performance of the evaluated models. Additionally, very few real-world e-commerce datasets are publicly available. This suggests that understanding each model's performance with respect to different dataset characteristics could give the user more insights to help select the model that best suits their dataset in an interpretable way. Since having a large number of real-world datasets covering the whole space of these characteristics is practically impossible, we rely herein on artificially altered data splits created from the original datasets to better understand the performance of the evaluated models.

Main Insights
In this study, we investigated the current state-of-the-art neural-based models, in addition to other baseline algorithms, for the session-based recommendation task. Different experiments were carried out to answer a set of research questions covering different characteristics of the evaluated e-commerce datasets during both the training and testing phases. We used evaluation metrics covering the accuracy of the models' recommendations, the coverage of predicted items, and their average popularity. Additionally, the consumption of computational resources during training and inference was discussed in terms of suitability for real-life e-commerce portals.
In general, neural-based models with attention mechanisms, such as NARM and CSRM, recurrent models such as GRU4Rec+, and the simple VSKNN algorithm are the top-performing models across the majority of datasets with different characteristics. Additionally, the neural-based models have reasonable training time budgets and real-time processing during inference. Our results suggest that training data recency and size have an observable effect on prediction accuracy during inference. In e-commerce, dynamic time modeling is clearly a crucial aspect that needs to be further investigated and included in session-based algorithms to model general trends across different periods. Additionally, dataset characteristics such as average session length, average item frequency, and the total number of sessions do have an impact on the models' performance.
Baseline models, including nearest neighbors, still outperform all other models when training sets are relatively small or sessions are short. Additionally, most models' performance degrades on very long sessions, which suggests the need to improve the models in these cases by accurately detecting drifts in the user's preferences while making efficient use of older events in the session. In some cases, baseline algorithms outperform neural models; however, due to the computational complexity of these algorithms, especially at inference time, neural-based models are preferable for making real-time recommendations.

Challenges and Future Work
Despite the recent leap in the performance of neural-based methods for session-based recommendation, many challenges still need to be tackled. As future work, we suggest the following research points to help the community better understand the current models and tackle these challenges with new solutions:
• The e-commerce domain is usually characterized by frequent changes in item properties. For example, sale campaigns on some items can heavily affect users' interest. In addition, temporal changes, such as weather changes across seasons and trends in fashion items, can also lead to significant drifts in user preferences. Thus, it is quite important to look for models that can deal with different types of item attributes, whether nominal, numerical, or categorical, to improve prediction accuracy. Additionally, temporal changes in these attributes should be taken into account while predicting items. The possible effects of these trends were previously analyzed using an e-commerce use case [55]. However, this has barely been explored in the literature due to the lack of publicly available datasets with sufficient relevant information. Thus, more effort should be made to collect and publish such datasets, which can help the research community better analyze these trends across different domains and a wide range of scenarios.
• Most current solutions require unique item identifiers during the training and prediction phases. However, in many domains, a fixed set of items is not feasible. For example, in e-commerce, new items can be added and others can run out of stock. Retraining new models can thus be tedious, especially for large datasets, and models can severely suffer from the cold-start problem for recently added items. Research is needed on how the concept of dynamic item embeddings [56] can be utilized in both the training and inference phases instead of a fixed set of unique identifiers.
• Current session-based recommendation systems do not take into account the different user interactions made during the session. On the one hand, events such as item view, add-to-cart, and add-to-wishlist show different levels of user interest in the items. On the other hand, other interactions, including remove-from-cart and remove-from-wishlist, can account for drifts in user preferences. We believe that modeling such different interactions in a general way can improve session-based recommendation.
• Although tuning models' hyper-parameters can be computationally expensive, it is quite important for further studies to perform extensive experiments on the most promising models to investigate the effect of changing different architecture hyper-parameters. Previous studies have shown that usually only a few hyper-parameters have a significant impact on a model's performance [57,58]. As a candidate way to reduce the search space, many studies have introduced automated machine learning solutions, including neural architecture search and hyper-parameter optimization, which help carry out such studies more fairly and efficiently [59].
• The extensive evaluation of deep learning approaches for session-based recommendation in domains other than e-commerce, such as music playlist recommendation, has not yet been investigated. As domain characteristics can affect the properties of the collected data and the performance of different models, it is quite important to answer research questions similar to those investigated here in other domains. Additionally, other evaluation metrics could be computed to reveal new information about the behavior of the models and the diversity of their predictions [52]. For example, the Gini index could be evaluated over the models' recommendations to understand whether the predictions are biased towards some items more than others.
• In session-based recommendation, it is typical to train a model on sessions collected during a specific period and evaluate it on sessions collected in the subsequent days. Although this approach was followed throughout our study, it is also possible to experiment with different training-test splits in some experiments, such as RQ4, RQ5, and RQ6, where a random set of sessions is chosen from the entire list of existing sessions. One limitation of this work is that we used only a single training-test split due to the computational complexity of our experiments. Hence, as future work, evaluating some research questions using multiple training-test splits could confirm and generalize our main conclusions.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Acknowledgments:
The authors also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU.

Conflicts of Interest:
The authors declare no conflict of interest.

Table A8. Training using random (1/16, All, 1/16, All) portions of the full training set for the (RECSYS, CIKMCUP, TMALL, ROCKET) datasets, respectively. Evaluation is performed on the original testing sets. The HR, MRR, coverage, and popularity metrics with different cut-off thresholds are reported. The highest performance is shown in bold for each dataset.