Multi-View Graph Contrastive Neural Networks for Session-Based Recommendation

Huang, Pengbo; Wang, Chun

doi:10.3390/math13091530

Open AccessArticle

Multi-View Graph Contrastive Neural Networks for Session-Based Recommendation

by

Pengbo Huang

and

Chun Wang

^*

Faculty of Data Science, City University of Macau, Macau 999078, China

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(9), 1530; https://doi.org/10.3390/math13091530

Submission received: 18 March 2025 / Revised: 13 April 2025 / Accepted: 21 April 2025 / Published: 6 May 2025

(This article belongs to the Special Issue Advanced Image Processing and Computational Intelligence: Methodologies and Applications)

Download

Browse Figures

Versions Notes

Abstract

Session-based recommendation (SBR) aims to predict the next item a user may interact with based on an anonymous session, playing a crucial role in real-time recommendation scenarios. However, existing SBR models struggle to effectively capture local session dependencies and global item relationships, while also facing challenges such as data sparsity and noisy information interference. To address these challenges, this paper proposes a novel Multi-View Graph Contrastive Learning Neural Network (MVGCL-GNN), which enhances recommendation performance through multi-view graph modeling and contrastive learning. Specifically, we construct three key graph structures: a session graph for modeling short-term item dependencies, a global item graph for capturing cross-session item transitions, and a global category graph for learning category-level relationships. In addition, we introduce simple graph contrastive learning to improve embedding quality and reduce noise interference. Furthermore, a soft attention mechanism is employed to effectively integrate session-level and global-level information representations. Extensive experiments conducted on two real-world datasets demonstrate that MVGCL-GNN consistently outperforms state-of-the-art baselines. MVGCL-GNN achieves 34.96% in P@20 and 16.50% in MRR@20 on the Tmall dataset, and 22.59% in P@20 and 8.60% in MRR@20 on the Nowplaying dataset. These results validate the effectiveness of multi-view graphs and contrastive learning in improving both accuracy and robustness for session-based recommendation tasks.

Keywords:

session-based recommendation; graph neural networks; contrastive learning; multi-view graph modeling; attention mechanism

MSC:

68T07

1. Introduction

Recommendation systems play a crucial role in various online platforms, especially in helping users find the content of interest from massive amounts of information. Traditional recommendation methods such as CF [1] typically rely on long-term user–item interaction data. However, these methods often perform poorly when user information is insufficient or unavailable. To address this issue, session-based recommendation (SBR) [2] has been proposed, which aims to predict items of interest to users based solely on anonymous behavior sequences within the current session.

Early SBR methods can be broadly categorized into similarity-based models [1] and sequence-based models [3]. Similarity-based models rely on co-occurrence information of items within the current session but neglect sequential behavior patterns. Sequence-based models infer user preferences by predicting possible behavioral paths between items, yet their computational cost becomes prohibitive when the number of items is large. With the development of deep learning, recurrent neural networks (RNNs) have been successfully applied to session-based recommendation tasks, such as GRU4REC [4] and NARM [5]. These methods leverage pairwise item-transition relationships to model user preferences and have achieved significant progress. However, RNN models are limited in their ability to capture complex item-transition patterns.

Recently, graph neural networks (GNNs) [6] have been introduced into session-based recommendation. For example, SR-GNN [7] utilizes graph structures to capture long-range dependencies between items; GC-SAN [8] combines GNNs with self-attention mechanisms; FGNN [9] models the information flow between items through a multi-weight graph attention layer. Wang et al. proposed GCE-GNN [10], which expanded the scope of user preference modeling by integrating item-transition information from both the current and historical sessions, achieving significant performance improvements. MAE-GNN [11], on the other hand, employs multi-head attention mechanisms to capture user preferences from multiple dimensions. Despite the strong performance of GNN-based methods in SBR tasks, they still face certain limitations. For example, existing models may introduce irrelevant session information when modeling global user preferences, or suffer from imbalanced positive and negative samples during training. Moreover, the fusion of information from local and global graph embeddings still has room for improvement.

On the other hand, self-supervised learning (SSL), particularly contrastive learning (CL) [12], has gradually become a research hotspot in recommendation systems. Zhou et al. [13] applied SSL to recommendation systems and improved representation learning through mutual information maximization. Xia et al. [14] leveraged dual-view learning to model intra- and inter-session information, enhancing model performance. However, in SBR tasks, selecting the appropriate contrastive learning perspective remains a challenging problem due to the limited information available in each session.

To address these issues, we propose a novel Multi-View Graph Contrastive Learning Neural Network (MVGCL-GNN). Unlike traditional methods, MVGCL-GNN comprehensively models multi-view information by constructing a global category graph, a global item graph, and a session graph. The global category graph captures category-level correlations, the global item graph aggregates item-transition patterns across sessions, and the session graph models item dependencies within the current session. By introducing graph-based contrastive learning, MVGCL-GNN reduces noise interference and improves embedding quality when learning global and local item representations. Furthermore, MVGCL-GNN adopts an attention-based embedding fusion mechanism to enhance interactions between global- and session-level information, providing richer feature representations for user preference modeling.

The main contributions of this work are summarized as follows:

We significantly improve model performance in complex session scenarios by performing contrastive learning on the information from global and session graphs.
We introduce a unified modeling of category and item graphs, enabling multi-dimensional representations of user preferences.
Extensive experiments on multiple real-world datasets demonstrate that MVGCL-GNN outperforms state-of-the-art methods on various evaluation metrics.

2. Related Work

2.1. Session-Based Recommendation

Session-based recommendation (SBR) has gained significant attention for its ability to model user behavior within sessions and provide real-time recommendations. Early approaches primarily relied on classical data mining techniques, such as item-based collaborative filtering [1,15] and Markov models [3,13,16]. These methods focused on identifying sequential patterns or co-occurrence relationships among items in historical data. For example, Markov chains, such as FPMC [16], were used to model user behavior by capturing transitions based solely on the previous interaction. However, these methods suffered from limitations like data sparsity and an inability to model long-term dependencies.

With the rise of deep learning, recurrent neural networks (RNNs) [4] became popular in session-based recommendation due to their ability to capture sequential dependencies. GRU4Rec [4], proposed by Hidasi et al., was the first model to apply gated recurrent units (GRUs) to session data, significantly improving recommendation performance by learning session representations through sequential item interactions. Subsequently, models like NARM [5] combined attention mechanisms to better capture user intent, while STAMP [17] adopted a simpler multi-layer perceptron (MLP) architecture to model both short-term and long-term preferences. Recent models, such as SR-IEM [18] and DSAN [19], also leveraged attention mechanisms to prioritize relevant items within sessions.

Despite the success of sequential learning methods in capturing dependencies, they often overlooked non-sequential dependencies and more complex item relationships. To address this, graph neural networks (GNNs) have been widely adopted in session-based recommendation. By modeling session sequences as graph-structured data, GNNs capture intricate relationships between items. Models like SR-GNN [7] and GCE-GNN [10] construct session graphs and utilize GNNs to capture transition relationships between items within sessions. These approaches extend beyond traditional sequence modeling by considering item dependencies, thereby enhancing the ability to represent user interests effectively.

2.2. Category-Aware Recommendation

In session-based recommendation, incorporating item attributes (e.g., category information) has become an important research direction, as it captures latent dependencies often ignored by traditional item-based models. For instance, models like CaSe4SR [20] and CM-HGNN [21] construct heterogeneous graphs that combine item interactions with category information to improve recommendation performance by modeling relationships among item categories. These approaches further enhance the expression of user interests by considering the potential impact of items within the same category on user behavior.

The importance of category-aware recommendation is reflected in models such as CoCo-Rec [22], which employs attention mechanisms to model intra-category item relationships. Other models, such as CA-GNN [23], construct graphs to represent both intra-category and inter-category relationships, learning category-level embeddings to guide the recommendation process. These models demonstrate that category information can significantly improve recommendation accuracy by providing additional context beyond raw item interactions.

2.3. Self-Supervised Learning

Self-supervised learning (SSL), as an emerging learning method, has shown great potential in enhancing recommendation models, particularly by leveraging unlabeled data and auxiliary tasks. In session-based recommendation, SSL has been applied to improve representation learning processes, with contrastive learning emerging as a mainstream approach [12]. By introducing self-supervised objectives that maximize mutual information between different views of the data, SSL helps models learn robust and generalized embeddings.

For instance, models like SimGCL [24] and its variants (e.g., the SGL [25] model) employ contrastive learning to maximize the similarity between positive pairs while minimizing the similarity between negative pairs. This approach enhances the model’s ability to distinguish relevant and irrelevant items, thereby improving the quality of session representations. Contrastive learning has also been integrated with GNNs to further capture complex item relationships, enabling models to better differentiate subtle differences in user behavior.

Recent research has also explored using SSL in the context of multi-graph structures. Zheng et al. proposed Self-Contrastive Learning (SCL) [26], which optimizes the distribution of item representations in a simpler yet effective manner without requiring additional data augmentation. This enables the model to better adapt to complex graph structures and improve cross-graph item embedding learning. These methods effectively extract richer features and representations from unlabeled data through SSL, further boosting model performance.

2.4. Fairness and Trustworthiness in Graph-Based Recommendation

As session-based recommendation systems are increasingly deployed in real-world applications, concerns regarding fairness and trustworthiness have become more prominent. Fairness-aware recommendation seeks to ensure that different user groups receive equitable treatment in terms of exposure, relevance, and opportunity. Recent studies, such as the comprehensive survey by Jin et al. [27], have systematically explored algorithmic fairness in recommender systems, proposing frameworks to mitigate bias arising from data imbalance or model design.

Similarly, the robustness and reliability of graph neural networks (GNNs) in recommendation tasks have drawn significant attention. Zhang et al. [28] reviewed trustworthy GNN techniques, emphasizing aspects such as robustness to noise, explainability, privacy, and adversarial attacks. These considerations are especially relevant in graph-based recommender systems where noisy or adversarial data may distort recommendation outcomes.

Although our MVGCL-GNN model focuses primarily on improving accuracy and representation learning through multi-view graph modeling and contrastive learning, it currently does not explicitly address fairness or robustness concerns. However, due to its modular design, MVGCL-GNN can be extended with fairness-aware objectives or regularization strategies to mitigate potential biases in future work. Similarly, robustness to graph perturbations can be improved by integrating adversarial training or certified defenses, as suggested in recent trustworthy GNN studies.

2.5. Limitations of Previous Work

Despite the substantial progress made in session-based recommendation, many existing models face several notable limitations. Traditional methods such as GRU4Rec [4], NARM [5], and STAMP [17] focus primarily on modeling sequential dependencies within sessions. However, these approaches are limited in their ability to capture complex item relationships that span across sessions or reflect global user behavior patterns.

Graph-based models such as SR-GNN [7] address this issue by modeling session data as graphs, allowing them to better capture item transitions within sessions. GCE-GNN [10] further extends this paradigm by introducing a global item graph, enabling the model to learn from cross-session co-occurrence patterns. However, it still focuses mainly on structural item transitions and does not explicitly consider higher-level semantic information such as item categories, which may provide crucial context, especially in sparse or noisy sessions.

Some recent approaches have attempted to integrate additional side information. For instance, CaSe4SR [20] and CM-HGNN [21] leverage category-aware graph structures to enrich item representations, but they do not explicitly model the interplay between different views (e.g., session-level, item-level, and category-level) in a unified framework.

Moreover, most of the above methods do not consider utilizing self-supervised learning techniques to enhance embedding quality. Although methods like SGL [25] introduce contrastive learning into collaborative filtering, their adaptation to session-based multi-view graph contexts remains limited. These limitations highlight the need for a unified, multi-view modeling framework that can simultaneously capture heterogeneous dependencies and enhance embedding representation through self-supervised objectives.

To further illustrate the positioning of our proposed method in comparison with existing work, Table 1 summarizes representative models across four key dimensions: graph-based structure, multi-view capability, category awareness, and the use of contrastive learning. As shown in the table, MVGCL-GNN is the only method that simultaneously supports all four aspects, highlighting its comprehensive design and innovation.

3. Preliminaries

We first formally define the problem and then introduce three different graphs describing the global category graph, the global item graph, and the session graph, which are designed to capture item relationships at different levels.

3.1. Problem Statement

Session-based recommendation (SBR) aims to predict the next item a user might click on based on an anonymous sequence of actions within a short time frame. Let the item set be

V = {v_{1}, v_{2}, \dots, v_{m}}

, and each session be represented as

S = {v_{1}^{s}, v_{2}^{s}, \dots, v_{l}^{s}}

, where

v_{i}^{v}

denotes the i-th item clicked by the user in the session, and l is the session length. The objective of SBR is to generate a ranked list of items from V, where the top-N items (

1 \leq N \leq | V |

) are considered the most likely candidates for recommendation.

3.2. Global Category Graph, Global Item Graph and Session Graph

To effectively model the relationships between items at different levels, we constructed the global category graph, global item graph, and the session graph to learn item embeddings. As shown in Figure 1, the session graph models the adjacency relationships between items within the current session, helping to capture local item transition patterns. Meanwhile, the global item graph captures the co-occurrence relationships of items across sessions, providing global-level item embeddings. Figure 2 illustrates the global category graph, which models the co-occurrence relationships between categories at a global level. By aggregating item embeddings within each category, the global category graph effectively enhances the representation of category-level features, thereby improving the understanding of item context.

3.2.1. Global Item Graph

The global item graph is designed to capture co-occurrence relationships between items across all sessions, thereby learning global item embeddings. Unlike traditional deep learning-based sequential models, graph neural networks (GNNs) can capture higher-order item relationships through the construction of the global item graph.

Co-occurrence-based heuristics are adopted in our graph construction due to their strong interpretability and effectiveness in modeling implicit item transitions in session-based data. This approach provides a reliable structure for capturing collaborative patterns without requiring explicit user identifiers, making it especially suitable for anonymous or short-session settings common in real-world applications.

In the global item graph, the neighbors of each item node are defined by an

ϵ

-neighbor set, representing item co-occurrence across different sessions. Specifically, for an item

v_{i}^{p}

in a session

S_{p}

, its

ϵ

-neighbor set is defined as

\begin{matrix} N_{ϵ} (v_{i}^{p}) = \\ {v_{j}^{q} ∣ v_{i}^{p} = v_{i^{'}}^{q} \in S_{p} \cap S_{q}; v_{j}^{p} \in S_{q}; j \in [i^{'} - ϵ, i^{'} + ϵ]; S_{p} \neq S_{q}}, \end{matrix}

(1)

where

v_{i}^{p}

is both the i-th term in session

S_{p}

and the

i^{'}

-th term in session

S_{q}

, and

ϵ

is a hyperparameter that controls the modeling scope of item transitions. Each item

v_{j}^{q}

in

N_{ϵ} (v_{i}^{p})

contributes to the

ϵ

-neighbor set.

Based on the

ϵ

-neighbor set, the global item transitions are defined as

{(v_{i}, v_{j}) ∣ v_{i}, v_{j} \in V; v_{j} \in N_{ϵ} (v_{i})} .

(2)

To improve computational efficiency, the global item transitions are considered undirected.

The global item graph

G_{g} = (V_{g}, E_{g})

is defined as an undirected weighted graph, where

V_{g}

represents the node set containing all items in V, and

E_{g} = {e_{i j}^{g} ∣ (v_{i}, v_{j}); v_{i}, v_{j} \in V; v_{j} \in N_{ϵ} (v_{i})}

represents the edges between co-occurring items across all sessions. Additionally, the edges of each node

v_{i}

are weighted based on their co-occurrence frequency in all sessions, reflecting the importance of the relationship. For each item node

v_{i}

, only the top-k most important edges are retained to emphasize significant relationships.

Although effective, this co-occurrence-based structure may be less robust in domains with sparse or noisy interactions. As future work, we aim to explore adaptive graph construction techniques such as graph structure learning (GSL) or incorporating semantic similarity (e.g., based on textual or image metadata) to enhance generalizability across domains.

3.2.2. Session Graph

The session graph models adjacent items in the current session to learn local-level item embeddings. Inspired by SR-GNN and TA-GNN, each session sequence is modeled as a directed session graph, with GNNs used to learn item embeddings for the current session. The session graph is defined as

G_{s} = (V_{s}, E_{s})

, where

V_{s}

represents the item set and

E_{s}

represents the edge set. Each edge denotes the relationship between two adjacent items

(v_{i}, v_{j})

, known as session-level item transition patterns. Similar to prior work, each node has a self-loop to incorporate its own information during subsequent modeling.

Inspired by the GCE-GNN model, the session graph contains four types of edges:

r_{i n}

,

r_{o u t}

,

r_{i n - o u t}

, and

r_{s e l f}

. For an edge

(v_{i}, v_{j})

, the following definitions apply:

$r_{i n}$ : A unidirectional transition from $v_{j}$ to $v_{i}$ .
$r_{o u t}$ : A unidirectional transition from $v_{i}$ to $v_{j}$ .
$r_{i n - o u t}$ : A bidirectional transition between $v_{i}$ and $v_{j}$ .
$r_{s e l f}$ : A self-loop for $v_{s i}$ , representing self-transition information.

3.2.3. Global Category Graph

The global category graph is constructed similarly to the global item graph, with the goal of capturing co-occurrence relationships at the category level to learn category-level item embeddings.

The motivation for constructing the global category graph is to incorporate high-level semantic priors (i.e., category groupings) into the learning process. This structure compensates for the limited item co-occurrence in sparse sessions by providing aggregated signals from category relationships, which have proven useful in alleviating cold-start and data sparsity issues [20,21,22].

Let the category set be

C = {c_{1}, c_{2}, \dots, c_{k}}

, where each item v belongs to a category

c_{i}

. The global category graph is defined as

G_{c} = (V_{c}, E_{c})

, where

V_{c}

represents the category node set containing all categories, and

E_{c}

represents the edges between categories based on their co-occurrence across sessions.

Specifically, edges between category nodes are derived from item co-occurrence relationships within sessions, reflecting the frequency of category co-occurrence. For each category node

c_{i}

, its neighbor set is defined as the set of categories that frequently co-occur with

c_{i}

in user sessions. The edge weights encode these co-occurrence frequencies, highlighting the most influential category relationships.

In the global category graph, each category’s embedding is generated through GNN, which not only aggregates the embeddings of items belonging to that category but also propagates information from neighboring categories. This process captures higher-order semantic relationships between categories, allowing the model to learn category-aware item representations. These category embeddings enhance the model’s understanding of category dependencies, thereby improving recommendation performance. To ensure that the category graph retains meaningful semantic relationships while minimizing noise, we retain only the top-k most significant edges for each category node based on their co-occurrence frequency in user sessions. This sparsification step helps the model focus on the most relevant category relationships, strengthening its ability to learn effective category-aware representations.

While co-occurrence-based links are straightforward and data-driven, they may overlook latent semantic relationships between categories. Future improvements could involve external ontologies or hierarchical embeddings to uncover deeper inter-category dependencies and further enhance the robustness of category-level representations.

4. Proposed Method

4.1. Methodological Rationale

The design of MVGCL-GNN is guided by the need to comprehensively model various forms of dependencies in session-based recommendation. Traditional RNN-based models such as GRU4Rec and NARM primarily capture local sequential signals within sessions, while early GNN-based approaches like SR-GNN focus on item transitions without considering global patterns or semantic context.

To overcome these limitations, we propose a multi-view framework that integrates three complementary perspectives: session-level short-term transitions, global item co-occurrence, and category-level semantic relationships. This setup allows the model to learn both fine-grained and high-level patterns that affect user behavior. We further introduce contrastive learning to enhance embedding robustness, especially under data sparsity or noise.

We also considered simpler architectures that use only the session graph or a single global graph (global item graph or global category graph). However, these designs performed worse in our ablation studies, indicating that they lack the flexibility and depth needed to fully capture user behavior. For this reason, our final model combines multiple views and learning strategies to achieve a better balance between accuracy and robustness.

4.2. Model Overview

We propose a novel model for session-based recommendation called MVGCL-GNN. The model aims to capture user preferences in the current session by modeling item transitions at both the session level and the global level, thereby providing more accurate recommendations. Figure 3 illustrates the architecture of the proposed model, which consists of five main components:

(1) Global-level Item Embedding Learning Layer: This component learns global-level item embeddings across all sessions based on the global item graph structure. It utilizes Graph Convolutional Networks (GCNs) combined with a contrastive learning mechanism to enhance the embeddings.

(2) Session-level Item Embedding Learning Layer: Using Graph Attention Networks (GATs) on the session graph, this component learns session-level item embeddings for the current session.

(3) Global-level Category Embedding Learning Layer: A global category graph is constructed to model the relationships between item categories. Graph Convolutional Networks are applied to recursively aggregate the embeddings of each node from its neighbors, thereby learning category-level item embeddings. Additionally, a contrastive learning module is introduced, where noised embeddings are generated and compared against true embeddings to enhance the robustness of category embeddings and mitigate the impact of noise.

(4) Embedding Fusion Layer: This layer fuses global category embeddings, global item embeddings, and session-level item embeddings. A novel attention mechanism is employed to process the fused features to capture user preferences in the current session.

(5) Prediction Layer: This component outputs the predicted scores for candidate items to be recommended.

Next, we will elaborate on each component in detail.

4.2.1. Global-Level Item Embedding Learning Layer

Following prior works, we adopted a global item embedding learning module based on Graph Convolutional Networks (GCNs). This module is designed to extract item transition information from the global item graph, providing rich contextual support for the recommendation task. Items in the global item graph may appear in multiple sessions, and the transition information across these sessions is highly valuable for the current recommendation task. To fully utilize this information, we combine session-aware attention mechanisms with message propagation and aggregation operations to extract item embeddings from the global item graph.

We perform a weighted aggregation of the features of the neighboring nodes to generate the neighborhood feature representation of v as follows:

h_{N_{v_{i}}^{g}} = \sum_{v_{j} \in N_{v_{i}}^{g}} π (v_{i}, v_{j}) h_{v_{j}},

(3)

where

π (v_{i}, v_{j})

is the attention weight representing the importance of different neighbors, and

h_{v_{j}}

is the unified embedding representation of node

v_{j}

. To highlight the relevance to the preferences in the current session, we compute the attention weight

π (v_{i}, v_{j})

using a session-aware attention mechanism as follows:

π (v_{i}, v_{j}) = q_{1}^{T} \cdot LeakyReLU (W_{1} [s ⊙ h_{v_{j}} ‖ w_{i j}]),

(4)

where LeakyReLU is the activation function,

w_{i j} \in R^{1}

is the weight of the edge

(v_{i}, v_{j})

in the graph,

W_{1} \in R^{(d + 1) \times (d + 1)}

and

q_{1} \in R^{d + 1}

are trainable parameters, ⊙ denotes element-wise multiplication, and ‖ denotes vector concatenation. The session feature representation

s

is computed by averaging the representations of all items in the session:

s = \frac{1}{| S |} \sum_{v_{i} \in S} h_{v_{i}} .

(5)

This process dynamically adjusts the importance of neighboring nodes, assigning higher influence to nodes more relevant to the current session preferences. To ensure interpretability of the weights, we normalize the attention weights of the neighborhood nodes using the softmax function:

π (v_{i}, v_{j}) = \frac{exp (π (v_{i}, v_{j}))}{\sum_{v_{k} \in N_{v_{i}}^{g}} exp (π (v_{i}, v_{k}))},

(6)

After completing the weighted aggregation of neighborhood features, the item representation is formed by combining its own features

h_{v}

and the neighborhood features

h_{N_{v}^{g}}

:

h_{v}^{g} = ReLU (W_{2} [h_{v} ‖ h_{N_{v}^{g}}]),

(7)

where

W_{2} \in R^{d \times 2 d}

is a trainable weight matrix, and ReLU is the activation function used to introduce non-linearity.

To capture high-order neighborhood information, we extend the message propagation and aggregation process to a multi-layer architecture. By stacking multiple propagation layers, the model can integrate contextual information from farther neighbors. The representation of item v at the k-th layer is generated as

h_{v}^{g, (k)} = ReLU (W_{2}^{(k)} [h_{v}^{(k - 1)} ‖ h_{N_{v}^{g}}^{(k - 1)}]),

(8)

where

h_{v}^{(k - 1)}

is the representation of item v generated from the

(k - 1)

-th layer,

h_{N_{v}^{g}}^{(k - 1)}

is the neighborhood representation of v, and

W_{2}^{(k)} \in R^{d \times 2 d}

is the aggregation weight for the k-th layer. At the initial step (

k = 0

), the item representation

h_{v}^{(0)}

is the raw embedding.

Through layer-by-layer aggregation, the model captures high-order structural information of the items, enabling the embedding representation to incorporate not only its own features but also the transition information of the items in the global item graph.

4.2.2. Session-Level Item Embedding Learning Layer

To better capture the dynamic relationships of items in the session graph, we employ a Graph Attention Network (GAT) to learn the embeddings of items within the session graph. In the session graph, item nodes are connected by edges, representing pairwise transition relationships between items in the same session. Through the attention mechanism, the model can dynamically assign importance weights to neighboring nodes, enabling more precise item embeddings.

Specifically, for a given node

v_{i}

in the session graph, the importance weight of its neighbor

v_{j}

is computed using element-wise multiplication and a non-linear transformation:

e_{i j} = LeakyReLU (a_{r_{i j}}^{T} (h_{v_{i}} ⊙ h_{v_{j}})),

(9)

where

e_{i j}

represents the importance weight of neighbor

v_{j}

to node

v_{i}

,

r_{i j}

denotes the relationship between

v_{i}

and

v_{j}

,

a_{*} \in R^{d}

is the weight vector, and ⊙ denotes element-wise multiplication.

For different relationships, we trained four distinct weight vectors:

a_{in}

,

a_{out}

,

a_{in - out}

, and

a_{self}

, corresponding to incoming edges, outgoing edges, bidirectional edges, and self-loops, respectively. To ensure the comparability of weights between nodes, the computed attention weights are normalized using the softmax function:

α_{i j} = \frac{exp (e_{i j})}{\sum_{v_{k} \in N_{v_{i}}^{S}} exp (LeakyReLU (a_{r_{i k}}^{T} (h_{v_{i}} ⊙ h_{v_{k}})))},

(10)

where

α_{i j}

is the normalized attention weight. Due to the asymmetry of neighborhoods,

α_{i j} \neq α_{j i}

, reflecting the asymmetric nature of the attention weights.

After obtaining the attention weights, the final feature representation of node

v_{i}

is generated through the weighted aggregation of its neighbors’ features:

h_{v_{i}}^{s} = \sum_{v_{j} \in N_{v_{i}}^{S}} α_{i j} h_{v_{j}} .

(11)

This process dynamically adjusts the importance of neighboring nodes via the attention mechanism, effectively reducing noise and improving the model’s ability to capture item transition relationships. Additionally, it integrates the information of each node with its neighbors, yielding a local item embedding representation for every node.

4.2.3. Global-Level Category Embedding Learning Layer

To model global transition relationships between categories and enhance the semantic representation for recommendation, we design a global category embedding learning module based on Graph Convolutional Networks (GCNs). This module aims to extract transition information between categories from the global category graph, providing rich contextual support for the recommendation task. The global category graph is constructed by integrating the category transition information across all sessions, similar to the construction of the global item graph.

For any category c, its neighborhood

N_{ϵ} (c)

includes all directly connected category nodes. The neighborhood feature representation of category c is generated through weighted aggregation of the features of neighboring nodes:

h_{N_{c_{i}}^{g}} = \sum_{c_{j} \in N_{c}^{g}} π (c_{i}, c_{j}) h_{c_{j}},

(12)

where

π (c_{i}, c_{j})

is the attention weight representing the importance of different neighbors, and

h_{c_{j}}

is the unified embedding representation of category

c_{j}

. To emphasize the relevance of categories to the semantic preferences in the current session, we compute the attention weight

π (c_{i}, c_{j})

using a session-aware attention mechanism as follows:

π (c_{i}, c_{j}) = q_{1}^{T} \cdot LeakyReLU (W_{1} [s ⊙ h_{c_{j}} ‖ w_{i j}]),

(13)

where

w_{i j}

represents the weight of the edge between

v_{1}

and

v_{2}

,

W_{1} \in R^{(d + 1) \times (d + 1)}

and

q_{1} \in R^{d + 1}

are trainable parameters, ⊙ denotes element-wise multiplication, and ‖ denotes vector concatenation. The session feature

s

is calculated as the average feature of all categories in the session:

s = \frac{1}{| S |} \sum_{c_{j} \in S} h_{c_{j}} .

(14)

To adjust the importance of neighboring nodes and better match the preferences of the current session, the attention weights of neighboring nodes are normalized using the softmax function:

π (c_{i}, c_{j}) = \frac{exp (π (c_{i}, c_{j}))}{\sum_{c_{k} \in N_{c_{i}}^{g}} exp (π (c_{i}, c_{k}))} .

(15)

After completing the weighted aggregation of neighborhood features, the category representation is formed by combining its own features

h_{c}

and the neighborhood features

h_{N_{ϵ} (c)}

. To capture deeper connections between categories, we extend the aggregator to a multi-layer architecture. By stacking multiple aggregation layers, the model integrates information from farther neighbors, enabling it to capture higher-order connectivity and enhance the representation of global category embeddings. The representation of category c at the k-th layer is defined as

h_{c}^{g, (k)} = ReLU (W_{2}^{(k)} [h_{c}^{(k - 1)} ‖ h_{N_{c}^{g}}^{(k - 1)}]),

(16)

where

W_{2}^{(k)} \in R^{d \times 2 d}

is the aggregation weight for the k-th layer, ReLU is the activation function used to introduce non-linear features, and

h_{c}^{(k - 1)}

is generated from the

(k - 1)

-th layer.

Through layer-by-layer aggregation, the model captures high-order structural information of categories, enabling the embeddings to integrate not only their own features but also the transition information between categories in the global category graph. This process enriches the semantic representation of the model.

4.2.4. Contrastive Learning Module

To enhance the robustness and generalization of global item embeddings, we introduce a simple graph contrastive learning module inspired by SimGCL [24]. Specifically, we constructed two different views of item embeddings: one from the original global item graph and one from a perturbed version with random noise added to the normalized embeddings.

Let

{h_{i}}_{i = 1}^{L}

denote the item-level embeddings obtained from the original global item graph, and

{h_{i}^{noised}}_{i = 1}^{L}

denote the embeddings from the perturbed view. Following SimGCL, the perturbed view is computed as

h_{i}^{noised} = Norm (h_{i}) + ϵ \cdot δ, δ \sim N (0, I),

(17)

where

ϵ

is a hyperparameter controlling the noise intensity, and

Norm (\cdot)

represents

L_{2}

normalization. This additive perturbation allows the model to contrast each item with its noisy counterpart without relying on complex graph augmentations.

We compute the InfoNCE-based contrastive loss at the item level to maximize the agreement between the two views:

L_{s s l} = - \sum_{i = 1}^{L} log \frac{exp (sim (h_{i}, h_{i}^{noised}) / τ)}{\sum_{j = 1}^{L} exp (sim (h_{i}, h_{j}^{noised}) / τ)},

(18)

where

h_{i}, h_{j}^{noised} \in R^{d}

are

L_{2}

-normalized item embeddings from two views,

sim (a, b) = a^{⊤} b

denotes the dot-product similarity, and

τ

is a temperature hyperparameter.

This loss encourages each item embedding from the original view to be close to its noisy version while remaining distinguishable from other items, thus improving the stability of the item representation under structural perturbations. The hyperparameter

ϵ

is selected by validation and follows the SimGCL design guidelines.

4.2.5. Embedding Fusion Layer

After obtaining the global embeddings and session embeddings, it is necessary to fuse these pieces of information for the recommendation task. In this work, we adopt an attention-based fusion mechanism to combine the embeddings from three perspectives: the session-level item embedding, the global item embedding, and the global category embedding. This design is motivated by recent advances in multi-view learning [29], where attention has been shown to be an effective and lightweight approach to capture the relative importance of multiple input representations in a data-adaptive way.

Compared with alternatives like cross-attention [30] or gating networks [31], our soft attention mechanism provides a balance between representation flexibility and computational efficiency. Although cross-attention introduces a quadratic complexity and is more suitable for sequence-to-sequence alignment, our design focuses on weighting three fixed-length views. In addition, gating mechanisms typically require manually predefined interactions and are sensitive to feature scale, which may limit robustness in noisy sessions. In contrast, soft attention can learn adaptive weights for each view component in an end-to-end fashion with minimal parameter overhead.

First, we apply dropout to the global item embeddings and global category embeddings, and then perform a sum-pooling operation by combining them with the session embeddings. The formula is as follows:

h_{v}^{*} = SumPooling (dropout (h_{v}^{g, (k)}), dropout (h_{c}^{g, (k)}), h_{v_{i}}^{s}),

(19)

where

h_{v}^{*}

represents the fused item embedding that integrates multi-level information,

h_{v}^{g, (k)}

and

h_{c}^{g, (k)}

are the global item embeddings and global category embeddings, respectively, and

h_{v_{i}}^{s}

is the session embedding. Through this fusion operation, information from different levels is effectively integrated, providing a comprehensive foundation for generating session representations in the subsequent steps.

Considering that items clicked later in the session and noise filtering are usually more valuable for recommendations, we introduce a position-based attention mechanism. First, the sequential session data are input to the graph neural network, from which we obtain the representations of the items in the session, denoted as

H = [h_{v_{1}^{s}}^{*}, h_{v_{2}^{s}}^{*}, \dots, h_{v_{l}^{s}}^{*}] .

Next, we generate an inverse position embedding matrix for all items:

P = [p_{1}, p_{2}, \dots, p_{l}],

where

p_{i} \in R^{d}

represents the inverse position vector of item i. Then, we concatenate the item embeddings with the position embeddings and perform a non-linear transformation to generate the fused position embeddings:

z_{i}^{*} = \tan h (W_{3} [h_{v}^{*} ‖ p_{l - i + 1}] + b_{3}),

(20)

where

W_{3} \in R^{d \times 2 d}

,

b_{3} \in R^{d}

are learnable parameters, and ‖ denotes concatenation. The inverse position information can more accurately reflect the contribution of items to the target recommendation, as items closer to the target item generally have a greater impact on the prediction result.

The average of the session embeddings is obtained by the following formula:

s^{*} = \frac{1}{l} \sum_{i = 1}^{l} h_{v}^{*},

(21)

To further optimize the fused item embeddings, we use a soft attention mechanism to learn the corresponding weights. The attention score for each item is calculated as

β_{i} = q_{2}^{⊤} ReLU (W_{4} z_{i}^{*} + W_{5} s^{*} + c),

(22)

where

W_{4}, W_{5} \in R^{d \times d}

, and

c \in R^{d}

are learnable parameters. To address the vanishing gradient problem, the activation function is chosen to be ReLU.

Finally, the session embedding is generated by the following formula:

S = \sum_{i = 1}^{l} β_{i} h_{v}^{*},

(23)

where S is the final representation of the current session, which integrates global item embeddings, global category embeddings, and session embeddings, while also incorporating position and sequential information. Through this mechanism, the model is able to comprehensively consider the global transition patterns between items, category associations, and session context relationships, leading to more accurate recommendation results.

4.2.6. Prediction Layer

After generating the session representation S, the Prediction Layer is responsible for calculating the recommendation probability of each item based on the candidate item’s initial embedding and the session representation. Specifically, by performing a dot-product operation between the session representation and each item’s embedding, and combining it with the softmax function, the recommendation probability is obtained:

\hat{y_{i}} = softmax (S^{⊤} h_{v_{i}}),

(24)

where

\hat{y_{i}}

represents the probability that item

v_{i}

will be selected in the current session,

h_{v_{i}}

is the embedding representation of item

v_{i}

, and S is the session representation.

The model’s optimization objective uses the cross-entropy loss function, which is defined as

L_{rec} (\hat{y}) = - \sum_{i = 1}^{m} y_{i} log (\hat{y_{i}}) + (1 - y_{i}) log (1 - \hat{y_{i}}),

(25)

where y is the one-hot encoded vector of the target item,

y_{i}

indicates whether item

v_{i}

is the next clicked item, and m is the total number of candidate items.

To further enhance the model’s robustness, the Prediction Layer incorporates a self-supervised contrastive learning (SSL) loss as an auxiliary optimization objective for the recommendation task. The overall loss function is defined as

L = L_{rec} + λ L_{ssl},

(26)

where

λ

is a hyperparameter that balances the recommendation loss and the contrastive learning loss. Through this joint optimization objective, the model can learn from both supervised signals and self-supervised signals, thereby improving the accuracy and robustness of the recommendation results.

5. Experiments

To systematically validate the effectiveness of the MVGCL-GNN framework, we designed a comprehensive experimental protocol addressing five critical research questions:

RQ1: Does MVGCL-GNN achieve superior recommendation accuracy compared to state-of-the-art session-based recommendation (SBR) baselines?
RQ2: How do the global graph with global-level encoder and the category graph with category-aware encoder contribute to performance improvement? How does MVGCL-GNN perform with varying neighborhood propagation depths (k)?
RQ3: What quantitative impact do positional vector embeddings have on the model’s representational capability?
RQ4: How does MVGCL-GNN perform under different aggregation operations?
RQ5: To what extent do regularization hyperparameters (particularly dropout rates) influence the accuracy of MVGCL-GNN?

5.1. Experimental Setup

5.1.1. Dataset

This study uses two publicly available e-commerce datasets for experiments, namely Tmall and Nowplaying. The Tmall dataset is sourced from the IJCAI-15 competition and contains shopping log data from anonymous users on the Tmall platform, where each item is associated with category labels. The Nowplaying dataset describes users’ music listening behaviors and comes from [32]. This dataset also includes category information for songs. Both datasets have been widely used in research on session-based recommendation (SBR) systems.

To ensure fairness in the experiments, the same preprocessing steps [7,8] were applied to both datasets. Specifically, sessions with a length of 1 were filtered out, as well as items that appeared fewer than 5 times. Category information was also considered, and items without category labels were filtered out. Then, data from the previous week were used as the test set, with the remaining data used for training. Additionally, for each session data

S = [s_{1}, s_{2}, \dots, s_{n}]

and the corresponding category sequence

C = [c_{1}, c_{2}, \dots, c_{n}]

, training and test sequence pairs with labels were generated through sequence splitting. For example,

([s_{1}], s_{2}, [c_{1}], c_{2}), ([s_{1}, s_{2}], s_{3}, [c_{1}, c_{2}], c_{3}), \dots, ([s_{1}, s_{2}, \dots, s_{n - 1}], s_{n}, [c_{1}, c_{2}, \dots, c_{n - 1}], c_{n}) .

Table 2 shows the statistics of each dataset after preprocessing.

5.1.2. Baselines

To evaluate the effectiveness of our proposed MVGCL-GNN model, we compared it with the following classic models and state-of-the-art deep learning recommendation algorithms:

POP: It recommends the most frequent items in the training set, serving as a simple baseline recommendation model commonly used to compare the performance of more complex models.
Item-KNN [1]: Based on the similarity between items, this model recommends items by calculating the cosine similarity between items in the current session and items in other sessions. It is a traditional collaborative filtering-based recommendation method.
FPMC [16]: It combines matrix factorization with a first-order Markov chain to simultaneously capture the sequential effects and preferences of users. In this model, the latent representations of users are ignored, and the recommendation score is computed solely based on item and session information.
STAMP [17]: This method introduces a short-term attention mechanism to replace traditional RNN models, relying on self-attention on the last item in the current session to capture users’ short-term interests, addressing the limitations of traditional methods in capturing short-term preferences.
NARM [5]: An RNN-based deep learning method that combines an attention mechanism to weigh the sequential behavioral features of users. This model effectively captures both long-term and short-term preferences of users and improves recommendation performance by learning contextual information within sequences.
GRU4Rec [4]: The first session-based recommendation model based on RNNs, using gated recurrent units (GRUs) to model user behavior sequences. It captures the temporal characteristics of user behavior and is suitable for sequential recommendation scenarios.
SR-GNN [7]: The first graph neural network (GNN)-based model, which uses gated GNN layers to obtain item embeddings and calculates session-level embeddings via self-attention on the last item in the session. This model captures both global and current user preferences, improving upon the limitations of traditional RNN methods.
CSRM [33]: This method uses a memory network to model multiple recent sessions and improves the prediction accuracy of the current session by considering changes in user intent. It effectively captures complex information in multi-turn sessions.
GCE-GNN [10]: This model combines graph neural networks, employing a two-level embedding learning structure with a global graph and a session graph. It aggregates item representations through position embeddings, enabling better capture of both global and local information to enhance recommendation accuracy.

5.1.3. Evaluation Metrics

To evaluate the performance of the model, we referred to previous studies [7,9,10,20] and adopted two widely used Top-K recommendation evaluation metrics: P@K (Precision) and MRR@K (Mean Reciprocal Rank).

5.1.4. Parameter Settings

To ensure a fair comparison across all models, we adopted the same hyperparameter settings as previous studies [7,10,14,20]. Specifically, the dimension of the latent vectors was fixed to 100, and the mini-batch size was set to 100 for all models. For the CSRM model, the memory size was also set to 100, consistent with the batch size. For the FGNN model, the number of GNN layers was set to 3, with 8 heads for the multi-head attention mechanism. All model parameters were initialized using a Gaussian distribution with a mean of 0 and a standard deviation of 0.1. In terms of optimization, we used the Adam optimizer with an initial learning rate of 0.001, which decays by a factor of 0.1 every 3 epochs. To prevent overfitting, an L2 regularization term was applied with a penalty coefficient of

10^{- 5}

. Additionally, the dropout rate was searched over the range

{0.1, 0.2, \dots, 0.9}

and the best value was selected based on performance on a validation set, which is a random 10% subset of the training set. For graph construction, the number of neighbors was set to 12, meaning that each item had a maximum of 12 neighbors. The maximum distance (

ϵ

) between adjacent items was set to 3, controlling the range of information propagation in the graph.

5.2. Comparison with Baselines (RQ1)

To validate the overall performance of MVGCL-GNN, we compared it with several representative baseline models. Table 3 shows the experimental results in terms of metrics such as P@10 and M@20, with the best results highlighted in bold. It can be observed that MVGCL-GNN achieved significant improvements in most metrics across all datasets, demonstrating its effectiveness and superiority.

As shown in Table 3, MVGCL-GNN outperformed other methods on the Tmall and Nowplaying datasets, particularly in P@10 and M@20, where it significantly led. This result indicates that MVGCL-GNN is effective in capturing the complex dependencies between user behaviors and items.

Traditional methods typically perform poorly. POP and Item-KNN are early traditional recommendation methods, while FPMC makes recommendations based on Markov chains. Their performance is lower than that of MVGCL-GNN, mainly because they do not leverage advanced deep learning networks and cannot effectively model complex user interest patterns. In contrast, MVGCL-GNN introduces category graphs and noise addition to the global graph, and optimizes the global graph representation using simple graph contrastive learning (SimGCL), significantly improving recommendation performance.

Compared with traditional methods, recent deep learning approaches exhibit stronger performance in capturing complex user behaviors. Although GRU4Rec performs worse than Item-KNN on the Diginetica and Nowplaying datasets, it still demonstrates the effectiveness of RNNs in sequence modeling. However, GRU4Rec only focuses on the relationships between items in the sequence, neglecting changes in user preferences, which is why it performs worse than NARM and STAMP. NARM is an RNN-based sequential model that considers unidirectional transitions between items, while STAMP uses an attention mechanism and multi-layer perceptron (MLP) networks to capture both global and current user preferences. Despite STAMP’s enhancement through self-attention mechanisms, it still fails to comprehensively model complex inter-session dependencies.

Next, the CSRM method surpassed NARM and STAMP on the Diginetica and Tmall datasets, indicating that considering item transition information from other sessions can significantly improve recommendation performance. However, compared to MVGCL-GNN, CSRM still fails to fully capture global context information. The advantage of MVGCL-GNN lies in the addition of global category graphs and the optimization of global graph embeddings through simple graph contrastive learning (SimGCL). In this way, MVGCL-GNN can better model complex relationships between items and changes in user preferences, further improving recommendation accuracy. Compared to other GNN methods such as SR-GNN and GCE-GNN, MVGCL-GNN significantly enhances the representational power and information richness of the global graph, thus achieving superior performance on the Tmall and Nowplaying datasets.

5.3. The Impact of Different Graph Embeddings (RQ2)

In this section, we present the experiments conducted on two datasets to evaluate the effectiveness of the global item graph, global category graph, and session graph embeddings in the MVGCL-GNN model. We designed four contrast models to analyze the impact of removing specific graph embeddings:

MVGCL-GNN w/o session: MVGCL-GNN without the session graph embedding, relying only on global item graph and global category graph embeddings.
MVGCL-GNN w/o item_g: MVGCL-GNN without the global item graph embedding, using only session graph and global category graph embeddings.
MVGCL-GNN w/o category_g: MVGCL-GNN without the global category graph embedding, using only session graph and global item graph embeddings.
MVGCL-GNN: The full model, which includes all graph embeddings: session graph, global item graph, and global category graph.

Table 4 shows the comparison between these contrast models. The results indicate that including both session and global graph embeddings significantly improves the model’s performance. Specifically, removing the session graph leads to a significant performance drop, suggesting that session-level graph embeddings are crucial for capturing fine-grained item transitions within sessions.

Additionally, removing the global item graph reduces the model’s ability to capture global item co-occurrence relationships, which negatively affects recommendation accuracy. However, the performance drop due to removing the global category graph is relatively smaller. It is worth noting that when the global category graph is included, the model’s performance significantly improves. Incorporating category information helps the model better understand item attribute relationships, thereby providing more contextual information during the recommendation process, which enhances both accuracy and stability.

From the results on the Tmall and Nowplaying datasets, it is evident that MVGCL-GNN outperforms all other contrast models in terms of both P@20 and MRR@20, highlighting the importance of considering category information for improving model performance in recommendation systems. Removing the category graph embedding results in a decline in recommendation accuracy, further proving the critical role of category information in global graph embeddings.

5.4. Impact of Position Vector (RQ3)

In the MVGCL-GNN model, the position vector is used to help the model understand the contribution of each item in a session. Although earlier models, such as SASRec, introduced forward position vectors to improve recommendation performance, we believe that the forward position vector may have limited effectiveness for session-based recommendation (SBR) tasks. To investigate this, we designed several contrast models:

MVGCL-GNN-NP: This variant replaces the reverse position vector with a forward position vector.
MVGCL-GNN-SA: This version substitutes position-aware attention with self-attention mechanisms.

Table 5 presents the performance of the three models across different datasets. Our experiments show that MVGCL-GNN, which incorporates reversed position embedding, outperforms the other two variants.

MVGCL-GNN-NP performs poorly on both datasets. This is likely due to the failure of the forward position vector to effectively capture the item distance within sessions, leading to poorer recommendations, especially in sessions with varying lengths.

MVGCL-GNN-SA performs better than MVGCL-GNN-NP, highlighting the importance of self-attention, which can better weigh the relevance of items. However, despite the improvement, it still does not surpass MVGCL-GNN, as it lacks a nuanced understanding of item position, which is crucial for better session modeling.

When comparing all three models, the reversed position embedding in MVGCL-GNN proves to be more effective. The combination of the reversed position vector and the attention mechanism allows the model to more accurately capture the importance of each item in a session, particularly in sessions with varying lengths. This combination improves the model’s performance, as the attention mechanism helps filter out irrelevant information.

5.5. Impact of Dropout (RQ4)

To prevent the model from overfitting, we adopted dropout as a regularization strategy [34]. This technique has been proven effective in various neural network architectures [35,36], including graph neural networks. The core idea of dropout is to randomly drop a portion of neurons with probability during training, while using all neurons during testing. Figure 4 shows the impact of dropout in Equation (18) on the Nowplaying and Tmall datasets. From the figure, it can be observed that when the dropout rate is low, the model performs poorly, indicating that the data are prone to overfitting. As the dropout rate increases, the model’s performance gradually improves, reaching its optimal value at approximately 0.3 for the Nowplaying dataset and 0.5 for the Tmall dataset. However, when the dropout rate continues to increase, the model’s performance starts to decline, possibly due to the reduced number of available neurons, which limits its learning capacity and negatively affects recommendation quality.

5.6. Model Complexity and Efficiency Analysis (RQ5)

Although MVGCL-GNN incorporates multiple graph encoders and a contrastive learning module, its computational complexity remains manageable due to the modular design and shared parameterization across graph layers.

Specifically, each graph encoder (session, global item, and global category) adopts lightweight GNN architectures with a limited number of layers (typically 2 or 3), which ensures linear time complexity with respect to the number of nodes and edges in each graph. Since the graphs are preconstructed and relatively sparse (as each node retains only top-k neighbors), message passing within the graphs is efficient and parallelizable.

In addition, the contrastive learning module is implemented via SimGCL, which avoids costly graph augmentations and only requires a single forward–backward pass with noise injection. This design significantly reduces training overhead compared to other contrastive GNNs such as GraphCL or GCC.

From a memory perspective, all graph modules operate independently with separate adjacency matrices, allowing efficient memory allocation during training. Batch-wise processing and shared embedding layers further contribute to memory reuse and optimization.

While our model is more complex than single-view GNN methods such as SR-GNN or GCE-GNN, its additional components introduce only a linear increase in parameters and computation. The overall architecture remains scalable to large-scale datasets, as demonstrated in our experiments on the Tmall and Nowplaying datasets, which contain 40,728 and 60,417 unique items, respectively.

We acknowledge the importance of empirical runtime comparisons. We are currently unable to provide direct measurements of training and inference times due to device limitations. Nonetheless, the proposed theoretical analysis and architectural design work together to show that MVGCL-GNN is computationally efficient and can be extended to session-based recommendation tasks in the real world. In future work, we plan to provide a more detailed runtime and memory profiling under different configurations and compare the training efficiency across models.

6. Conclusions and Future Work

In this paper, we propose MVGCL-GNN, a novel session-based recommendation model that integrates multi-view graph construction and simple graph contrastive learning. Specifically, our model builds three types of graph structures—session graph, global item graph, and global category graph—to capture the complex dependencies between items at different semantic levels. We further incorporate a soft attention-based fusion module with position-aware embeddings to generate effective session representations.

To enhance model robustness and alleviate the influence of noise, we adopt a SimGCL-based contrastive learning strategy, which efficiently improves the quality of item embeddings without requiring complex data augmentation. Extensive experiments conducted on two real-world datasets, Tmall and Nowplaying, demonstrate that our model consistently outperforms several state-of-the-art baselines across multiple metrics. The ablation study confirms the individual contributions of each component, including the session graph, global-level graphs, positional embeddings, and the contrastive learning module.

While MVGCL-GNN demonstrates strong performance on e-commerce and music datasets (Tmall and Nowplaying), we acknowledge that the model’s generalizability to other domains such as news, social media, or cold-start scenarios remains unexplored in this study. Nevertheless, the modular architecture of MVGCL-GNN—particularly the separate modeling of global item transitions, category semantics, and session-level behaviors—makes it highly adaptable to other session-based recommendation tasks. In future work, we plan to evaluate MVGCL-GNN on diverse datasets and investigate its robustness under sparse or cold-start conditions, further extending its applicability in real-world recommendation systems. Another limitation of the model is that the constructed graph structures—session graph, global item graph, and global category graph—are all static and do not evolve with time or user behavior. However, in real-world scenarios, user preferences and item relationships are often dynamic and context-dependent. As part of our future work, we plan to explore dynamic graph modeling techniques that can incrementally update graph structures based on recent user interactions. This would enable MVGCL-GNN to better capture evolving session patterns and improve recommendation accuracy in real-time or streaming environments.

Author Contributions

Methodology, software, Writing—original draft, P.H.; Writing—review & editing, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Development Fund, Mcao SAR. Grant number: 0004/2023/ITP1.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 285–295. [Google Scholar]
Wang, S.; Cao, L.; Wang, Y.; Sheng, Q.Z.; Orgun, M.A.; Lian, D. A survey on session-based recommender systems. ACM Comput. Surv. CSUR 2021, 54, 154. [Google Scholar] [CrossRef]
Shani, G.; Heckerman, D.; Brafman, R.I.; Boutilier, C. An MDP-based recommender system. J. Mach. Learn. Res. 2005, 6, 1265–1295. [Google Scholar]
Hidasi, B. Session-based Recommendations with Recurrent Neural Networks. arXiv 2015, arXiv:1511.06939. [Google Scholar]
Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; Ma, J. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 1419–1428. [Google Scholar]
Sanchez-Lengeling, B.; Reif, E.; Pearce, A.; Wiltschko, A.B. A gentle introduction to graph neural networks. Distill 2021, 6, e33. [Google Scholar] [CrossRef]
Wu, S.; Tang, Y.; Zhu, Y.; Wang, L.; Xie, X.; Tan, T. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 346–353. [Google Scholar]
Xu, C.; Zhao, P.; Liu, Y.; Sheng, V.S.; Xu, J.; Zhuang, F.; Fang, J.; Zhou, X. Graph contextualized self-attention network for session-based recommendation. In Proceedings of the International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; Volume 19, pp. 3940–3946. [Google Scholar]
Qiu, R.; Li, J.; Huang, Z.; Yin, H. Rethinking the item order in session-based recommendation with graph neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 579–588. [Google Scholar]
Wang, Z.; Wei, W.; Cong, G.; Li, X.L.; Mao, X.L.; Qiu, M. Global context enhanced graph neural networks for session-based recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 169–178. [Google Scholar]
Chen, Y.; Xiong, Q.; Guo, Y. Session-based recommendation: Learning multi-dimension interests via a multi-head attention graph neural network. Appl. Soft Comput. 2022, 131, 109744. [Google Scholar] [CrossRef]
Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv 2018, arXiv:1808.06670. [Google Scholar]
Zhou, K.; Wang, H.; Zhao, W.X.; Zhu, Y.; Wang, S.; Zhang, F.; Wang, Z.; Wen, J.R. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, Ireland, 19–23 October 2020; pp. 1893–1902. [Google Scholar]
Xia, X.; Yin, H.; Yu, J.; Wang, Q.; Cui, L.; Zhang, X. Self-supervised hypergraph convolutional networks for session-based recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 4503–4511. [Google Scholar]
Jannach, D.; Ludewig, M. When recurrent neural networks meet the neighborhood for session-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems, Como, Italy, 27–31 August 2017; pp. 306–310. [Google Scholar]
Rendle, S.; Freudenthaler, C.; Schmidt-Thieme, L. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 811–820. [Google Scholar]
Liu, Q.; Zeng, Y.; Mokhosi, R.; Zhang, H. STAMP: Short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1831–1839. [Google Scholar]
Pan, Z.; Cai, F.; Ling, Y.; de Rijke, M. Rethinking item importance in session-based recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 1837–1840. [Google Scholar]
Yuan, J.; Song, Z.; Sun, M.; Wang, X.; Zhao, W.X. Dual sparse attention network for session-based recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 4635–4643. [Google Scholar]
Liu, L.; Wang, L.; Lian, T. CaSe4SR: Using category sequence graph to augment session-based recommendation. Knowl.-Based Syst. 2021, 212, 106558. [Google Scholar] [CrossRef]
Xu, H.; Yang, B.; Liu, X.; Fan, W.; Li, Q. Category-aware multi-relation heterogeneous graph neural networks for session-based recommendation. Knowl.-Based Syst. 2022, 251, 109246. [Google Scholar] [CrossRef]
Cai, R.; Wu, J.; San, A.; Wang, C.; Wang, H. Category-aware collaborative sequential recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 388–397. [Google Scholar]
Chen, R.; Ma, P.; Zhu, Y.; Chen, Q. Category-aware Graph Neural Network for Session-based Recommendation. In Proceedings of the 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS), Nanjing, China, 10–12 January 2023; IEEE: Piscataway, NJ, USA; pp. 891–899. [Google Scholar]
Yu, J.; Yin, H.; Xia, X.; Chen, T.; Cui, L.; Nguyen, Q.V.H. Are graph augmentations necessary? Simple graph contrastive learning for recommendation. In Proceedings of the 45th international ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1294–1303. [Google Scholar]
Wu, J.; Wang, X.; Feng, F.; He, X.; Chen, L.; Lian, J.; Xie, X. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 726–735. [Google Scholar]
Shi, Z.; Wang, X.; Lipani, A. Self contrastive learning for session-based recommendation. In Proceedings of the European Conference on Information Retrieval, Glasgow, UK, 24–28 March 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 3–20. [Google Scholar]
Jin, D.; Wang, L.; Zhang, H.; Zheng, Y.; Ding, W.; Xia, F.; Pan, S. A survey on fairness-aware recommender systems. Inf. Fusion 2023, 100, 101906. [Google Scholar] [CrossRef]
Zhang, H.; Wu, B.; Yuan, X.; Pan, S.; Tong, H.; Pei, J. Trustworthy graph neural networks: Aspects, methods, and trends. Proc. IEEE 2024, 112, 97–139. [Google Scholar] [CrossRef]
Xu, C.; Tao, D.; Xu, C. A survey on multi-view learning. arXiv 2013, arXiv:1304.5634. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 933–941. [Google Scholar]
Zangerle, E.; Pichl, M.; Gassler, W.; Specht, G. # nowplaying music dataset: Extracting listening behavior from twitter. In Proceedings of the First International Workshop on Internet-Scale Multimedia Management, Orlando, FL, USA, 7 November 2014; pp. 21–26. [Google Scholar]
Wang, M.; Ren, P.; Mei, L.; Chen, Z.; Ma, J.; De Rijke, M. A collaborative session-based recommendation approach with parallel memory modules. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 345–354. [Google Scholar]
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174. [Google Scholar]
Wang, X.; Ji, H.; Shi, C.; Wang, B.; Ye, Y.; Cui, P.; Yu, P.S. Heterogeneous graph attention network. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 2022–2032. [Google Scholar]

Figure 1. Illustrations of session graph and global item graph.

Figure 2. An example of global category graph.

Figure 3. An overview of the proposed framework. The model constructs a global item graph and a global category graph to capture item co-occurrence and category-level associations. A session graph is also generated for each session to model short-term dependencies. The framework employs three embedding learning layers: a Global-level Category Embedding Learning Layer to extract category representations, a Global-level Item Embedding Learning Layer with contrastive learning to refine item embeddings, and a Session-level Item Embedding Learning Layer to model session-specific item interactions. The learned embeddings are fused in the Embedding Fusion Layer, where an Attention Network assigns importance weights. Finally, the Prediction Layer ranks candidate items based on the fused representations.

Figure 4. Impact of different dropout settings on model performance.

Table 1. Comparison of models.

Model	Graph-Based	Multi-View	Category-Aware	Contrastive Learning
GRU4Rec	✗	✗	✗	✗
NARM	✗	✗	✗	✗
STAMP	✗	✗	✗	✗
SR-GNN	✓	✗	✗	✗
GCE-GNN	✓	✓	✗	✗
CaSe4SR	✓	✗	✓	✗
CM-HGNN	✓	✗	✓	✗
SGL	✗	✗	✗	✓
MVGCL-GNN (Ours)	✓	✓	✓	✓

Table 2. Statistics of the datasets.

Datasets	Tmall	Nowplaying
# of training sessions	351,268	825,304
# of test sessions	255,898	89,824
# of items	40,728	60,417
# of categories	711	1146
Average length	6.69	7.42

Table 3. Effectiveness comparison on two datasets.

Datasets	Tmall				Nowplaying
Metrics	P@10	M@10	P@20	M@20	P@10	M@10	P@20	M@20
POP	1.67	0.88	2.00	0.90	1.86	0.83	2.28	0.86
Item-KNN	6.65	3.11	9.15	3.31	10.96	4.55	15.94	4.91
FPMC	13.10	7.12	16.06	7.32	5.28	2.68	7.36	2.82
GRU4Rec	9.47	5.78	10.93	5.89	6.74	4.40	7.92	4.48
NARM	19.17	10.42	23.30	10.70	13.6	6.62	18.59	6.93
STAMP	22.63	13.12	26.47	13.36	13.22	6.57	17.66	6.88
CSRM	24.54	13.62	29.46	13.96	13.20	6.08	18.14	6.42
GCE-GNN	28.01	15.08	33.42	15.42	16.94	8.03	22.37	8.40
MVGCL-GNN	29.34	16.11	34.96	16.50	17.11	8.22	22.59	8.60

Note: In the table, data in bold indicates optimal performance.

Table 4. Performance impact of different graph embeddings in MVGCL-GNN.

Dataset	Tmall		Nowplaying
Measures	P@20	MRR@20	P@20	MRR@20
w/o session	15.52	7.17	20.25	8.02
w/o item_g	33.86	15.19	23.00	7.62
w/o category_g	32.94	15.20	22.33	8.58
MVGCL-GNN	34.96	16.50	22.59	8.60

Table 5. The performance of contrast models.

Dataset	Tmall		Nowplaying
Measures	P@20	MRR@20	P@20	MRR@20
MVGCL-GNN-NP	33.37	15.50	20.19	6.62
MVGCL-GNN-SA	30.63	14.15	19.39	7.32
MVGCL-GNN	34.96	16.50	22.59	8.60

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Huang, P.; Wang, C. Multi-View Graph Contrastive Neural Networks for Session-Based Recommendation. Mathematics 2025, 13, 1530. https://doi.org/10.3390/math13091530

AMA Style

Huang P, Wang C. Multi-View Graph Contrastive Neural Networks for Session-Based Recommendation. Mathematics. 2025; 13(9):1530. https://doi.org/10.3390/math13091530

Chicago/Turabian Style

Huang, Pengbo, and Chun Wang. 2025. "Multi-View Graph Contrastive Neural Networks for Session-Based Recommendation" Mathematics 13, no. 9: 1530. https://doi.org/10.3390/math13091530

APA Style

Huang, P., & Wang, C. (2025). Multi-View Graph Contrastive Neural Networks for Session-Based Recommendation. Mathematics, 13(9), 1530. https://doi.org/10.3390/math13091530

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-View Graph Contrastive Neural Networks for Session-Based Recommendation

Abstract

1. Introduction

2. Related Work

2.1. Session-Based Recommendation

2.2. Category-Aware Recommendation

2.3. Self-Supervised Learning

2.4. Fairness and Trustworthiness in Graph-Based Recommendation

2.5. Limitations of Previous Work

3. Preliminaries

3.1. Problem Statement

3.2. Global Category Graph, Global Item Graph and Session Graph

3.2.1. Global Item Graph

3.2.2. Session Graph

3.2.3. Global Category Graph

4. Proposed Method

4.1. Methodological Rationale

4.2. Model Overview

4.2.1. Global-Level Item Embedding Learning Layer

4.2.2. Session-Level Item Embedding Learning Layer

4.2.3. Global-Level Category Embedding Learning Layer

4.2.4. Contrastive Learning Module

4.2.5. Embedding Fusion Layer

4.2.6. Prediction Layer

5. Experiments

5.1. Experimental Setup

5.1.1. Dataset

5.1.2. Baselines

5.1.3. Evaluation Metrics

5.1.4. Parameter Settings

5.2. Comparison with Baselines (RQ1)

5.3. The Impact of Different Graph Embeddings (RQ2)

5.4. Impact of Position Vector (RQ3)

5.5. Impact of Dropout (RQ4)

5.6. Model Complexity and Efficiency Analysis (RQ5)

6. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI