Article

Enhanced Graph Diffusion Learning with Transformable Patching via Curriculum Contrastive Learning for Session Recommendation

1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 China–Chile ICT Belt and Road Joint Laboratory, Wuhan 430205, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(10), 2089; https://doi.org/10.3390/electronics14102089
Submission received: 13 April 2025 / Revised: 15 May 2025 / Accepted: 15 May 2025 / Published: 21 May 2025

Abstract

The fusion modeling of intra-session item representations and inter-session item transition patterns has shown performance advantages for session recommendation. However, existing research still faces the following challenges: (1) the time-varying effects of the complex relationships among item transitions within sessions need to be explored in depth; and (2) inter-session item transition patterns lack an effective representation. To address these challenges, we propose a new session recommendation model, named EGDLTP-CCL. Specifically, we first design a patch-enhanced gated neural network to represent intra-session item transition patterns, which accurately captures the time-varying impacts of the complex relationships among item transitions through a designed transformer patching strategy. We then develop an energy-constraint-based graph diffusion model to capture inter-session item transition patterns, where the introduced energy-constraint strategy mitigates the poor simulation of real inter-session item transitions by the graph diffusion model. In addition, the patch-enhanced gated neural network and the energy-constrained graph diffusion model are treated as two different views in a contrastive learning framework, and a curriculum learning strategy that explores how to effectively select and train negative samples further improves the performance of the contrastive learning task. Finally, we combine and jointly optimize the recommendation task and the curriculum contrastive learning task based on a multi-task learning strategy to further improve recommendation performance. Experiments on real-world datasets show that EGDLTP-CCL significantly outperforms state-of-the-art methods.

1. Introduction

Session recommendation models the items a user interacts with as several session sequences and can generate effective personalized recommendations relying only on the limited item sequence of an anonymous user. Therefore, session recommendation has attracted increasing attention from researchers in recent years [1,2].
Traditional approaches focus on using Markov chains to generate recommendations but are limited by strong independence assumptions [3]. Recently, deep learning has provided effective solutions to these problems. While these approaches [4,5] have been successful in some respects, they ignore the potentially complex transition information between items. Encouraged by the great success of graph neural networks in artificial intelligence, several recent studies have focused on graph structures, which aggregate much richer information about the current session. GCE-GNN [6] learns global representations of items outside the session using a graph attention mechanism on a global item set. SR-HGNN [7] proposes a higher-order hybrid gated network for session recommendation. These graph-neural-network-based approaches achieve very good performance, but they still face the following challenges:
(1)
Several research works [2,6,8] explore inter-session information transition patterns to further improve recommendation performance through graph neural network techniques. Although these approaches achieve very good performance, the graph structures that encode the inter-session global item transition patterns may contain isolated nodes or redundant structures [9]. As a result, it is difficult for graph-convolution operations to accurately capture the true item transition relationships between sessions [10]. Several studies have introduced the idea of diffusion [11] into the learning process of graph structures [12]. However, during graph-structure-based diffusion modeling, noise and other factors may cause the diffusion-learned representation to deviate from the true value, so it cannot accurately characterize the real inter-session item transition patterns.
(2)
The dynamic, time-evolving characteristics of item sequences within a session lack in-depth mining. In recent years, some researchers have proposed different temporal encoding functions [13,14] to learn specific temporal change patterns between items. Although these approaches have achieved some success, they essentially assume that sessions consist of items users interact with in strict temporal order, ignoring the effect of asynchronous timestamps and future timestamps in the sequential data on item transition patterns, and lacking in-depth mining of temporal influence patterns [15].
To address these issues, inspired by the literature [16,17], we propose a novel model for session recommendation (EGDLTP-CCL). First, we construct a global item co-occurrence graph based on all sessions, and then use an energy-constrained graph-diffusion-based algorithm to learn the global representation of item transition patterns. The energy-constraint-based L2-weighted diffusion function guides the direction of diffusion information transfer with layer-by-layer regularization, which mitigates the information loss caused by multiple iterative aggregations over a node's first-order neighbors, integrates information from multi-hop neighboring nodes, and accurately characterizes inter-session item transition patterns. Meanwhile, we design a patch-enhanced gated neural network consisting of a GGNN and an LSTM to characterize the dynamic change patterns of items within sessions. Inspired by the literature [17], we introduce the transformer patch strategy to model the time-varying characteristics of item change patterns in a session: the time series is split into sub-sequence-level patches that are then aggregated into item-sequence-level patches based on the time step, which focuses on local semantics while capturing global semantics. In addition, the energy-constrained graph diffusion model and the patch-enhanced hybrid neural network model are used as two different views, and the mutual information between the two view representations is maximized. At the same time, appropriate negative samples for the contrastive learning framework are accurately selected through curriculum learning, which further improves the performance of the contrastive learning task. Finally, the performance of the recommendation model is improved by integrating and jointly training the recommendation task and the contrastive learning task through a multi-task strategy. Comprehensive experiments demonstrate the superior performance of EGDLTP-CCL.
In summary, the contributions of this paper can be summarized as follows:
1.
The EGDLTP-CCL model employs an energy-constrained graph diffusion technique to learn inter-session item transition patterns. The energy-constraint function guides the global graph diffusion process and alleviates the bias between the graph structure diffusion learning representation and the actual value.
2.
The EGDLTP-CCL model employs a patch-enhanced gated graph neural network to learn intra-session item transition patterns. In particular, the model emphasizes the potential of the patch strategy to capture temporally varying properties of intra-session item transition patterns, enhancing the local semantic features of temporal items and capturing the full semantics through a time-step-based aggregation operation.
3.
The EGDLTP-CCL model employs a contrastive learning framework based on curriculum learning, in which the proposed novel negative sampling method based on curriculum learning enhances the model’s performance and generalization capability.
4.
The performance of the EGDLTP-CCL model is tested on three real-world datasets, and the experimental results show the significant superiority of the proposed model when compared to the state-of-the-art works.

2. Related Work

Our work covers two areas of the literature. We review recent advances in the following aspects.

2.1. Deep Learning Based Session Recommendation

Deep learning has achieved great success in the field of recommendation, especially session recommendation. Hidasi et al. [18] first proposed a new session recommendation model, GRU4REC, which utilizes gated recurrent units to model sequential behaviors in a session. Li et al. [19] integrated an attention mechanism into RNNs to better represent item transition patterns in sessions. Wu et al. [20] utilized an MLP with an attention mechanism to fuse different session context representations and learn a more comprehensive session representation.
In recent years, graph neural networks have achieved successful results in the field of AI. The seminal work is SR-GNN [21], which constructs session sequences as graph structures and uses gated graph neural networks to learn item representations. Feng et al. [22] pay more attention to the impact of the transition model on the target items. Qiu et al. [23] propose a weighted-attention GNN feature encoder to mine the user's latent intent, while Wang et al. [24] propose a hypergraph attention network to model the contextual information of items at different granularity levels. Chen et al. [25] introduce a search algorithm to find the optimal graph structure during composition, and Zhuo et al. [26] propose a multi-hop memory transformer from a multi-view perspective to comprehensively capture latent user intent. In contrast, our approach incorporates graph diffusion to model inter-session item transition patterns and designs a patch-enhanced gating network that strengthens temporal influence modeling to capture intra-session item transition patterns more accurately.

2.2. Application of Contrastive Learning in Recommendation

Contrastive learning (CL) has received widespread attention as an emerging learning paradigm that supports training on large amounts of unlabeled data, alleviates data sparsity, and reduces a model's dependence on labeled data. Xu et al. [27] employ a novel contrastive learning framework on graph structures to improve recommendation performance by jointly considering local and global information during feature representation. Zhang et al. [28] design a hypergraph contrastive learning framework based on a multi-granularity compositing strategy to learn different representations of user-user interactions in group recommendation. Wei et al. [29] combine information bottlenecks with contrastive learning to optimize mutual information maximization and improve recommendation performance. Zhao et al. [30] use a contrastive learning approach with disentangled causal embeddings to mitigate data sparsity and long-tailed items in recommendation. Yang et al. [31] combine spectral-aware augmentation with contrastive learning so that the different views of the contrastive framework attend to spatial-domain information of different frequencies. In contrast, our approach employs a curriculum learning strategy to explore how to select better negative samples.

3. EGDLTP-CCL Model

The overall framework of the EGDLTP-CCL model is shown in Figure 1. The model is composed of four main parts. (1) Inter-session item transition pattern representation learning. In this module, all session data are first utilized to construct a global synergy graph, and then energy-constraint-based graph diffusion is executed on this graph to efficiently learn global inter-session item transition patterns. (2) Intra-session item transition pattern representation learning. In this module, consistent with the literature [32], we first utilize the disentanglement technique for factor-based item initialization. Then, we design a patch-enhanced gated neural network to accurately model the complex transition relations between items in the current session, focusing on local semantics in the time series and comprehensively capturing global semantics through a transformer patch strategy. (3) Curriculum contrastive learning framework. The intra-session and inter-session item representations are used as the two views of the contrastive learning framework. The objective is to maximize the mutual information between the two views, while better negative samples are selected based on curriculum learning to further enhance the performance of the contrastive task. (4) Fusion representation learning based on a multi-task strategy. In this module, the recommendation task and the contrastive learning task are unified and jointly trained under a multi-task strategy.

3.1. Formulation

Assume that $V = \{v_1, v_2, \ldots, v_m\}$ is the item set, where $m$ denotes the number of items. The set of sessions is denoted as $S = \{s_1, s_2, \ldots, s_n\}$, where $n$ denotes the number of sessions. Any session in the session dataset is denoted as $s = \{v_{s,1}, v_{s,2}, \ldots, v_{s,N}\}$, where $N$ denotes the number of items in session $s$ and $v_{s,\zeta} \in V$ $(1 \le \zeta \le m)$ denotes the item of the $\zeta$-th interaction in session $s$. All items are embedded in the same space, denoted as $x \in \mathbb{R}^d$.
Given a session, the goal of EGDLTP-CCL is to output a probability distribution for each candidate item, where each element of the distribution represents the probability that the item will be recommended to the user, and finally, the recommended items are output based on the above.

3.2. Inter-Session Item Transition Pattern Representation Learning

Existing works on inter-session item transition modeling prefer the traditional encoding approach of stacking multiple graph convolution layers. Considering the many challenges of this multilayer graph convolution encoding [33], we propose an energy-constrained graph diffusion to learn inter-session item transition patterns via graph diffusion learning in the session recommendation scenario. The steps are as follows:
$$D_{e_{n1}, e_{n2}} = \mathrm{Dis}\big(x_{e_{n1}}, x_{e_{n2}}\big), \quad e_{n1}, e_{n2} \in \{1, 2, \ldots, m\}$$
$$A_{e_{n1}, e_{n2}} = \mathbb{1}\big\{D_{e_{n1}, e_{n2}} \in \text{top-}k\big\}, \quad e_{n1}, e_{n2} \in \{1, 2, \ldots, m\}$$
$$\bar{A} = \mathrm{Norm}\Big(\sum_{n_1=1}^{N} \sum_{n_2=1}^{N} A_{e_{n1}, e_{n2}}\Big), \quad e_{n1}, e_{n2} \in \{1, 2, \ldots, m\}$$
where $\mathrm{Dis}(\cdot)$ denotes the Euclidean distance between the global graph nodes $e_{n1}$ and $e_{n2}$. The embedding matrix $x \in \mathbb{R}^{m \times D}$ realizes the feature embedding of all $m$ item nodes. The top-$k$ nearest nodes are selected and connected to obtain the initial sparse adjacency matrix $A$, which is then normalized to obtain the final initial adjacency matrix $\bar{A}$.
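To make the construction concrete, the following is a minimal PyTorch sketch of Equations (1)-(3) under our own assumptions: Euclidean distance via `torch.cdist` and symmetric degree normalization standing in for the $\mathrm{Norm}(\cdot)$ operator. Function and variable names are illustrative, not the authors' code.

```python
import torch

def build_global_adjacency(x: torch.Tensor, k: int) -> torch.Tensor:
    """Build the top-k sparse, normalized global adjacency (Eqs. (1)-(3)).

    x: (m, d) item embeddings; k: number of neighbors kept per node.
    """
    # Eq. (1): pairwise Euclidean distances between all item nodes.
    dist = torch.cdist(x, x)                         # (m, m)
    # Eq. (2): connect each node to its k nearest neighbors only.
    knn = dist.topk(k + 1, largest=False).indices    # +1 skips the self-distance
    A = torch.zeros_like(dist)
    A.scatter_(1, knn, 1.0)
    A.fill_diagonal_(0.0)                            # drop self-loops
    # Eq. (3): symmetric degree normalization D^{-1/2} A D^{-1/2} as Norm(.).
    deg = A.sum(dim=1).clamp(min=1.0)
    d_inv_sqrt = deg.pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
```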
Next, the L2 regularized weighted dot product diffusion function with energy constraints guides the information interactions between nodes in the diffusion process, and the results are as follows:
$$K^{k,h} = W_K^{k,h} x^k, \quad U^{k,h} = W_U^{k,h} x^k, \quad V^{k,h} = W_V^{k,h} x^k$$
$$K_L^{k,h} = \left[ \frac{K_n^{k,h}}{\big\| K_n^{k,h} \big\|_2} \right]_{n=1}^{m}, \quad U_L^{k,h} = \left[ \frac{U_n^{k,h}}{\big\| U_n^{k,h} \big\|_2} \right]_{n=1}^{m}$$
$$E(x, k; g) = \big\| x - x^k \big\|_F^2 + \nu \sum_{e_{n1}, e_{n2}} g\big( \| x_{e_{n1}} - x_{e_{n2}} \|_2^2 \big), \quad e_{n1}, e_{n2} \in \{1, 2, \ldots, m\}$$
where $W_K^{k,h}, W_U^{k,h}, W_V^{k,h} \in \mathbb{R}^{D \times D}$ denote the parameters of the $h$-th head of the $k$-th layer. $K_n^{k,h}$ denotes the $n$-th row vector of $K_L^{k,h} \in \mathbb{R}^{N \times D}$, which realizes the global propagation of the $h$-th head. $\nu$ represents the weight parameter, and $g$ is a monotonically increasing concave function. The diffusivity $P^k$ is then expressed as follows:
$$P^k = \frac{1}{H} \sum_{h=1}^{H} \mathrm{diag}^{-1}\Big(\mathrm{Norm}\big(U_L^{k,h} \big(K_L^{k,h}\big)^{\top} \omega\big)\Big)\Big(\omega \omega^{\top} V^{k,h} + U_L^{k,h} \big(K_L^{k,h}\big)^{\top} V^{k,h}\Big)$$
where $\omega \in \mathbb{R}^{N \times 1}$ is an all-ones vector, $\mathrm{diag}(\cdot)$ denotes diagonalization, and $\mathrm{Norm}(\cdot)$ denotes the regularization constraint. $P^k$ denotes the diffusion probability between different item nodes.
Thus, the diffusion process is as follows:
$$\Delta^k = \sum_{n}^{N} \sum_{m}^{N} P^k \odot \bar{A}^k_{n,m}, \qquad x^{k+1} = \big(1 - \delta\,\Delta^k\big)\, x^k + \delta\, \Delta^k x^k, \qquad \text{s.t.}\; E(x^{k+1}, k; g) \le E(x^k, k-1; g), \; k \ge 1$$
$x^{k+1}$ denotes the updated item embedding. $\delta$ is a hyperparameter indicating the probability that the previous layer's features are retained in the current layer, and $E$ denotes the energy constraint. After $k$ layers of graph diffusion, $x^{k+1}$ yields the global representation $X_t^g$ of all items across sessions. The inter-session item transition pattern representation $s^g = \frac{1}{m} \sum_{t=1}^{m} X_t^g$ is obtained by mean pooling the item embeddings of the whole session.
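The diffusion layer can be sketched as follows. This is a simplified single-head rendering of Equations (4)-(8) under stated assumptions: the diag-normalized attention of Equation (7) is approximated by a row-wise softmax, the residual update uses the common form $x \leftarrow (1-\delta)x + \delta\,\Delta^k x$, and the explicit energy check of Equation (8) is omitted.

```python
import torch
import torch.nn.functional as F

def diffusion_step(x, A_bar, Wk, Wu, Wv, delta=0.2):
    """One energy-guided diffusion layer (single-head sketch of Eqs. (4)-(8)).

    x: (m, d) node features; A_bar: (m, m) normalized adjacency;
    Wk, Wu, Wv: (d, d) projections for one head (Eq. (4)).
    """
    K = F.normalize(x @ Wk, dim=-1)          # Eq. (5): row-wise L2 normalization
    U = F.normalize(x @ Wu, dim=-1)
    V = x @ Wv
    # Eq. (7), simplified: row-normalized similarity as the diffusivity P^k.
    P = torch.softmax(U @ K.t(), dim=-1)
    # Eq. (8): graph-masked propagation followed by a residual update step.
    delta_k = (P * A_bar) @ V                # aggregate multi-hop information
    return (1.0 - delta) * x + delta * delta_k
```

Stacking this step over several layers realizes the layer-by-layer regularized information transfer described above.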

3.3. Intra-Session Item Transition Pattern Representation Learning

Existing works prefer a simple temporal encoding approach when learning intra-session item transition pattern representations, which does not adequately consider the extent to which time-series variations affect users' behavioral intentions [34]. Meanwhile, asynchronous timestamps can introduce noise by incorporating irrelevant user intentions, leading to poor robustness of the session representation and degrading recommendation performance. Consequently, we propose a patch-enhanced gated neural network to learn intra-session item transition pattern representations. This approach aims to learn the user's exact intention by comprehensively and accurately capturing intra-session temporal dependencies through a transformer patch strategy.
First, consistent with the literature [32], a factor-based initialization of the items of the current session is performed to achieve a finer-grained characterization of the items using factor features. The session sequence $s = \{v_{s,1}, \ldots, v_{s,\tau}, \ldots, v_{s,n}\}$ is constructed as a session graph $G_s = (V_s, \Lambda_s)$, where $V_s$ and $\Lambda_s$ denote the node set and edge set of the session $s$, respectively. $V_s$ consists of the items the user has clicked within the current session $s$. The initial embedding of the current session is $s_x = \{x_{s,1}^{(0)}, x_{s,2}^{(0)}, \ldots, x_{s,n}^{(0)}\}$, where the initial embedding of the $\tau$-th item in session $s$ is denoted as $\bar{x}_{s,\tau}^{(0)} \in \mathbb{R}^d$. Each item embedding is projected into $\Omega$ different subspaces, where each subspace contains a latent factor feature of the item; the factor feature $h_{v_\tau,k}$ characterizes the item at a finer granularity, as follows:
$$h_{v_\tau, k} = \sigma\big(W_k^{\top} \bar{x}_{s,\tau}^{(0)} + b_k\big)$$
where $W_k \in \mathbb{R}^{d \times \frac{d}{\Omega}}$ and $b_k \in \mathbb{R}^{\frac{d}{\Omega}}$ are learnable parameters and $\sigma(\cdot)$ is the sigmoid activation function. The factor embeddings use $L_2$ regularization and a dropout strategy. Therefore, the initialized embedding of the $\tau$-th item $v_\tau$ is $x_{s,\tau}^{(0)} = [h_{v_\tau,1}^{(0)}, \ldots, h_{v_\tau,k}^{(0)}, \ldots, h_{v_\tau,\Omega}^{(0)}] \in \mathbb{R}^d$, and the session representation can be updated to $s_x = \{x_{s,1}, \ldots, x_{s,\tau}, \ldots, x_{s,n}\}$.
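A minimal sketch of this factor-based initialization, assuming $\Omega$ independent linear projections of size $d/\Omega$ followed by a sigmoid and dropout as stated above; the class name and dropout rate are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FactorInit(nn.Module):
    """Project each item embedding into Omega factor subspaces (Eq. (9))."""

    def __init__(self, d: int, omega: int, dropout: float = 0.1):
        super().__init__()
        assert d % omega == 0, "embedding dim must split evenly into factors"
        self.proj = nn.ModuleList([nn.Linear(d, d // omega) for _ in range(omega)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        # x0: (n, d) raw item embeddings of the current session.
        factors = [torch.sigmoid(p(x0)) for p in self.proj]   # Eq. (9), per factor
        # Concatenate the factor features back to (n, d); dropout regularizes.
        return self.drop(torch.cat(factors, dim=-1))
```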
Patch-enhanced gating neural network. We design a patch-enhanced gating neural network to capture the temporal relationships between different items after completing the item factor embedding initialization, and to further accurately learn the transition relationships between items. The patch-enhanced gating neural network consists of a GGNN and a patch-enhanced LSTM.
Gated graph neural network. First, we utilize the GGNN model to learn the complex transitions between different item embeddings. Specifically, the item factor embeddings are first used for message propagation on the intra-session item graph. The item factors are treated as the smallest representation units in the message propagation process, and the GGNN then learns the item representations factor by factor, as follows:
$$\rho_{v_t,k}^{i} = \mathrm{Concat}\big(A_{s,k,v_t}^{in}\, x_{s,\tau}^{i-1} W_{in} + b_{in},\; A_{s,k,v_t}^{out}\, x_{s,\tau}^{i-1} W_{out} + b_{out}\big)$$
$$\theta_{v_t,k}^{i} = \sigma\big(W_z \rho_{v_t,k}^{i} + U_z h_{v_t,k}^{i-1}\big), \quad r_{v_t,k}^{i} = \sigma\big(W_r \rho_{v_t,k}^{i} + U_r h_{v_t,k}^{i-1}\big),$$
$$\tilde{h}_{v_t,k}^{i} = \tanh\big(W_o \rho_{v_t,k}^{i} + U_o (r_{v_t,k}^{i} \odot h_{v_t,k}^{i-1})\big), \quad h_{v_t,k}^{i} = (1 - \theta_{v_t,k}^{i}) \odot h_{v_t,k}^{i-1} + \theta_{v_t,k}^{i} \odot \tilde{h}_{v_t,k}^{i}$$
where $h_{v_t,k}^{i-1}$ denotes the embedding of item $v_t$ for the $k$-th factor in the $(i-1)$-th GGNN layer, $\theta_{v_t,k}^{i}$ and $r_{v_t,k}^{i}$ are the update and reset gates, $A_{s,k,v_t}^{in}$ and $A_{s,k,v_t}^{out}$ denote the $t$-th rows of the matrices $A_{s,k}^{in}$ and $A_{s,k}^{out}$, respectively, and $b_{in}, b_{out} \in \mathbb{R}^{d/K}$ are learnable parameters. The representation of the item factor $h_{v_t,k}^{i-1}$ in the $(i-1)$-th layer is then combined with its neighbors in the $i$-th layer to represent the transition information passed from different neighboring items in the session to the target item.
Patch-enhanced LSTM. We propose a patch-enhanced LSTM network to capture the temporal behavior of item transition patterns. We first generate a patch-based temporal encoding function and then fuse it with an LSTM to obtain a patch-enhanced LSTM network that reflects the monotonic change in user intention as it evolves over time.
First, we propose a patch-transformer-based temporal encoding function that maps timestamps into a low-dimensional embedding representation. Initially, the time series is segmented into multiple patches along the temporal dimension of the timestamps. Next, the obtained patches are independently fed into the transformer framework, whose outputs are concatenated and integrated to yield the final temporal encoding function: $\Gamma_t = \hat{t}_i \in \mathbb{R}^{1 \times t}$.
The $i$-th univariate time series of length $L$ is represented as follows:
$$t_{1:L}^{(i)} = \big( t_1^{(i)}, \ldots, t_L^{(i)} \big), \quad i = 1, \ldots, \gamma$$
Thus, $(t_1, \ldots, t_L)$ is split into $\gamma$ univariate series with $t^{(i)} \in \mathbb{R}^{1 \times L}$. The patch length is denoted as $\xi$, and the stride, i.e., the non-overlapping offset between two consecutive patches, is denoted as $\phi$. Then, the patch sequence $t_\xi^{(i)} \in \mathbb{R}^{\xi \times \lambda_t}$ is generated from the patches, where $\lambda_t$ is the number of time patches, given by:
$$\lambda_t = \left\lfloor \frac{L - \xi}{\phi} \right\rfloor + 2$$
By using patches, the number of input tokens drops to a relatively low level. Before splitting, $\phi$ repeated copies of the last value $t_L^{(i)}$ are padded to the end of the original sequence. The nuisance caused by asynchronous timestamps is also mitigated because the patches are re-partitioned for aggregation.
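The patching step can be written compactly with `torch.Tensor.unfold`; the sketch below pads $\phi$ copies of the last value before splitting so that the patch count matches Equation (13). Names and the tensor layout are our own illustrative choices.

```python
import torch

def make_patches(t: torch.Tensor, xi: int, phi: int) -> torch.Tensor:
    """Split one univariate timestamp series into patches (Eqs. (12)-(13)).

    t: (L,) series; xi: patch length; phi: stride.
    Returns (lambda_t, xi) patches, where lambda_t = floor((L - xi)/phi) + 2.
    """
    # Pad phi copies of the last value so the final patch is complete.
    t_pad = torch.cat([t, t[-1].repeat(phi)])
    # unfold produces sliding windows of length xi with stride phi.
    return t_pad.unfold(0, xi, phi)
```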
Then, we use the vanilla transformer framework [35], with a positional encoding added to each input patch to give $t_d^{(i)}$, as follows:
$$t_d^{(i)} = W_\xi\, t_\xi^{(i)} + W_{position}$$
The transformer framework is then expressed as follows, with the final output $O_h^{(i)} \in \mathbb{R}^{D \times N}$:
$$Q_h^{(i)} = \big(t_d^{(i)}\big)^{\top} W_h^{Q}, \quad K_h^{(i)} = \big(t_d^{(i)}\big)^{\top} W_h^{K}, \quad V_h^{(i)} = \big(t_d^{(i)}\big)^{\top} W_h^{V}$$
$$\big(O_h^{(i)}\big)^{\top} = \mathrm{Attention}\big(Q_h^{(i)}, K_h^{(i)}, V_h^{(i)}\big) = \mathrm{Softmax}\left(\frac{Q_h^{(i)} \big(K_h^{(i)}\big)^{\top}}{\sqrt{d_k}}\right) V_h^{(i)}$$
where $d_k$ is the time dimension of each patch and $h = 1, \ldots, H$ indexes the heads of the transformer framework. $Q_h^{(i)}$ denotes the query matrix, $K_h^{(i)}$ the key matrix, and $V_h^{(i)}$ the value matrix, with $W_h^{K}, W_h^{Q} \in \mathbb{R}^{D \times d_k}$ and $W_h^{V} \in \mathbb{R}^{D \times D}$.
Then, the above temporal encoding function is fused into the LSTM network:
$$f = \sigma\big(W_f h_{v_t,k}^{i} + U_f (h_{v_t,k}^{i-1} + \Gamma(t))\big), \quad \mu_f = \sigma\big(W_i h_{v_t,k}^{i} + U_i (h_{v_t,k}^{i-1} + \Gamma(t))\big),$$
$$\hat{h}_{v_t,k}^{i} = \tanh\big(W_t h_{v_t,k}^{i} + U_t (h_{v_t,k}^{i-1} + \Gamma(t))\big), \quad h_{v_t,k} = f \odot h_{v_t,k}^{i-1} + \mu_f \odot \hat{h}_{v_t,k}^{i}$$
where $W_f, W_i, U_f, U_i, U_t \in \mathbb{R}^{d_k \times d_k}$ are learnable parameters, $\Gamma(\cdot)$ is the temporal encoding function describing the temporal ordering relationship, and $\Gamma(t)$ denotes the $t$-th value of the timestamp embedding. The item representation $x_{s,\varepsilon} = [h_{v_t,1}, \ldots, h_{v_t,k}, \ldots, h_{v_t,N}]$ is obtained, where $x_{s,\varepsilon}$ denotes the $\varepsilon$-th item embedding in the current session $s$.
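One gating step of Equation (17) can be sketched as below, assuming the temporal code $\Gamma(t)$ is simply added to the previous hidden state before each gate; the two-gate cell mirrors the form written in Equation (17) rather than a full standard LSTM, and all names are illustrative.

```python
import torch
import torch.nn as nn

class PatchGatedCell(nn.Module):
    """Two-gate recurrent cell fused with a temporal code (sketch of Eq. (17))."""

    def __init__(self, d: int):
        super().__init__()
        self.Wf, self.Uf = nn.Linear(d, d), nn.Linear(d, d)   # forget gate
        self.Wi, self.Ui = nn.Linear(d, d), nn.Linear(d, d)   # input gate
        self.Wt, self.Ut = nn.Linear(d, d), nn.Linear(d, d)   # candidate state

    def forward(self, h_cur, h_prev, gamma_t):
        # Every gate sees the previous state shifted by the temporal code.
        f = torch.sigmoid(self.Wf(h_cur) + self.Uf(h_prev + gamma_t))
        mu = torch.sigmoid(self.Wi(h_cur) + self.Ui(h_prev + gamma_t))
        h_hat = torch.tanh(self.Wt(h_cur) + self.Ut(h_prev + gamma_t))
        return f * h_prev + mu * h_hat        # last line of Eq. (17)
```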
We learn the embedded representations of all items in each session through a self-attention mechanism and pooling operations to obtain a fine-grained session representation, as follows:
$$\eta_t = \mathrm{softmax}\left(\frac{Q_\eta K_\eta^{\top}}{\sqrt{d_k}}\right) V_\eta + x_{s,\varepsilon}$$
where $\eta_t$ is the output of the self-attention computation and $x_{s,\varepsilon}$ is the input item embedding required to generate the final session embedding. $Q_\eta = x_{s,\varepsilon} W_\eta^Q$ is the query matrix, $K_\eta = x_{s,\varepsilon} W_\eta^K$ is the key matrix, $V_\eta = x_{s,\varepsilon} W_\eta^V$ is the value matrix, $W_\eta^Q, W_\eta^K, W_\eta^V$ are weight matrices, and $\mathrm{softmax}(\cdot)$ is the activation function.
Eventually, the item embeddings within the entire session are subjected to the above score-weighted mean pooling operation $s^l = \frac{1}{m} \sum_{t=1}^{m} \eta_t x_{s,\varepsilon}$ to obtain the intra-session item transition pattern representation $s^l = \{\tilde{x}_{s,1}, \ldots, \tilde{x}_{s,t}, \ldots, \tilde{x}_{s,n}\}$, where the factor embedding of each item can be represented as $\tilde{x}_{s,t} = [h''_{v_t,1}, \ldots, h''_{v_t,k}, \ldots, h''_{v_t,K}]$.
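A hedged sketch of the self-attention readout and score-weighted mean pooling above, assuming square $d \times d$ projection matrices so the residual addition is shape-compatible; the exact pooling weights in the paper may differ from this simplification.

```python
import torch
import torch.nn.functional as F

def session_readout(x, Wq, Wk, Wv):
    """Residual self-attention plus score-weighted mean pooling (sketch).

    x: (n, d) item embeddings of one session; Wq, Wk, Wv: (d, d) projections.
    Returns a single (d,) intra-session representation.
    """
    d_k = Wk.shape[1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Residual self-attention over the session's items.
    eta = F.softmax(q @ k.t() / d_k ** 0.5, dim=-1) @ v + x
    # Score-weighted mean pooling of the item embeddings.
    return (eta * x).mean(dim=0)
```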

3.4. Curriculum Contrastive Representation Learning

3.4.1. Sampling of Curriculum Learning

First, $\tilde{z}^l$ (or $\tilde{z}^g$) is a negative sample obtained by performing a column transformation on $s^l$ (or $s^g$). Each negative sample embedding $\tilde{z}^l$ (or $\tilde{z}^g$) is scored by the function $\mathrm{Score}(\cdot)$. For the scoring function, we use the similarity $\mathrm{sim}(s^l, \tilde{z}^l)$ to measure the difficulty of a negative sample, thus obtaining an accurate ranking of the negative samples with respect to the positive sample, where the similarity $\mathrm{sim}(\cdot)$ is computed with cosine similarity as follows:
$$\mathrm{Score}(\tilde{z}^l) = \mathrm{sim}(s^l, \tilde{z}^l) = \frac{s^l \cdot \tilde{z}^l}{\|s^l\| \, \|\tilde{z}^l\|}$$
where $s^l$ is the positive-sample embedding corresponding to $\tilde{z}^l$.
Next, after obtaining the scores of the individual negative sample embeddings $\tilde{z}^l$, the pacing function $\Upsilon(\psi)$ is used to feed negative samples into the training process sequentially. The pacing function $\Upsilon(\psi)$ specifies the number of negative samples used at each step $\psi$: the first step consists of the lowest-scoring negative samples, and step $\psi$ consists of the negative samples with the lowest $\Upsilon(\psi)$ scores. Each batch of negative samples is sampled uniformly from this set, and $\Delta$ denotes the total number of training steps:
$$\Upsilon(\psi) = (\psi / \Delta)^{\iota} \cdot \varpi$$
where $\iota$ is a smoothing parameter that controls the speed at which the pacing function guides the training process and $\varpi$ denotes the total number of negative samples. During contrastive learning, the negative samples are sorted from the lowest to the highest score according to the scoring function and introduced into training in that order, which makes the contrastive framework converge faster and perform better.
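A sketch of the curriculum sampler described above, assuming a power-law pacing schedule; since the exact exponent form of the pacing function is not fully recoverable here, `iota` and the function name are illustrative choices.

```python
import torch
import torch.nn.functional as F

def curriculum_pool(s_l, negatives, step, total_steps, iota=1.0):
    """Select the Upsilon(step) easiest negatives for this training step.

    s_l: (d,) anchor session embedding; negatives: (n_neg, d) shuffled views.
    Pacing: Upsilon(psi) = (psi / Delta) ** iota * n_neg (easy-to-hard).
    """
    # Score(.) of Eq. (18): cosine similarity to the positive embedding.
    scores = F.cosine_similarity(s_l.unsqueeze(0), negatives, dim=-1)
    order = scores.argsort()                 # lowest similarity = easiest first
    n_take = max(1, int((step / total_steps) ** iota * negatives.size(0)))
    return negatives[order[:n_take]]         # pool to sample batches from
```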

3.4.2. The Loss of Contrastive Learning Framework

Consistent with the literature [2], we utilize InfoNCE to construct the loss function of the proposed model based on the binary cross-entropy between positive and corrupted (negative) samples, defined as:
$$\mathcal{L}_{cl} = -\log \sigma\big(f_{cl}(s^l, s^g)\big) - \log \sigma\big(1 - f_{cl}(\tilde{z}^l, s^g)\big)$$
$$f_{cl}(s^l, s^g) = \sigma\big(s^l \cdot (s^g)^{\top}\big)$$
where $\tilde{z}^l$ (or $\tilde{z}^g$) is the negative sample obtained after row and column shuffling of the corrupted $s^l$ (or $s^g$). We design $f_{cl}(\cdot)$ in Equation (22) as a dot-product function between the two input vectors to better score the two views. The learning objective of Equation (21) is to maximize the mutual information between the session embeddings learned in the two views and to maximize the consistency between positive samples.
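A numerically stable sketch of Equations (21)-(22), using `logsigmoid` on the dot-product logits rather than composing two sigmoids; this is our simplification, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(s_l, s_g, z_l):
    """Cross-view contrastive loss (sketch of Eqs. (21)-(22)).

    s_l, s_g: (b, d) session embeddings from the two views;
    z_l: (b, d) corrupted (negative) local-view embeddings.
    """
    pos = (s_l * s_g).sum(-1)                # Eq. (22): dot-product critic
    neg = (z_l * s_g).sum(-1)
    # Pull positives together, push corrupted samples away.
    return -(F.logsigmoid(pos) + F.logsigmoid(-neg)).mean()
```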

3.5. Prediction and Optimization Based on Fusion Representation of Multi-Task Strategies

3.5.1. Prediction

Consistent with the literature [2], we construct the recommendation-task loss function based on the intra-session item transition model. After obtaining the intra-session representation and the item representations, a factor-by-factor dot product between the target session $s$ and the candidate items $x_{s,\varepsilon}$ computes the scores $\hat{z}_{s,\varepsilon}$ of all candidate items.
$$\hat{z}_{s,\varepsilon} = \sum_{k=1}^{K} s \cdot x_{s,\varepsilon}$$
$$\hat{y}_{\varepsilon} = \mathrm{Softmax}(\hat{z}_{s,\varepsilon})$$
where $\hat{y}_{\varepsilon}$ denotes the predicted probability of the $\varepsilon$-th candidate item in session $s$.

3.5.2. Optimization

Next, for each session, we use cross-entropy to estimate the loss function for the recommendation task as follows:
$$\mathcal{L}_{rec} = -\sum_{\varepsilon=1}^{n} \Big( y_\varepsilon \log(\hat{y}_\varepsilon) + (1 - y_\varepsilon) \log(1 - \hat{y}_\varepsilon) \Big)$$
where $y_\varepsilon$ is the one-hot encoding of the ground-truth item. The recommendation task and the contrastive learning task are then combined under the multi-task strategy, and the final loss function of EGDLTP-CCL is defined as:
$$\mathcal{L}_t = \mathcal{L}_{rec} + \lambda_L\, \mathcal{L}_{cl}$$
where $\lambda_L$ is the multi-task strategy parameter. The pseudocode for the training process of the EGDLTP-CCL model is shown in Algorithm 1.
Algorithm 1 The learning algorithm for the EGDLTP-CCL model
Input: Item data V, session data S.
Output: Recommendation list.
for each epoch do:
   for the inter-session item transition pattern representation learning module do:
      Calculate the adjacency matrix by Equations (1)–(3).
      Calculate the diffusion rate by Equations (4)–(6).
      Perform diffusion map data embedding and updating by Equations (7) and (8).
   end
   for the intra-session item transition pattern representation learning module do:
      Fine-grained item feature representation by Equation (9).
      Learning complex transitions between different item embeddings based on a gated graph neural network by Equations (10) and (11).
      Capturing time behavior of item transition patterns based on patch-enhanced LSTM network by Equations (12)–(17).
   end
   for each session do:
      Sampling and training based on curriculum learning by Equations (18)–(20).
      Calculation of the contrastive learning loss for both inter-session and intra-session representations by Equations (21) and (22).
      Predicting the probability of the next item by Equations (23) and (24).
      Obtain the next predicted loss by Equation (25).
   end
   Joint optimization of the overall objective in Equation (26).
end
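Complementing Algorithm 1, the following is a minimal PyTorch sketch of one joint training step for the objective in Equation (26); `model` and `batch` are hypothetical placeholders for the full EGDLTP-CCL pipeline, and the forward-pass outputs are illustrative names.

```python
import torch
import torch.nn.functional as F

def train_step(model, batch, optimizer, lambda_l=0.01):
    """One joint optimization step of Eq. (26).

    `model` is assumed to return item scores plus the two view embeddings
    and the corrupted negatives; all names here are placeholders.
    """
    optimizer.zero_grad()
    scores, s_l, s_g, z_l = model(batch["session"])
    loss_rec = F.cross_entropy(scores, batch["target"])          # Eq. (25)
    pos = (s_l * s_g).sum(-1)                                    # Eq. (22)
    neg = (z_l * s_g).sum(-1)
    loss_cl = -(F.logsigmoid(pos) + F.logsigmoid(-neg)).mean()   # Eq. (21)
    loss = loss_rec + lambda_l * loss_cl                         # Eq. (26)
    loss.backward()
    optimizer.step()
    return loss.item()
```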

3.5.3. Time Complexity

The time complexity of the graph diffusion mainly comes from graph construction and computing the diffusivity. The computational complexity of graph construction is consistent with GCN and can be optimized to $O(n^2)$, where $n$ is the number of nodes. The diffusivity is based on the spatial feature embedding and the attention mechanism, which generates $\lambda_\varpi$ queries; the time complexity of the diffusivity is $O(\lambda_\varpi \cdot \tau \cdot \zeta_n^2)$, where $\tau$ denotes the dimension of the attention mechanism and $\zeta_n$ is the dimension of the feature vector. Information between nodes is then updated over multiple diffusion layers, where $\varsigma$ denotes the number of graph diffusion layers. Therefore, the total time complexity of the graph diffusion is $O(n^2 + \lambda_\varpi \cdot \varsigma \cdot \tau \cdot \zeta_n^2)$.

4. Experiment

4.1. Datasets

We use four real-world datasets for performance evaluation, as follows:
  • The Yoochoose dataset is from the 2015 RecSys Challenge and contains click data of users on an e-commerce website.
  • The RetailRocket dataset is user activity data recorded by a personalized e-commerce company over a six-month period.
  • The Tmall dataset contains anonymized behavioral logs of Tmall users over a six-month period and was published in the IJCAI 2015 competition.
  • The Diginetica dataset is from the CIKM Cup 2016 and contains over six months of user click-through data.
Specifically, for each dataset, we remove sessions containing only one item, as well as infrequent users and items that appear fewer than five times. The data from the last week of each dataset are used as the test set, and the rest as the training set. Subsequently, if the interval between two consecutive items in a user's interaction history is less than 6 h, they are placed into the same session; if the interval is greater than 6 h, they are put into two different sessions, which yields a better segmentation. In addition, we enrich the data by segmenting the sequences and generating the corresponding labels for the training and test sets. Following the literature [36,37], we apply data augmentation to the sequence data: the method generates multiple labeled sub-sequences, each labeled with its last clicked item. Table 1 shows the statistics of each preprocessed dataset.

4.2. Metric

Two evaluation metrics are chosen for the experiment.
  • $P@k$ is the percentage of ground-truth items among the top-$k$ generated recommendations, which measures the accuracy of the model's predictions:
$$P@k = \frac{n_{hit}}{N_p}$$
where $N_p$ denotes the number of sessions and $n_{hit}$ is the number of ground-truth items in the generated recommendation list.
  • $MRR@k$ (Mean Reciprocal Rank) is the average of the reciprocal ranks of the ground-truth items in the generated top-$k$ recommendation list, which measures the positional quality of the recommendations. A larger value indicates that the ground-truth items are ranked higher, meaning the model performs better:
$$MRR@k = \frac{1}{N_p} \sum_{i=1}^{N_p} \frac{1}{rank_i}$$
where $rank_i$ is the position of the ground-truth item in the list predicted for the $i$-th session.
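The two metrics can be computed from a score matrix as follows; this is a generic sketch (tensor shapes and names are our own assumptions), not the authors' evaluation code.

```python
import torch

def evaluate(scores: torch.Tensor, targets: torch.Tensor, k: int = 20):
    """Compute P@k and MRR@k from a score matrix (generic sketch).

    scores: (n_sessions, n_items) predicted scores; targets: (n_sessions,)
    indices of the ground-truth next items.
    """
    topk = scores.topk(k, dim=-1).indices            # (n, k) ranked item ids
    hits = topk.eq(targets.unsqueeze(-1))            # True where target found
    p_at_k = hits.any(-1).float().mean().item()
    ranks = hits.float().argmax(-1) + 1              # 1-based rank if hit
    mrr_k = (hits.any(-1).float() / ranks).mean().item()  # 0 if not in top-k
    return p_at_k, mrr_k
```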

4.3. Experimental Setup

Baselines. We selected nine state-of-the-art session recommendation methods and compared the proposed model with them, as described below:
  • GRU4REC [18] is an RNN-based session recommendation that aims to capture sequential dependency patterns between items in a session.
  • NARM [19] is a recommendation model that integrates an attention mechanism on an RNN-based approach to enhance the sequence representation capability of RNN models.
  • STAMP [20] is a recommendation model that integrates attention mechanisms in MLP to fuse users’ long and short-term interests.
  • SR-GNN [21] is a GNN-based session recommendation method and effectively combines users’ general interests and short-term preferences.
  • TAGNN [22] proposes a target-aware attention network to model the dynamic process of user interest with respect to different target items in order to generate recommendations.
  • DHCN [2] is a hypergraph-based approach to capture higher-order information between items, and it also introduces contrastive learning to integrate information between different sessions.
  • Disen-GNN [32] is a session recommendation method for disentangled GNNs, where the aim of disentanglement learning is to model factor-level features of items.
  • HIDE [36] introduces hypergraphs into session recommendations that comprehensively capture user preferences from different perspectives.
  • MSGAT [37] brings two-channel sparse attention networks into session recommendation to mitigate in-session noise, which models user preferences based on two parallel channels integrating intra- and inter-session information.
The parameter settings of EGDLTP-CCL in the experiments are as follows. For the baselines, all parameters are initialized from a Gaussian distribution with mean 0 and standard deviation 0.1. The batch size is set to 100 and the L2 regularization to $10^{-5}$; the Adam optimization algorithm is chosen, with the learning rate set to 0.001. The parameter $\lambda_L$ of the multi-task strategy is selected from $\{0.1, 0.01, 0.001, 0.0001\}$.

4.4. Analysis of Experimental Results

4.4.1. Overall Comparison

The results of the overall comparison in Table 2 show that the EGDLTP-CCL model achieves superior recommendation performance, from which the following conclusions can be drawn:
GRU4REC uses a gated recurrent unit to capture sequential information in session sequences and achieves good results. However, its poorer performance on the Tmall dataset may be due to the inability of gated recurrent units alone to handle the dynamic drift of user preferences; additional time-signal modeling is therefore needed to strengthen user-preference modeling. NARM performs much better than GRU4REC because it inherits GRU4REC's ability to capture sequential information and integrates an attention mechanism on top of it, which enables the model to distinguish how differently the items in a session contribute to the user's overall preference. STAMP replaces the RNN structure with an MLP and, at the same time, takes into account the varying importance of the user's long- and short-term interests, thus achieving more advanced recommendation performance.
All the GNN-based models achieve better performance. Unlike SR-GNN, TAGNN proposes a target-aware attention mechanism that considers the impact of the predicted target candidate items on all items within the session. DHCN outperforms TAGNN and SR-GNN because it further introduces hypergraphs into session recommendation, which not only captures higher-order information between items but also integrates inter-session information through self-supervised contrastive learning. Disen-GNN considers the multiple mixed factors influencing a user's preferences and comprehensively models the user's latent intent through the disentanglement technique. HIDE explores more complex transition relationships by using hypergraphs to model higher-order relations, but it ignores the importance of inter-session information. MSGAT employs a two-channel sparse attention network to model intra- and inter-session information, which is further integrated to capture user preferences with effective noise reduction.
EGDLTP-CCL significantly outperforms baselines on all datasets. For the above experimental results, we analyze them from the following aspects:
First, EGDLTP-CCL models session sequences at both the global and local levels. From the global perspective, we consider the different effects that exist between different sessions; when learning global-level item representations with the energy-constraint-based graph diffusion technique, the newly designed energy constraint allows the structure of the real inter-session item transition patterns to be modeled more accurately. Second, for learning intra-session item transition pattern representations, the patch-enhanced gating network designed in this paper captures the sequential information of the items in a session well, which demonstrates the importance of modeling the effect of temporal variation in session recommendation, because session sequences are inherently tightly correlated with temporal signals. Third, the proposed curriculum learning optimization strategy can select better negative samples.
Meanwhile, EGDLTP-CCL achieves leading performance on all four datasets, but its performance on the Yoochoose dataset is relatively lower than on the Tmall and RetailRocket datasets. A possible reason is that both the Tmall and RetailRocket datasets were collected in real e-commerce environments and have higher data consistency, suggesting that our model is better adapted to real-world e-commerce environments. The Diginetica data come from the 2016 CIKM Cup, so the data source differs, but the session lengths in this dataset are similar to those of RetailRocket; thus, the final result of EGDLTP-CCL remains excellent, which also indicates the good robustness of EGDLTP-CCL.

4.4.2. Ablation Analysis

To verify the validity of each module of EGDLTP-CCL, we conduct ablation experiments; the results are shown in Figure 2. EGDLTP-CCL-1 denotes the removal of the energy-constrained graph diffusion learning module; EGDLTP-CCL-2 denotes the removal of the patch-enhanced gating network module; EGDLTP-CCL-3 denotes the removal of the curriculum contrastive learning task module; and EGDLTP-CCL-4 denotes the removal of the curriculum learning strategy from the contrastive learning framework.
The results are shown in Figure 2. The following conclusions are obtained:
1.
EGDLTP-CCL-2 has the worst performance. When the patch-enhanced gated network module is removed, a large amount of important item information within the session remains hidden and contextual temporal information is ignored. This shows that accurately capturing the time-varying correlations in intra-session item transition patterns is also important for improving session recommendation performance.
2.
EGDLTP-CCL-1 ranks fourth among the five models. When the energy-constrained graph diffusion learning module is removed and item transition patterns are learned only by the patch-enhanced gated neural network within sessions, the model cannot adequately fit the inter-session item transition patterns, degrading the overall performance of EGDLTP-CCL.
3.
EGDLTP-CCL-3 is third among the five models, which confirms that the curriculum contrastive learning task integrates the learned intent into the session representation and further improves performance and robustness.
4.
EGDLTP-CCL-4 is the second among the five models, which shows that the negative sample selection strategy based on curriculum learning can further improve the performance of the contrastive learning task.

4.4.3. Parameter Sensitivity Analysis

In this section, we evaluate the impact of different hyperparameter settings on the performance of EGDLTP-CCL. Specifically, we explore the effects that the number of graph diffusion layers in the graph diffusion representation learning and the model learning rate have on EGDLTP-CCL.
Figure 3 shows the results for different numbers of graph diffusion layers. The Yoochoose and RetailRocket datasets obtain optimal results when the number of graph diffusion layers is two, owing to the model's ability to fully capture the complex relationships of the graph structure at this depth. The number of diffusion layers is thus important for accurately capturing the key information hidden in the item transition patterns.
Figure 4 shows the results for different learning rates. The Yoochoose and RetailRocket datasets achieve better results when the learning rate is 0.001. A possible reason is that if the learning rate is too large, EGDLTP-CCL does not converge, and if it is too small, EGDLTP-CCL converges slowly.
Figure 5 shows the results on the different datasets. Clearly, the impact of $\lambda_L$ on model performance is noticeable: when $\lambda_L = 0.01$, performance reaches its best. A possible reason for the degradation at other values is the gradient conflict between the contrastive learning task and the recommendation task.
In Figure 6, we show the MRR@20 and P@20 results on the Yoochoose and RetailRocket datasets. The curves demonstrate that the values of MRR@20 and P@20 stabilize as the number of epochs increases. The epoch at which each dataset stabilizes differs, which is caused by the different data characteristics of the two datasets.

5. Conclusions

We propose a new session recommendation model (EGDLTP-CCL) that fuses inter-session and intra-session representations to learn item transition patterns and then further improves recommendation performance based on curriculum contrastive learning. Specifically, an energy-constrained graph diffusion method is first utilized to capture inter-session item transition patterns, with the introduced energy-constraint strategy improving the diffusion efficiency. Meanwhile, we develop a patch-enhanced gating network to capture intra-session item transition patterns, which focuses on the temporal impact of inter-item transitions within a session through a well-designed patch transformer strategy. In addition, a curriculum contrastive learning task is designed to maximize the mutual information between the intra-session and inter-session item transition pattern representations, and the curriculum learning strategy helps the contrastive learning framework select better negative samples, further enhancing the performance of the contrastive task. The effectiveness of EGDLTP-CCL is validated through extensive experiments on four datasets.
EGDLTP-CCL does not take into account the dynamic changes of the diffusion boundary during graph diffusion, which limits the model's receptive field and constrains further performance improvement. Meanwhile, the patch method mainly emphasizes capturing the long-term dependencies of long time series and pays little attention to short-term correlations within patches, so it cannot accurately focus on local semantic representations, which also constrains further improvement. In the future, we will consider adaptive graph diffusion models that can better accommodate dynamic changes in the diffusion boundary. At the same time, we will develop patch-based spatio-temporal models capable of capturing features at different time scales, thus capturing global and local semantics at a fine-grained level.
The graph diffusion approach indeed has great potential in a wider range of application scenarios and complex graph structures. The core advantage of the proposed method lies in its ability to model the "local interaction → global propagation" law, which is highly compatible with the operating mechanisms of many complex real-world scenarios. In terms of applications, the proposed method can be used for personalized diffusion enhancement in recommender systems, for transportation networks and resource scheduling in urban computing, and for molecular and disease association analysis in biomedicine. For different types of graph-structured data, the proposed method can be combined with heterogeneity, dynamics, higher-order relationships, and other complex graph-structural features, leading to hypergraph-based higher-order diffusion models, dynamic-graph-based time-aware diffusion models, and multimodal data diffusion.

Author Contributions

Conceptualization, J.L. and R.G.; methodology, J.L. and R.G.; software, X.P. and R.G.; validation, Q.Y. and J.L.; formal analysis, J.L. and L.Y.; investigation, J.L. and R.G.; resources, X.P. and R.G.; data curation, J.L. and X.P.; writing—original draft preparation, R.G. and L.Y.; writing—review and editing, J.H. and X.P.; visualization, J.H. and L.Y.; supervision, X.P. and J.H.; project administration, Q.Y. and J.H.; funding acquisition, R.G. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62472149), the Chunhui Plan Collaborative Research Project, Ministry of Education, China (HZKY20220350).

Data Availability Statement

All data are available publicly online, and they are also available on request from the corresponding author.

Conflicts of Interest

Authors Quanfeng Yao, Xianjun Peng, and Jiwei Hu are employed by the China–Chile ICT Belt and Road Joint Laboratory, Wuhan 430205, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Guo, L.; Yin, H.; Wang, Q.; Chen, T.; Zhou, A.; Quoc Viet Hung, N. Streaming session-based recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1569–1577.
2. Xia, X.; Yin, H.; Yu, J.; Wang, Q.; Cui, L.; Zhang, X. Self-supervised hypergraph convolutional networks for session-based recommendation. AAAI Conf. Artif. Intell. 2021, 35, 4503–4511.
3. Deng, Z.H.; Wang, C.D.; Huang, L.; Lai, J.H.; Yu, P.S. G3SR: Global graph guided session-based recommendation. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 9671–9684.
4. Hidasi, B.; Karatzoglou, A. Recurrent neural networks with top-k gains for session-based recommendations. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; pp. 843–852.
5. Li, Z.; Yang, C.; Chen, Y.; Wang, X.; Chen, H.; Xu, G.; Yao, L.; Sheng, M. Graph and sequential neural networks in session-based recommendation: A survey. ACM Comput. Surv. 2024, 57, 1–37.
6. Wang, Z.; Wei, W.; Cong, G.; Li, X.L.; Mao, X.L.; Qiu, M. Global context enhanced graph neural networks for session-based recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 169–178.
7. Chen, Y.H.; Huang, L.; Wang, C.D.; Lai, J.H. Hybrid-order gated graph neural network for session-based recommendation. IEEE Trans. Ind. Inform. 2021, 18, 1458–1467.
8. Xia, X.; Yin, H.; Yu, J.; Shao, Y.; Cui, L. Self-supervised graph co-training for session-based recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 2180–2190.
9. Wu, L.; Sun, P.; Fu, Y.; Hong, R.; Wang, X.; Wang, M. A neural influence diffusion model for social recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 235–244.
10. Li, Z.; Sun, A.; Li, C. DiffuRec: A diffusion model for sequential recommendation. ACM Trans. Inf. Syst. 2023, 42, 1–28.
11. Wu, L.; Li, J.; Sun, P.; Hong, R.; Ge, Y.; Wang, M. Diffnet++: A neural influence and interest diffusion network for social recommendation. IEEE Trans. Knowl. Data Eng. 2020, 34, 4753–4766.
12. Qin, Y.; Wu, H.; Ju, W.; Luo, X.; Zhang, M. A diffusion model for POI recommendation. ACM Trans. Inf. Syst. (TOIS) 2023, 42, 1–27.
13. Li, J.; Wang, Y.; McAuley, J. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; pp. 322–330.
14. Ye, W.; Wang, S.; Chen, X.; Wang, X.; Qin, Z.; Yin, D. Time matters: Sequential recommendation with complex temporal information. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 1459–1468.
15. Fan, Z.; Liu, Z.; Zhang, J.; Xiong, Y.; Zheng, L.; Yu, P.S. Continuous-time sequential recommendation with temporal graph collaborative transformer. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 433–442.
16. Wu, Z.; Wang, X.; Chen, H.; Li, K.; Han, Y.; Sun, L.; Zhu, W. Diff4rec: Sequential recommendation with curriculum-scheduled diffusion augmentation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 9329–9335.
17. Yang, Y.; Zhang, C.; Zhou, T.; Wen, Q.; Sun, L. DCdetector: Dual attention contrastive representation learning for time series anomaly detection. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 3033–3045.
18. Quadrana, M.; Karatzoglou, A.; Hidasi, B.; Cremonesi, P. Personalizing session-based recommendations with hierarchical recurrent neural networks. In Proceedings of the Eleventh ACM Conference on Recommender Systems, Como, Italy, 27–31 August 2017; pp. 130–137.
19. Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; Ma, J. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 1419–1428.
20. Liu, Q.; Zeng, Y.; Mokhosi, R.; Zhang, H. STAMP: Short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1831–1839.
21. Wu, S.; Tang, Y.; Zhu, Y.; Wang, L.; Xie, X.; Tan, T. Session-based recommendation with graph neural networks. AAAI Conf. Artif. Intell. 2019, 33, 346–353.
22. Yu, F.; Zhu, Y.; Liu, Q.; Wu, S.; Wang, L.; Tan, T. TAGNN: Target attentive graph neural networks for session-based recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 1921–1924.
23. Qiu, R.; Li, J.; Huang, Z.; Yin, H. Rethinking the item order in session-based recommendation with graph neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 579–588.
24. Wang, J.; Ding, K.; Zhu, Z.; Caverlee, J. Session-based recommendation with hypergraph attention networks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), Virtual, 29 April–1 May 2021; pp. 82–90.
25. Chen, J.; Zhu, G.; Hou, H.; Yuan, C.; Huang, Y. AutoGSR: Neural architecture search for graph-based session recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1694–1704.
26. Zhuo, X.; Qian, S.; Hu, J.; Dai, F.; Lin, K.; Wu, G. Multi-hop multi-view memory transformer for session-based recommendation. ACM Trans. Inf. Syst. 2024, 42, 1–28.
27. Xu, M.; Wang, H.; Ni, B.; Guo, H.; Tang, J. Self-supervised graph-level representation learning with local and global structure. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 11548–11558.
28. Zhang, J.; Gao, M.; Yu, J.; Guo, L.; Li, J.; Yin, H. Double-scale self-supervised hypergraph learning for group recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 2557–2567.
29. Wei, C.; Liang, J.; Liu, D.; Wang, F. Contrastive graph structure learning via information bottleneck for recommendation. Adv. Neural Inf. Process. Syst. 2022, 35, 20407–20420.
30. Zhao, W.; Tang, D.; Chen, X.; Lv, D.; Ou, D.; Li, B.; Jiang, P.; Gai, K. Disentangled causal embedding with contrastive learning for recommender system. In Companion Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 406–410.
31. Yang, K.; Han, H.; Jin, W.; Liu, H. Spectral-aware augmentation for enhanced graph representation learning. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 2837–2847.
32. Li, A.; Cheng, Z.; Liu, F.; Gao, Z.; Guan, W.; Peng, Y. Disentangled graph neural networks for session-based recommendation. IEEE Trans. Knowl. Data Eng. 2022, 35, 7870–7882.
33. Cao, Z.; Li, J.; Wang, Z.; Li, J. DiffusionE: Reasoning on knowledge graphs via diffusion-based graph neural networks. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 222–230.
34. Dang, Y.; Yang, E.; Guo, G.; Jiang, L.; Wang, X.; Xu, X.; Sun, Q.; Liu, H. Uniform sequence better: Time interval aware data augmentation for sequential recommendation. AAAI Conf. Artif. Intell. 2023, 37, 4225–4232.
35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
36. Li, Y.; Gao, C.; Luo, H.; Jin, D.; Li, Y. Enhancing hypergraph neural networks with intent disentanglement for session-based recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1997–2002.
37. Qiao, S.; Zhou, W.; Wen, J.; Zhang, H.; Gao, M. Bi-channel multiple sparse graph attention networks for session-based recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 2075–2084.
Figure 1. The overall structure of EGDLTP-CCL.
Figure 2. The performance of the EGDLTP-CCL model and its variants.
Figure 3. Impact of the number of diffusion layers.
Figure 4. Impact of the learning rate.
Figure 5. Impact of $\lambda_L$.
Figure 6. Impact of the epoch.
Table 1. Statistics of the datasets.

| Datasets | # Items | # Clicks | # Training Sessions | # Test Sessions | Average Length |
|---|---|---|---|---|---|
| Yoochoose | 16,766 | 557,248 | 369,859 | 55,898 | 6.16 |
| Retailrocket | 36,968 | 710,586 | 433,648 | 15,132 | 5.43 |
| Diginetica | 43,097 | 982,961 | 719,470 | 60,858 | 5.12 |
| Tmall | 40,728 | 818,479 | 351,268 | 25,898 | 6.69 |
Table 2. Comparisons of the performance.

| Methods | Tmall P@20 | Tmall MRR@20 | Diginetica P@20 | Diginetica MRR@20 | Yoochoose P@20 | Yoochoose MRR@20 | RetailRocket P@20 | RetailRocket MRR@20 |
|---|---|---|---|---|---|---|---|---|
| GRU4REC | 10.98 | 5.92 | 29.98 | 8.92 | 60.64 | 22.89 | 44.01 | 23.67 |
| NARM | 23.35 | 10.68 | 44.35 | 15.68 | 68.32 | 28.63 | 50.22 | 24.59 |
| STAMP | 26.44 | 13.35 | 45.44 | 14.32 | 68.74 | 29.67 | 50.96 | 25.17 |
| SR-GNN | 27.65 | 13.76 | 50.26 | 17.26 | 70.57 | 30.94 | 50.32 | 26.57 |
| TAGNN | 29.26 | 13.56 | 51.33 | 17.90 | 71.02 | 31.12 | 52.06 | 18.22 |
| DHCN | 31.51 | 15.08 | 53.18 | 18.44 | 70.39 | 29.92 | 53.66 | 27.30 |
| Disen-GNN | 31.56 | 15.31 | 53.79 | 18.99 | 71.46 | 31.36 | 47.44 | 29.32 |
| HIDE | 37.12 | 18.69 | 53.68 | 18.36 | 70.33 | 30.66 | 51.33 | 28.89 |
| MSGAT | 40.14 | 23.35 | 55.68 | 19.22 | 71.66 | 31.46 | 53.88 | 29.76 |
| EGDLTP-CCL | 43.27 | 26.10 | 56.27 | 19.89 | 71.78 | 31.56 | 55.65 | 30.12 |