Article

Debiasing Session-Based Recommendation for the Digital Economy: Propensity-Aware Training and Temporal Contrast on Graph Transformers

1 School of Foreign Languages, Guangdong Polytechnic Normal University, Guangzhou 510640, China
2 Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
3 Department of Mathematics, City University of Hong Kong, Hong Kong, China
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(1), 84; https://doi.org/10.3390/electronics15010084
Submission received: 13 November 2025 / Revised: 5 December 2025 / Accepted: 22 December 2025 / Published: 24 December 2025
(This article belongs to the Special Issue Advances in Deep Learning for Graph Neural Networks)

Abstract

Session-based recommender systems (SBRs) are critically impaired by exposure bias in observational training logs, causing models to overfit to logging policies rather than true user preferences. This bias distorts offline evaluation and harms generalization, particularly for long-tail items. To address this, we propose the Propensity- and Temporal-consistency Enhanced Graph Transformer (PTE-GT), a principled framework that enhances a recent interval-aware graph transformer backbone with two synergistic training-time modules. This Graph Neural Network (GNN)-based architecture is adept at modeling the complex, graph-structured nature of session data, capturing intricate item transitions that sequential models might miss. First, we introduce a propensity-aware (PA) optimization objective based on the self-normalized inverse propensity scoring (SNIPS) estimator. This module leverages logs containing randomized exposure or logged behavior-policy propensities to learn an unbiased risk estimate, correcting for the biased data distribution. Second, we design a lightweight, view-free temporal consistency (TC) contrastive regularizer that enforces alignment between session prefixes and suffixes, improving representation robustness without computationally expensive graph augmentations, which are often a bottleneck for graph-based contrastive methods. We conduct comprehensive evaluations on three public session-based benchmarks—KuaiRand, the OTTO e-commerce challenge dataset (OTTO), and the YOOCHOOSE-1/64 split (YOOCHOOSE)—and additionally on the publicly available Open Bandit Dataset (OBD), which contains logged bandit propensities. Our results demonstrate that PTE-GT significantly outperforms strong baselines. Critically, on datasets with randomized exposure or logged propensities, our unbiased evaluation protocol, using SNIPS-weighted metrics, reveals a substantial performance gain that is masked by standard, biased metrics. Our method also shows marked improvements in model calibration and long-tail item recommendation.

1. Introduction

Session-based recommender (SBR) systems are a critical component of modern online platforms, including e-commerce, video streaming, and content delivery [1,2,3]. Their primary goal is to predict a user’s next action, such as the next item to click or purchase, based solely on the sequence of interactions within their current, often anonymous, session. Unlike traditional collaborative filtering, SBR models must operate without access to long-term user profiles or explicit user identifiers, making them essential for handling new users and respecting user privacy [4].
To capture the complex dependencies within these short-term interaction sequences, the field has increasingly shifted towards powerful deep learning architectures. Early successes with recurrent neural networks (RNNs) were followed by the adoption of Transformer models, such as the Self-Attentive Sequential Recommendation (SASRec) architecture [5], which leverages self-attention to capture long-range dependencies between items in a sequence. Concurrently, Graph Neural Networks (GNNs) have demonstrated significant efficacy. This paradigm is motivated by the observation that user behavior is often not strictly linear. GNNs address this by modeling a session as an item-transition graph [6,7,8]. This graph-based approach, exemplified by the Session-based Recommendation Graph Neural Network (SR-GNN), allows the model to explicitly capture complex, non-sequential relationships and structural patterns, such as revisiting items or exploring divergent paths. By propagating information along the graph’s edges, GNNs can model dependencies between items that are distant in time but semantically related, a task that remains challenging for purely sequential models. Recent work has sought to integrate these paradigms, resulting in advanced backbones like graph transformers that leverage GNN message passing within an attention-based architecture, jointly modeling sequential context and structural connectivity [9]. Subsequent works further refined these ideas by incorporating target-aware attention mechanisms [10], dual-channel hypergraph modeling [11], and lossless encoding schemes [12].
Despite their modeling sophistication, these systems are almost universally trained on observational data that suffers from a fundamental flaw: exposure bias [13,14,15]. The interaction logs collected from production systems do not represent a uniform sample of true user preferences. Instead, they reflect a heavily biased feedback loop, where items displayed at higher ranks or with greater frequency are clicked more often, simply because they are seen more. This bias is inherent in the logging policy. Models trained naively on such data learn to conflate this visibility-driven popularity with genuine user relevance [16,17]. Consequently, these models tend to overfit to the biases of the logging policy, leading to sub-optimal recommendations, poor generalization to new or long-tail items, and a significant distortion of offline evaluation metrics.
In the broader field of Learning-to-Rank (LTR), mitigating such biases through counterfactual learning is a well-established practice [18]. Methods based on inverse propensity scoring (IPS) can, under certain assumptions, correct for the biased data distribution by re-weighting the training loss for each sample [19]. This allows the model to optimize for an unbiased risk estimate. However, the application of these principled debiasing techniques in the session-based recommendation domain has been severely limited. This is particularly true for complex Graph Neural Network and Transformer models, where the non-trivial architectures and the graph-based data representation add complexity to the integration of sample-wise propensity scores.
The emergence of public datasets containing logs from randomized controlled trials, such as KuaiRand, provides a unique pathway to address this challenge [20,21]. By intentionally inserting uniformly random items into recommendation feeds, these datasets make it possible to estimate exposure propensities from data, thereby enabling the use of counterfactual estimation techniques.
In this work, we leverage this capability to develop a principled framework for debiasing session-based graph recommendation. We adopt a recent Interval-enhanced Graph Transformer as our backbone. This choice is deliberate: the graph transformer architecture is uniquely suited to SBR, inheriting the structural reasoning power of GNNs while using self-attention to capture long-range, time-aware dependencies, and it effectively models both the graph structure and the precise temporal intervals between events in a session. We integrate this backbone with two key training-time modules. First, we introduce a propensity-aware optimization objective based on the self-normalized inverse propensity scoring estimator, SNIPS, which uses propensities learned from randomized exposure data to correct for bias. Second, to further improve the quality and robustness of the learned session representations, we propose a lightweight, view-free temporal consistency contrastive regularizer. This module encourages the model to learn invariant representations for temporally adjacent sub-sessions (prefixes and suffixes) without relying on computationally expensive graph augmentations, which are a common bottleneck for GNN-based contrastive learning.
Our contributions are three-fold:
  • We propose a principled, propensity-aware training objective grounded in unbiased counterfactual risk minimization, adapting the self-normalized inverse propensity scoring estimator for the training of an interval-aware graph transformer for session-based recommendation.
  • We design a view-free temporal consistency contrastive objective that regularizes session representations by enforcing alignment between session prefixes and suffixes, improving robustness without the computational overhead of structural data augmentations, a significant challenge for graph-based models.
  • We conduct a comprehensive evaluation on four public datasets—KuaiRand, OTTO, YOOCHOOSE, and the Open Bandit Dataset (OBD)—demonstrating the efficacy of our approach using both standard top-K metrics and unbiased evaluation protocols enabled by randomized exposure or logged behavior-policy propensities.
The remainder of this paper is structured as follows. We first review the related literature in session-based recommendation, unbiased learning, and contrastive methods. We then describe the datasets used in our evaluation, including those with randomized exposure. Following this, we detail our proposed methodology, covering the interval-aware graph transformer backbone, the propensity-aware learning objective, and the temporal consistency regularizer. Subsequently, we outline the experimental setup and present our comprehensive results, including performance on standard benchmarks, unbiased evaluations, and analyses of model calibration and long-tail performance. We then provide ablation studies to validate our design choices. Finally, we conclude the paper with a discussion of our findings.

2. Related Work

Session-based recommendation has progressed from recurrent models like GRU4Rec to more complex graph-based and Transformer-based architectures [22]. GNNs for SBR, such as the canonical SR-GNN, model sessions as item-transition graphs, capturing structural patterns [23]. Subsequent works further refined this approach by incorporating target-aware attention mechanisms [10], dual-channel hypergraph modeling [11], and lossless encoding schemes [12].
Concurrently, Transformer models became powerful non-graph alternatives. SASRec applied uni-directional self-attention, while Bidirectional Encoder Representations from Transformers for Recommendation (BERT4Rec) used bidirectional masking. Industrial-scale models like TRON (the production Transformer baseline released for the OTTO challenge) demonstrated the efficiency and accuracy of optimized Transformers on large datasets like OTTO, establishing strong, reproducible baselines [24]. Recently, graph transformers, such as the Interval-enhanced Graph Transformer (IGT), have emerged to integrate structural graph information with temporal dynamics by encoding inter-event time intervals directly into the attention mechanism, aligning well with the time-aware nature of SBR [25].
A primary challenge in SBR is the inherent exposure bias in logged implicit feedback. The field of unbiased LTR provides a principled foundation for addressing this, primarily through IPS and its self-normalized variant SNIPS [26,27]. These methods re-weight training samples to correct for biased exposure, enabling the optimization of an unbiased risk estimate. Adapting these counterfactual learning techniques to recommender systems, particularly for debiasing from missing-not-at-random feedback, has become an active area of research [28].
A critical enabler for such debiasing is access to data with randomized exposure or logged behavior-policy propensities. The KuaiRand dataset, by inserting uniformly random items into feeds, provides the necessary ground truth for estimating propensities and performing reliable, SNIPS-based unbiased offline evaluation. Complementary to this, the Open Bandit Dataset (OBD) offers real-world logged bandit feedback with known action probabilities (“propensity_score”) under multiple behavior policies, enabling direct off-policy evaluation and unbiased weighting [29,30]. This complements large-scale observational datasets like OTTO, which serves as a standard benchmark for session model performance under biased, real-world traffic [20,21].
Finally, contrastive learning (CL) has been widely adopted as a regularization technique. Early methods like Self-supervised Graph Learning (SGL) applied graph augmentations to create different views for contrast [31]. However, recent work such as Simple Graph Contrastive Learning (SimGCL) has questioned the necessity of complex structural augmentations, showing that simple embedding-space noise can be sufficient [32]. This motivates “view-free” approaches, which avoid the computational cost and potential artifacts of heavy augmentations. In parallel, consistency-training objectives that enforce agreement between multiple stochastic views of the same sequence have been shown to be effective for sequential recommendation without explicit data augmentations [33]. For discrete event sequences, CoLES [34] formalizes a prefix–suffix style subsequence cropping scheme and proves that, under mild assumptions, such augmentations preserve the underlying latent generative process, providing a theoretical footing for sequence-level contrastive and consistency training.
Beyond classical IPS and SNIPS, a growing line of work applies causal inference and doubly robust techniques to recommendation. CausalRec [35] introduces a front-door style visual causal adjustment to mitigate presentation bias in visually aware recommendation. Doubly robust (DR) and multiple robust (MR) estimators [36,37] combine IPS with reward modeling to reduce variance and improve robustness to propensity misspecification. Recent surveys on causal recommendation [38,39] further systematize potential-outcome-based, SCM-based, and general counterfactual approaches. Our PTE-GT is orthogonal to these approaches: it can be viewed as a plug-in, SNIPS-based training objective for an interval-aware graph transformer backbone. In Section 6.2, we empirically compare PTE-GT against IPS-enhanced SASRec, a CausalRec-style baseline, and a doubly robust model, and find that PTE-GT achieves consistently higher SNIPS-weighted accuracy on both KuaiRand and OBD.
Our work synthesizes these three advancements. We utilize the interval-aware graph transformer IGT as our backbone, apply the SNIPS counterfactual training objective enabled by randomized exposure data, and introduce a lightweight, view-free temporal consistency regularizer tailored for session graphs.

3. Datasets

We evaluate on three widely used public session-based datasets (KuaiRand, OTTO, and YOOCHOOSE-1/64) and additionally on the publicly available Open Bandit Dataset (OBD), which provides logged bandit feedback with known propensities. KuaiRand provides intervened logs with randomized exposure suitable for unbiased training and evaluation; OTTO offers industry-grade, multi-objective sessions at scale; and YOOCHOOSE-1/64 remains a canonical session benchmark with widely reported statistics. OBD supplies impression-level logs with behavior-policy propensity scores suitable for off-policy evaluation and debiased training. Table 1 summarizes the descriptive statistics. KuaiRand's statistics are reported at the user-sequence level, as its logs are released as continuous user histories rather than pre-segmented sessions.
KuaiRand aggregates rich feedback signals from a large video platform and, crucially, inserts random items into normal feeds so that a small share of impressions is assigned uniformly at random. The official documentation reports three releases; we use KuaiRand-27K as our default and draw unbiased evaluation slices from logs flagged as randomized. The site specifies that approximately 0.37% of interactions in the final data are replaced by randomly sampled items, and it lists per-release counts for users, items, and interactions that we adopt here. These properties make KuaiRand suitable for propensity estimation and SNIPS-weighted evaluation.
OTTO is a real-world e-commerce session dataset collected from an online retailer and published with a Kaggle task. The maintainers provide aggregate counts for the training split: 12.9 million sessions, 1.86 million unique items, and 216.7 million events, with the event mix broken down into clicks, cart additions, and orders, and with descriptive distributional statistics such as a mean of 16.80 events per session. We follow the single-objective next-click protocol when aligning with session-based literature.
YOOCHOOSE-1/64 is the most common compact split of the RecSys-2015 challenge dataset used for session-based recommendation. Following the authoritative SR-GNN study, we report 369,859 training sessions and 55,898 test sessions, with 16,766 unique items, 557,248 click events, and an average session length of 6.16. We preserve the standard preprocessing that removes length-1 sessions and constructs next-item labels via prefix expansion.
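For concreteness, the following minimal Python sketch illustrates this standard preprocessing step (drop length-1 sessions, then expand each session into prefix/next-item pairs); the `sessions` container and its layout are illustrative, not part of any official release.

```python
# Minimal sketch of the standard YOOCHOOSE preprocessing: remove length-1
# sessions, then expand each session into (prefix, next-item) training pairs.
def prefix_expand(sessions):
    """sessions: dict mapping session_id -> time-ordered list of item ids."""
    examples = []
    for sid, items in sessions.items():
        if len(items) < 2:  # remove length-1 sessions
            continue
        for t in range(1, len(items)):
            examples.append((items[:t], items[t]))  # (prefix, next-item label)
    return examples

# e.g. the session [1, 5, 3] yields ([1], 5) and ([1, 5], 3)
```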
Open Bandit Dataset (OBD). We additionally use the Open Bandit Dataset, a large-scale logged bandit dataset collected from the ZOZOTOWN fashion e-commerce platform. OBD provides impression-level logs with multiple behavior policies and records the action probabilities in a “propensity_score” field, along with click indicators. We use the “all” campaign and follow the official release guidance for preprocessing; this dataset is well suited for off-policy evaluation and propensity-aware training [29,40]. In our preprocessed split, the logged action set size $|\mathcal{A}_i|$ is almost always three: 25.48 M impressions (97.99%) have exactly three candidate actions, 0.41 M (1.58%) have two, and only 0.11 M (0.42%) have a single action due to logging artefacts. To adapt this impression-level bandit data to our session-style encoder, we sort impressions chronologically for each user and construct a sliding context of the last $w$ impressions. The session encoder $f_\theta$ receives the sequence of the last $w$ (item, timestamp) pairs and outputs a context embedding, which is then scored against the candidate actions in $\mathcal{A}_i$. We choose a window size of $w = 10$ based on validation performance on OBD, as discussed in Section 7.3. For each impression $i$ with logged action set $\mathcal{A}_i$ and click label, we score all $a \in \mathcal{A}_i$ using the same encoder fed with the last $w$ impressions as context. Top-K is computed over $\mathcal{A}_i$; the clicked action (if any) serves as the ground truth. SNIPS weights $1/\text{propensity\_score}_i$ are applied to the indicator terms in Recall@K and to the reciprocal rank in MRR@K, with self-normalization across the evaluation set.
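The SNIPS weighting of the evaluation metrics described above admits a compact implementation. The following sketch shows how the self-normalized Recall@K and MRR@K are obtained; the variable names (`ranks`, `propensities`) are illustrative and assumed to be precomputed per impression.

```python
import numpy as np

def snips_metrics(ranks, propensities, k=20):
    """SNIPS-weighted Recall@K and MRR@K over an evaluation set.

    ranks:        1-based rank of the clicked action within the candidate set
                  (np.inf if the click was not retrieved).
    propensities: logged behavior-policy probabilities (propensity_score).
    """
    ranks = np.asarray(ranks, dtype=float)
    w = 1.0 / np.asarray(propensities, dtype=float)  # inverse-propensity weights
    hit = (ranks <= k).astype(float)                 # Recall@K indicator terms
    rr = np.where(ranks <= k, 1.0 / ranks, 0.0)      # reciprocal-rank terms
    # Self-normalization: divide by the sum of weights, not by N.
    snips_recall = (w * hit).sum() / w.sum()
    snips_mrr = (w * rr).sum() / w.sum()
    return snips_recall, snips_mrr
```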

4. Methodology

This section presents a principled framework for debiasing session-based recommendation with a recent graph-transformer backbone and two training-time modules. The pipeline consists of (i) constructing a temporally enriched session graph and encoding it with an interval-aware graph transformer; (ii) learning under inverse-propensity weighting estimated from logs with randomized exposure; and (iii) regularizing the session representation by a view-free temporal consistency objective. Figure 1 illustrates the end-to-end framework, Figure 2 details the propensity-aware learning flow, and Figure 3 depicts the temporal consistency mechanism.

4.1. Session Graph Construction and Interval-Aware Graph Transformer

Our session encoder is based on the Interval-enhanced Graph Transformer (IGT) architecture [25]. Concretely, we adopt the same interval-aware multi-head attention mechanism and Fourier time encoding as in IGT, but restrict the neighborhood $N(i)$ to one-step predecessors and successors in the session graph and use a gated time-aware readout in place of the original global pooling.
Let a user session be an ordered sequence $S = (x_1, \dots, x_T)$ of item interactions with timestamps $(t_1, \dots, t_T)$. We construct a directed multigraph $G = (V, E)$ per session, where $V = \{x_1, \dots, x_T\}$ and $E = \{(x_\ell \to x_{\ell+1})\}_{\ell=1}^{T-1}$. Each edge carries a time-interval feature $\Delta t_\ell = t_{\ell+1} - t_\ell$ and a relative rank feature $\Delta p_\ell = 1$. Node features are $h_\ell^{(0)} = e(x_\ell) + p(\ell)$, where $e(\cdot)$ is the learnable item embedding and $p(\ell)$ is a learnable position embedding within the session.
To encode temporal recency and the nonlinearity of inter-event gaps, we map $\Delta t$ to a Fourier time feature
$$[r(\Delta t)]_{2i-1} = \cos\big(\omega_i \log(1+\Delta t)\big), \qquad [r(\Delta t)]_{2i} = \sin\big(\omega_i \log(1+\Delta t)\big), \qquad i = 1, \dots, m,$$
followed by a linear projection $\tilde r = W_r\, r(\Delta t) \in \mathbb{R}^d$. The interval-aware graph transformer layer updates node representations with multi-head attention over one-step predecessors and successors. For a head $a = 1, \dots, H$, the pre-softmax attention energy from $j$ to $i$ is
$$e_{ij}^{(a)} = \frac{\big(Q^{(a)} h_i\big)^{\top} K^{(a)} h_j}{\sqrt{d/H}} + u^{(a)\top} \tilde r_{ij} + v^{(a)\top} \tilde p_{ij},$$
where $\tilde r_{ij} = W_r\, r(\Delta t_{ij})$, $\tilde p_{ij} = W_p\, \mathrm{onehot}(\Delta p_{ij})$, and $Q^{(a)}, K^{(a)} \in \mathbb{R}^{(d/H) \times d}$ are head-specific projections. Attention coefficients are
$$\alpha_{ij}^{(a)} = \frac{\exp\big(e_{ij}^{(a)}\big)}{\sum_{k \in N(i)} \exp\big(e_{ik}^{(a)}\big)}, \qquad z_i^{(a)} = \sum_{j \in N(i)} \alpha_{ij}^{(a)}\, V^{(a)} h_j,$$
and the head outputs are concatenated and passed through a position-wise feed-forward network with residual connections and layer normalization:
$$h_i' = \mathrm{LN}\Big(h_i + \mathrm{Concat}\big(z_i^{(1)}, \dots, z_i^{(H)}\big)\Big), \qquad h_i^{(l+1)} = \mathrm{LN}\big(h_i' + \mathrm{FFN}(h_i')\big).$$
Here, $\mathrm{LN}(\cdot)$ denotes layer normalization and $\mathrm{FFN}(\cdot)$ is a position-wise feed-forward network. After $L$ layers we obtain node states $h_i^{(L)}$. The session representation is a gated readout with time-aware weights:
$$\gamma_i = \sigma\big(w^{\top} h_i^{(L)} + b - \log(1 + \kappa\, \Delta t_i^{\mathrm{now}})\big), \qquad s = \sum_{i=1}^{T} \frac{\gamma_i}{\sum_k \gamma_k}\, h_i^{(L)},$$
where $\Delta t_i^{\mathrm{now}}$ is the recency gap between item $x_i$ and the session end. The next-item distribution is $p_\theta(y \mid S) = \mathrm{softmax}(W_o\, s)$.
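To make the encoder concrete, the following PyTorch code is a minimal, single-session sketch of the layer defined above. It is not an official IGT implementation: for brevity it omits batching and the relative-rank bias $v^{(a)\top}\tilde p_{ij}$, and all module and variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

class FourierTimeEncoding(nn.Module):
    """r(dt) = [cos(w_i*log(1+dt)), sin(w_i*log(1+dt))], projected to R^d."""
    def __init__(self, m: int, d: int):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(m))  # learnable frequencies
        self.proj = nn.Linear(2 * m, d)            # linear projection W_r

    def forward(self, dt: torch.Tensor) -> torch.Tensor:  # dt: (E,) gaps >= 0
        phase = self.omega * torch.log1p(dt).unsqueeze(-1)           # (E, m)
        return self.proj(torch.cat([phase.cos(), phase.sin()], -1))  # (E, d)

class IntervalAwareLayer(nn.Module):
    """One interval-aware attention layer over a single session, restricted
    to one-step predecessors/successors; the rank bias term is omitted."""
    def __init__(self, d: int, heads: int, m: int = 16):
        super().__init__()
        assert d % heads == 0
        self.h, self.dk = heads, d // heads
        self.q, self.k, self.v = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.time_enc = FourierTimeEncoding(m, d)
        self.u = nn.Parameter(torch.randn(heads, d))  # per-head interval bias
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        T, d = h.shape                           # h: (T, d) states, t: (T,) times
        Q = self.q(h).view(T, self.h, self.dk)
        K = self.k(h).view(T, self.h, self.dk)
        V = self.v(h).view(T, self.h, self.dk)
        idx = torch.arange(T, device=h.device)
        mask = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs() == 1  # one-step neighbors
        dt = (t.unsqueeze(1) - t.unsqueeze(0)).abs().float()     # |dt_ij|, (T, T)
        rt = self.time_enc(dt.reshape(-1)).view(T, T, -1)        # r~_ij, (T, T, d)
        energy = torch.einsum('ihk,jhk->hij', Q, K) / math.sqrt(self.dk)
        energy = energy + torch.einsum('hd,ijd->hij', self.u, rt)  # + u^T r~_ij
        alpha = energy.masked_fill(~mask.unsqueeze(0), float('-inf')).softmax(-1)
        z = torch.einsum('hij,jhk->ihk', alpha, V).reshape(T, d)
        h = self.ln1(h + z)                      # residual + layer normalization
        return self.ln2(h + self.ffn(h))

def gated_readout(hL, dt_now, w, b, kappa=1.0):
    """gamma_i = sigmoid(w^T h_i + b - log(1 + kappa*dt_i^now)); weighted sum."""
    gamma = torch.sigmoid(hL @ w + b - torch.log1p(kappa * dt_now))
    return ((gamma / gamma.sum()).unsqueeze(-1) * hL).sum(0)
```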

4.2. Propensity-Aware Learning Objective

Logs collected under ranked exposure are biased. Let $D = \{(S_n, y_n, a_n)\}_{n=1}^{N}$ be training triples with observed next item $y_n$ and logging action $a_n$ indicating that $y_n$ is displayed and clicked under the production policy. The unbiased learning-to-rank risk for the cross-entropy loss $\ell_\theta(S, y) = -\log p_\theta(y \mid S)$ can be estimated by inverse propensity scoring:
$$\hat L_{\mathrm{IPS}}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \frac{\mathbb{I}(a_n = 1)}{\hat p_n}\, \ell_\theta(S_n, y_n), \qquad \hat p_n \equiv \hat p(a = 1 \mid S_n, y_n).$$
To stabilize variance we adopt the self-normalized estimator
$$\hat L_{\mathrm{SNIPS}}(\theta) = \frac{\sum_{n=1}^{N} w_n\, \ell_\theta(S_n, y_n)}{\sum_{n=1}^{N} w_n}, \qquad w_n = \min\!\big(c,\ 1/\hat p_n\big),$$
with clipping threshold $c > 0$. We further penalize weight dispersion to reduce the impact of rare large weights:
$$R_{\mathrm{var}} = \lambda \cdot \frac{1}{N} \sum_{n=1}^{N} \big(w_n - \bar w\big)^2, \qquad \bar w = \frac{1}{N} \sum_{n=1}^{N} w_n.$$
In datasets that provide randomized exposure or logged behavior-policy propensities, the estimation of $\hat p_n$ is straightforward. For KuaiRand, we estimate exposure probabilities from the randomized insertion protocol within the logs. For OBD, we directly use the logged behavior-policy probabilities, setting $\hat p_n = \text{propensity\_score}$ as recorded in the dataset, which obviates the need for a fitted exposure model and enables principled SNIPS-weighted training and evaluation. For diagnostic analysis in Section 7, we additionally evaluate the restricted and MLP-based propensity models on OBD by replacing the logged propensities with their predictions when computing the SNIPS weights, allowing us to study the impact of model capacity on calibration, variance, and downstream debiased performance. The supervised objective is $J_{\mathrm{sup}} = \hat L_{\mathrm{SNIPS}} + R_{\mathrm{var}}$.
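A minimal PyTorch sketch of the resulting supervised objective, assuming per-example propensities $\hat p_n$ are already available, is as follows (names are illustrative):

```python
import torch
import torch.nn.functional as F

def snips_loss(logits, targets, propensities, clip_c=20.0, lam=0.1):
    """Self-normalized IPS cross-entropy with clipping and a variance penalty.

    logits:       (B, |I|) next-item scores from the encoder.
    targets:      (B,) observed next-item indices.
    propensities: (B,) estimated exposure propensities in (0, 1].
    """
    ce = F.cross_entropy(logits, targets, reduction='none')  # per-example loss
    w = torch.clamp(1.0 / propensities, max=clip_c)          # w_n = min(c, 1/p_n)
    l_snips = (w * ce).sum() / w.sum()                       # self-normalized risk
    r_var = lam * w.var(unbiased=False)                      # lambda * mean((w - w_bar)^2)
    return l_snips + r_var
```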
Propensity estimation from randomized exposure. On logs containing randomized insertions, a binary indicator $z_n \in \{0, 1\}$ marks whether the item display at position $k$ was assigned at random with known probability $\rho_k$. We model the exposure propensity as a convex mixture
$$\hat p_n = z_n\, \rho_{k_n} + (1 - z_n)\, q_\phi(S_n, y_n, k_n),$$
where $q_\phi$ is a parametric exposure model fitted on the randomized slice by maximum likelihood. A simple and effective choice is a position-and-context model
$$q_\phi(S, y, k) = \sigma\big(\alpha_k + \beta^{\top} f(S) + \gamma^{\top} g(y)\big),$$
with $\sigma$ the logistic link, $f(S)$ session covariates such as length, dwell-time statistics, and a recency histogram, and $g(y)$ item covariates such as global popularity and category. When transferring to datasets without randomized insertions, $q_\phi$ is applied directly, and calibration is performed by matching the empirical display rate per rank.
In practice, we instantiate three propensity models of increasing capacity on the randomized slices of KuaiRand and OBD: (i) a restricted logistic regression (LR-Restrict) that uses only rank, log global popularity, and a small set of campaign indicators as features; (ii) a full logistic regression (LR-Full) that augments these with richer session-level covariates $f(S)$ (session length, log-recency of the last click, mean and variance of dwell times over the last three interactions, and a coarse histogram of inter-arrival times) and item-level covariates $g(y)$ (log global exposure count, log click count, and a learned projection of the item's categorical ID); and (iii) a two-layer MLP (MLP-2layer) with 64 hidden units that takes the same feature vector as input. All three models are trained by maximum likelihood on randomized impressions, and their calibrated propensities are then plugged into the SNIPS objective to train PTE-GT on KuaiRand and OBD.
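The following sketch illustrates how such a logistic exposure model can be fitted by maximum likelihood on the randomized slice; the feature layout and loader interface are illustrative assumptions, not part of any dataset release.

```python
import torch
import torch.nn as nn

class ExposureModel(nn.Module):
    """q_phi(S, y, k) = sigmoid(alpha_k + beta^T f(S) + gamma^T g(y)):
    a per-rank intercept plus linear terms over session/item covariates."""
    def __init__(self, n_ranks, d_session, d_item):
        super().__init__()
        self.alpha = nn.Embedding(n_ranks, 1)            # rank intercept alpha_k
        self.beta = nn.Linear(d_session, 1, bias=False)  # session covariates f(S)
        self.gamma = nn.Linear(d_item, 1, bias=False)    # item covariates g(y)

    def forward(self, rank, f_s, g_y):
        logit = (self.alpha(rank).squeeze(-1)
                 + self.beta(f_s).squeeze(-1)
                 + self.gamma(g_y).squeeze(-1))
        return torch.sigmoid(logit)

def fit_exposure(model, loader, epochs=5, lr=1e-3):
    """Maximum-likelihood fit on the randomized slice (displayed in {0, 1})."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(epochs):
        for rank, f_s, g_y, displayed in loader:
            loss = bce(model(rank, f_s, g_y), displayed.float())
            opt.zero_grad(); loss.backward(); opt.step()
```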
Unbiased evaluation. For a held-out set with randomized insertions, we compute SNIPS-weighted Recall@K and MRR@K by replacing each indicator with its IPS-weighted counterpart, ensuring that offline evaluation does not inflate scores due to rank bias. Figure 2 summarizes the estimation and weighting flow. Propensity model comparisons are summarized in Table 2.

4.3. Temporal Consistency Contrast Without Heavy Augmentations

To improve robustness and long-tail generalization, we impose a view-free consistency constraint on session representations. For each session $S$ we form a prefix $S^{\mathrm{pre}} = (x_1, \dots, x_{\tau_s})$ and a suffix $S^{\mathrm{suf}} = (x_{\tau_s+1}, \dots, x_T)$ with split point $\tau_s \sim \mathrm{Unif}\{2, \dots, T-2\}$. Passing each through the same encoder yields embeddings $s^{\mathrm{pre}}, s^{\mathrm{suf}} \in \mathbb{R}^d$. We add a small latent perturbation $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ to both views to avoid collapse while avoiding structural graph augmentations. Intuitively, this loss encourages the encoder to map different contiguous subsequences (prefix and suffix) of the same session to nearby points in the representation space, while pushing apart representations of subsequences coming from different sessions. The Info Noise-Contrastive Estimation (InfoNCE) loss with in-batch negatives and temperature $\tau$ is
$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{B} \sum_{i=1}^{B} \omega_i \log \frac{\exp\big(\langle \tilde s_i^{\mathrm{pre}}, \tilde s_i^{\mathrm{suf}} \rangle / \tau\big)}{\sum_{j=1}^{B} \exp\big(\langle \tilde s_i^{\mathrm{pre}}, \tilde s_j^{\mathrm{suf}} \rangle / \tau\big)}, \qquad \tilde s = \frac{s + \epsilon}{\lVert s + \epsilon \rVert_2}.$$
Temporal structure is emphasized by a decay weight
$$\omega_i = \frac{\exp\big(-\eta\, \Delta t_i^{\mathrm{gap}}\big)}{\frac{1}{B} \sum_{j=1}^{B} \exp\big(-\eta\, \Delta t_j^{\mathrm{gap}}\big)},$$
where $\Delta t_i^{\mathrm{gap}}$ is the time span between the prefix end and the suffix start of session $i$. This weighting strengthens alignment when the two views are temporally close and relaxes it when they are far apart. Figure 3 shows the prefix–suffix construction and contrastive alignment.
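A minimal PyTorch sketch of this weighted InfoNCE objective, assuming prefix and suffix embeddings have already been produced by the shared encoder, reads as follows (hyperparameter defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(s_pre, s_suf, dt_gap, tau=0.1, eta=0.01, sigma=0.05):
    """View-free prefix-suffix InfoNCE with temporal decay weights.

    s_pre, s_suf: (B, d) prefix/suffix embeddings from the same encoder.
    dt_gap:       (B,) time span between prefix end and suffix start.
    """
    # Small latent perturbation instead of structural graph augmentation.
    s_pre = F.normalize(s_pre + sigma * torch.randn_like(s_pre), dim=-1)
    s_suf = F.normalize(s_suf + sigma * torch.randn_like(s_suf), dim=-1)
    logits = s_pre @ s_suf.t() / tau          # (B, B) similarity matrix
    # In-batch InfoNCE: the matching suffix is the positive on the diagonal.
    labels = torch.arange(len(logits), device=logits.device)
    nce = F.cross_entropy(logits, labels, reduction='none')
    # Temporal decay weights, normalized to mean one across the batch.
    omega = torch.exp(-eta * dt_gap)
    omega = omega / omega.mean()
    return (omega * nce).mean()
```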
The total training objective combines debiased supervision and temporal consistency:
$$J(\theta) = J_{\mathrm{sup}}(\theta) + \lambda_{\mathrm{con}}\, \mathcal{L}_{\mathrm{con}}(\theta),$$
with $\lambda_{\mathrm{con}} > 0$ tuned on a validation split.
Our prefix–suffix temporal consistency objective can be viewed as a form of consistency training on sequential data. Assume that each session $S$ is generated from a latent intent variable $z$, and that both the prefix $S^{\mathrm{pre}}$ and the suffix $S^{\mathrm{suf}}$ are conditionally independent given $z$. Under this assumption, an encoder $f_\theta$ that maps $S$ to a representation $s = f_\theta(S)$ that is invariant across $(S^{\mathrm{pre}}, S^{\mathrm{suf}})$ pairs implicitly learns a sufficient statistic for $z$, in the sense that $s$ preserves the information about the latent intent while discarding view-specific noise. Recent work on consistency training for sequential recommendation [33] and contrastive learning on event sequences [34] formalizes similar intuitions, showing that enforcing agreement between multiple views (or subsequences) of the same sequence can reduce representation variance and improve generalization without requiring heavy augmentations. Our temporal consistency loss $\mathcal{L}_{\mathrm{con}}$ follows this line of reasoning: it penalizes disagreement between the representations of $S^{\mathrm{pre}}$ and $S^{\mathrm{suf}}$, and our ablation results in Section 7 empirically confirm that adding TC on top of the SNIPS-based PA module consistently improves both unbiased accuracy and long-tail recall. We do not claim a new convergence theorem but rather place our design within this established theoretical framework and support it with targeted empirical evidence.

4.4. Training Algorithm and Complexity

The model is trained by mini-batch stochastic optimization over sessions. Within a batch, we compute session graphs, interval-aware attention, SNIPS weights, the debiased cross-entropy loss, and the contrastive loss. Inference uses the encoder once per session and a single linear projection to produce top-K items.
Algorithm 1 summarizes the training procedure.
Computational cost. For a session of length $T$, one graph-transformer layer attends over $|N(i)| \le 2$ neighbors when restricted to immediate transitions, yielding $O(HTd)$ complexity per layer, linear in session length. The propensity computation and SNIPS weighting are $O(B)$ per batch. The contrastive loss introduces an in-batch softmax of cost $O(B^2 d)$. The overall training throughput is dominated by encoder forward passes; memory scales linearly with $T$ under per-session graphs.
Algorithm 1 Debiased session-graph training with temporal consistency.
1:  Inputs: training logs $D$, randomized slice $R$, rank-wise random rates $\{\rho_k\}$, encoder $f_\theta(\cdot)$, exposure model $q_\phi(\cdot)$, temperature $\tau$, contrast weight $\lambda_{\mathrm{con}}$, clip $c$, variance weight $\lambda$.
2:  Fit exposure model $q_\phi$ on $R$ by maximizing the likelihood of observed displays.
3:  for each training epoch do
4:      for each mini-batch $B = \{(S, y, a, k)\}$ do
5:          ▹ Forward pass
6:          $s \leftarrow f_\theta(S)$    ▹ session embedding via interval-aware graph transformer
7:          $p \leftarrow \mathrm{softmax}(W_o s)$    ▹ next-item distribution
8:          ▹ Propensity weights
9:          $\hat p \leftarrow z \cdot \rho_k + (1 - z) \cdot q_\phi(S, y, k)$    ▹ $z$ indicates randomized assignment
10:         $w \leftarrow \min(c, 1/\hat p)$;  $\hat w \leftarrow w / \mathrm{mean}(w)$
11:         $\mathcal{L}_{\mathrm{sup}} \leftarrow \sum \big(\hat w \cdot \mathrm{CE}(p, y)\big) / \sum \hat w + \lambda \cdot \mathrm{Var}(w)$
12:         ▹ Temporal consistency
13:         sample split point $\tau_s$ uniformly; form prefix $S^{\mathrm{pre}}$ and suffix $S^{\mathrm{suf}}$ from $S$
14:         $s^{\mathrm{pre}} \leftarrow f_\theta(S^{\mathrm{pre}})$;  $s^{\mathrm{suf}} \leftarrow f_\theta(S^{\mathrm{suf}})$
15:         add Gaussian noise to $s^{\mathrm{pre}}, s^{\mathrm{suf}}$;  $\ell_{\mathrm{con}} \leftarrow \mathrm{InfoNCE}(s^{\mathrm{pre}}, s^{\mathrm{suf}}; \tau)$
16:         ▹ Total loss and update
17:         $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{sup}} + \lambda_{\mathrm{con}} \cdot \ell_{\mathrm{con}}$
18:         update $\theta$ by AdamW on $\nabla_\theta \mathcal{L}$
19:     end for
20: end for

5. Experimental Setup

5.1. Dataset Preprocessing and Evaluation Metrics

We follow the standard public preprocessing protocols for each dataset. For YOOCHOOSE-1/64, this involves filtering out sessions of length one and applying prefix expansion to create labeled sequences.
Our primary evaluation metrics are Recall@K, Normalized Discounted Cumulative Gain (NDCG@K), and Mean Reciprocal Rank (MRR@K), where we report K = 10 and K = 20. In our results tables, we abbreviate Recall@K, NDCG@K, and MRR@K as R@K, N@K, and M@K, respectively, for brevity. For the OTTO dataset, we also report the official single-objective click metrics to ensure a fair comparison with published baselines. On datasets containing randomized interventions or logged propensities (KuaiRand and OBD), we conduct an unbiased evaluation, reporting metrics weighted by SNIPS. We denote the SNIPS-weighted Recall@K and MRR@K as SNIPS-R@K and SNIPS-M@K. On standard observational datasets (OTTO and YOOCHOOSE), we report the standard (biased) top-K metrics. Beyond accuracy, we analyze model calibration using the Expected Calibration Error (ECE) and Brier Score, and assess performance on long-tail item distributions.
For OBD, we use the “all” campaign, filter impressions where the click indicator is defined, and treat each impression as a training instance. We use the provided propensity_score field as the behavior-policy exposure probability for both training (SNIPS-weighted objective) and evaluation (SNIPS-weighted metrics). As OBD is impression-based rather than session-based, we adapt next-item prediction to the impression scenario by conditioning on short impression contexts when applicable and aligning the evaluation protocol with off-policy evaluation practice.
Unless otherwise stated, all reported results are averaged over five random seeds. For clarity of presentation, we report mean values in the main performance and unbiased evaluation tables and summarize standard deviations and statistical significance for key comparisons in a dedicated table devoted to statistical tests. We compute paired t-tests for standard metrics and cluster bootstrap confidence intervals for SNIPS-weighted metrics between PTE-GT and the strongest baseline on each dataset; the improvements of PTE-GT over TRON on OTTO and over IGT-Base on KuaiRand are highly significant ($p < 10^{-4}$), while the gains on YOOCHOOSE and OBD are also statistically significant with $p < 0.05$.

5.2. Baselines and Implementation Details

We compare our proposed PTE-GT model against a suite of strong baselines, including: (1) GRU4Rec+, a widely cited RNN-based model [41]; (2) SASRec, a representative Transformer-based sequential model [5]; (3) SR-GNN, a canonical GNN-based session model [23]; (4) TRON, a strong industrial Transformer baseline from the OTTO challenge [24]; and (5) IGT-Base, our backbone interval-aware graph transformer trained with standard cross-entropy loss [25]. For the unbiased debiasing comparison on KuaiRand and OBD, we additionally consider two causal/DR-style baselines: (6) CausalRec-adapt, which follows the front-door debiasing idea of CausalRec [35] but uses SASRec as a backbone and replaces visual features with item popularity and category features as proxies for exposure; and (7) DR-JL, a doubly robust learning variant that combines IPS with a separate reward model in the spirit of DR estimators [36,37], again instantiated on top of SASRec.

Software and Computing Environment

All experiments were implemented in Python 3.11 and PyTorch 2.9.0 and run on a single NVIDIA A100 GPU with CUDA 11.1 and cuDNN 8.9.7. We use the AdamW optimizer for all models. The learning rate is selected via grid search from $\{1 \times 10^{-3},\ 5 \times 10^{-4},\ 1 \times 10^{-4}\}$ and the batch size is set to 512. The item and position embedding dimension is set to 128 across all models for fair comparison. For our PTE-GT model, the IGT backbone uses $L = 2$ graph transformer layers and $H = 4$ attention heads. Based on validation-set tuning, the SNIPS clipping threshold $c$ is set to 20, and the temporal consistency temperature $\tau$ is set to 0.1.

6. Results

This section details our experimental results on four datasets, including three public session-based benchmarks—OTTO, YOOCHOOSE-1/64, and KuaiRand—and the Open Bandit Dataset (OBD). We compare our proposed model, PTE-GT, with a series of strong baselines.

6.1. Main Performance on Public Benchmarks

We first evaluate the standard top-K recommendation accuracy of our models on two widely used public benchmarks, OTTO and YOOCHOOSE-1/64. OTTO represents a large-scale, sparse, real-world e-commerce scenario, while YOOCHOOSE-1/64 is a smaller, standard benchmark widely used for comparison. Results are summarized in Table 3.
The results in Table 3 demonstrate the effectiveness of PTE-GT on both datasets. IGT-Base refers to using only our interval-aware graph transformer backbone without propensity-aware training or temporal consistency regularization. On the YOOCHOOSE-1/64 dataset, all graph- and Transformer-based models, including SR-GNN, TRON, and IGT-Base, perform strongly, significantly outperforming GRU4Rec+. Our IGT-Base backbone alone achieves performance comparable to TRON. After adding our propensity-aware and temporal-consistency modules, our full model, PTE-GT, achieves the best performance on both R@20 and M@20, validating our approach on this standard benchmark.
On the much larger and more challenging OTTO dataset, this advantage is more pronounced. TRON is the strong baseline reported officially. Our IGT-Base backbone again slightly outperforms TRON. Our full model, PTE-GT, achieves a significant performance improvement, with a 4.0% relative increase in R@20 and a 5.0% relative increase in M@20 over TRON. This indicates that our training strategy is robust in handling the sparsity and inherent biases of large-scale, real-world data.

6.2. Unbiased Evaluation on Randomized/Logged-Propensity Datasets

Standard evaluation (as in Table 3) is conducted on observational logs, which can be influenced by exposure bias, thereby inflating performance. To more accurately measure the true performance of the models in an unbiased setting, we conduct unbiased evaluation on KuaiRand-27K and the Open Bandit Dataset (OBD). KuaiRand contains randomly exposed interactions, while OBD provides logged behavior-policy propensities, enabling us to use SNIPS to estimate unbiased evaluation metrics on both datasets. Table 4 summarizes the SNIPS-weighted Recall@20 and MRR@20 on these two datasets.
The first three rows in Table 4 correspond to “naive” models trained with standard (unweighted) objectives: SASRec and IGT-Base both suffer from severe exposure bias, and their SNIPS-R@20 scores are noticeably lower than those of their debiased counterparts. Adding IPS weighting on top of SASRec (SASRec-IPS) already improves the unbiased metrics, confirming the effectiveness of classical off-policy correction.
We then include two stronger debiasing baselines that are closer to recent causal recommendation work. CausalRec-adapt follows the front-door style debiasing idea of CausalRec but replaces visual features with item popularity and category features as proxies for exposure, and uses SASRec as the rating backbone. DR-JL is a doubly robust variant that combines IPS with a learned reward model, instantiated on top of SASRec, inspired by DR and multiple robust estimators [36,37]. Both methods outperform SASRec-IPS on KuaiRand and OBD; for instance, DR-JL reaches 12.54% SNIPS-R@20 on KuaiRand and 12.31% on OBD.
However, our full PTE-GT model still achieves the highest SNIPS-weighted accuracy on both datasets: 13.05% SNIPS-R@20 on KuaiRand and 12.74% on OBD, together with the best SNIPS-M@20. This shows that the gains of PTE-GT are not merely due to “adding IPS” but stem from the combination of an interval-aware graph transformer backbone, a carefully calibrated SNIPS objective, and the temporal consistency regularizer. The gaps between PTE-GT and the causal/DR baselines are substantially larger under unbiased evaluation than those observed on standard OTTO metrics, highlighting again that conventional offline evaluation tends to underestimate the value of principled debiasing.

6.3. Analysis of Recommendation Calibration

Beyond accuracy, the reliability of the predicted probabilities output by a recommender system (i.e., calibration) is also critical. A well-calibrated model’s predicted confidence (probability) should align with its actual accuracy. We use ECE and Brier Score to quantify model calibration, where lower is better for both. As shown in Table 5, the IGT-Base model trained without debiasing exhibits poor calibration (higher ECE) on both datasets. This is expected, as the model tends to be overconfident in its predictions for over-exposed popular items. In contrast, our PTE-GT model, through propensity-aware training, significantly reduces both ECE and Brier Score. For example, on the OTTO dataset, ECE was reduced from 5.12% to 2.85%, a 44.3% reduction.
The reliability diagrams in Figure 4a,c plot the empirical accuracy (y-axis) against the model’s average predicted probability (x-axis) for binned predictions. For IGT-Base, the curve consistently falls well below the perfect calibration diagonal, confirming the significant overconfidence first identified in Table 5. In stark contrast, the PTE-GT curve tracks the diagonal much more closely on both KuaiRand and OTTO, demonstrating superior calibration.
The confidence histograms in Figure 4b,d reveal the cause of this miscalibration. IGT-Base exhibits a common symptom of overconfidence: it pushes a disproportionately large share of its predictions into the highest confidence bin (e.g., 0.9–1.0). However, as shown in the reliability plots in Figure 4a,c, its actual accuracy in this bin is far lower (e.g., 70–78%). Our PTE-GT model corrects this, pulling these overconfident predictions back and redistributing them into more appropriate, lower-confidence bins. This improved calibration is crucial for building trust and for downstream tasks such as balancing exploration and exploitation in a production environment.

6.4. Performance on Long-Tail Items

Finally, we conduct a comprehensive analysis of the models’ performance on items of varying popularity, from popular “head” items to “long-tail” items. Exposure bias typically causes models to overfit to popular items, neglecting the long tail. We investigate this by evaluating performance across 10 item popularity deciles (decile 1 = top 10% most popular, decile 10 = 10% least popular).
The results are presented in the 2 × 4 matrix in Figure 5, which examines both Recall@20 for discovery breadth and MRR@20 for ranking quality across all four datasets.
The analysis reveals two critical findings. First, on all datasets, the baseline models (SASRec, IGT-Base) exhibit a sharp performance drop-off as items become less popular, confirming they are dominated by popularity bias. Our PTE-GT model consistently maintains a flatter curve, demonstrating superior performance on long-tail items (deciles 6–10).
Second, and most importantly, the figure highlights the limitations of standard (biased) evaluations. On OTTO and YOOCHOOSE (Figure 5a,b,e,f), the performance gap, while consistent, appears modest. However, under the principled, unbiased SNIPS evaluation on KuaiRand and OBD (Figure 5c,d,g,h), this gap widens substantially, especially in the long-tail deciles. Taken together, these results suggest that relying solely on observational logs can underestimate the benefits of debiasing, particularly for infrequent items.
This comprehensive 8-plot analysis indicates that our propensity-aware objective and temporal consistency regularization work in synergy, effectively mitigating popularity bias to enhance both recall-driven discovery and MRR-based ranking of long-tail items.
To provide a finer-grained view of debiased performance on datasets with unbiased propensities, Figure 6 plots SNIPS-weighted Recall@20 and MRR@20 by item popularity decile for KuaiRand and OBD, comparing IGT-Base and PTE-GT. On KuaiRand, PTE-GT improves SNIPS-R@20 from 18.3% to 18.9% in the head decile (1), but the relative gains grow steadily toward the tail, reaching +1.4 percentage points in the bottom-10% items (3.6% to 5.0%, approximately +39%). The corresponding SNIPS-M@20 in the tail rises from 1.51% to 2.10%, again about a 39% relative improvement. A similar pattern appears on OBD, where SNIPS-R@20 in decile 10 increases from 1.9% to 2.7% (roughly +42% relative) and SNIPS-M@20 from 0.80% to 1.13%. These curves indicate that PTE-GT not only narrows the gap on popular items but also achieves the largest proportional gains on the coldest items, which are typically the most affected by exposure bias.

7. Ablation and Analysis

To deeply understand the contributions of each component in our proposed PTE-GT model, we conduct a series of detailed ablation studies. We primarily investigate two core modules: (1) PA (Propensity-Aware Learning), the SNIPS-based propensity-aware objective function; and (2) TC (Temporal Consistency), our view-free temporal consistency contrastive regularizer. Our baseline model is IGT-Base, which uses only the interval-aware graph transformer backbone and is trained with a standard cross-entropy loss.

7.1. Impact of Core Components

We evaluate different variants of our model on the KuaiRand-27K and OTTO datasets. KuaiRand-27K allows us to use SNIPS-weighted metrics to assess unbiased performance, while OTTO represents a standard (biased) large-scale benchmark. We evaluate four key metrics: (1) SNIPS-R@20 (unbiased accuracy), (2) ECE (Expected Calibration Error, lower is better), (3) Long-Tail R@20 (unbiased recall on the bottom 60% least popular items on KuaiRand), and (4) R@20 (standard biased recall on OTTO). Results are summarized in Table 6.
The results in Table 6 clearly demonstrate the orthogonal contributions and synergistic effects of our two modules:
  • PA is Key to Debiasing and Calibration: Comparing (1) IGT-Base and (3) w/o TC, adding only the PA module brings the largest performance leap. On KuaiRand, unbiased SNIPS-R@20 increased from 10.88% to 12.65%. More importantly, PA significantly improved model calibration, with ECE dropping from 6.45% to 3.33%, and long-tail recommendation capability (Long-Tail R@20 from 4.93% to 7.81%). This confirms that propensity weighting is the core mechanism for addressing exposure bias, recovering long-tail items, and making model confidence more reliable.
  • TC Provides Robust Regularization: Comparing (3) w/o TC and (4) PTE-GT, adding the TC module on top of an already PA-enhanced model further improved SNIPS-R@20 from 12.65% to 13.05% and long-tail recall to 8.15%. Similarly, comparing (1) IGT-Base and (2) w/o PA, the TC module also brought moderate but consistent improvements to standard training (R@20 from 47.92% to 48.31%) and calibration, with ECE decreasing from 5.12% to 4.95%. This indicates that our temporal consistency contrastive loss acts as an effective regularizer, further enhancing generalization and robustness by encouraging the model to learn more time-invariant session representations. Figure 7 illustrates these trends visually: on KuaiRand, the largest absolute gains appear on long-tail SNIPS-R@20, whereas on OTTO the gains are smaller but consistently positive across all variants. These empirical gains are consistent with the general theory of consistency training on sequences [33,34], where enforcing agreement between multiple views of the same latent intent is expected to reduce representation variance and enhance robustness, especially under distributional shift and exposure bias.
  • Biased Metrics Mask the True Gap: On the OTTO dataset, the standard R@20 improvement of the full model (49.13%) over the baseline (47.92%) appears modest (+2.5%). However, in the unbiased evaluation on KuaiRand, the relative improvement in SNIPS-R@20 is a striking 19.9% (from 10.88% to 13.05%). This highlights that evaluating models on standard (biased) metrics severely underestimates the true problem caused by exposure bias and the genuine value brought by our propensity-aware approach.

7.2. Sensitivity to SNIPS Clipping Threshold c

Our propensity-aware learning relies on the SNIPS estimator, where the clipping threshold c is a key hyperparameter. c controls the bias–variance trade-off: a small c clips large propensity weights, reducing variance but reintroducing bias; a large c allows for more extreme weights, reducing bias but increasing variance and training instability.
We train our PTE-GT model on the KuaiRand dataset while varying the value of c (from 1 to 100) and plot the results in Figure 8. This combination chart clearly illustrates the bias–variance trade-off.
The bars (left Y-axis) represent the model performance (SNIPS-R@20). When c is very small (e.g., c = 1 or c = 5 ), performance is suppressed, as excessive clipping reintroduces bias. Performance steadily improves as c increases, peaking at c = 20 (13.05%). Beyond this point ( c = 50 , 100 ), performance slightly degrades.
The line (right Y-axis, log scale) represents the SNIPS weight variance $\mathrm{Var}(w)$. This variance remains low and stable for $c \le 20$ but increases exponentially thereafter. This sharp rise in variance coincides with the performance degradation, as the extreme weights lead to an unstable training process. Therefore, we select $c = 20$ as the optimal trade-off for all our experiments.
To further quantify this bias–variance trade-off and to assess its robustness across datasets and propensity model capacities, Figure 9 extends the analysis to both KuaiRand and OBD. Figure 9a plots SNIPS-R@20 as a function of the clipping threshold $c \in \{2, 5, 10, 15, 20, 30, 50, 75, 100\}$ when using the LR-Full propensity model. On KuaiRand, performance increases steadily from 12.40% at $c = 2$ to a peak of 13.05% at $c = 20$, then declines slightly for larger thresholds (12.90% at $c = 100$). OBD exhibits a similar pattern, rising from 11.88% at $c = 2$ to 12.74% at $c = 20$ before plateauing. Figure 9b reports the corresponding weight variance on a logarithmic scale: for both datasets, $\mathrm{Var}(w)$ remains close to one up to $c = 20$ but grows rapidly thereafter, reaching 3.00 on KuaiRand and 2.80 on OBD at $c = 100$. Figure 9c,d compare the three propensity models from Table 2—LR-Restrict, LR-Full, and MLP-2layer—in terms of AUC versus ECE and SNIPS-R@20 versus $\mathrm{Var}(w)$. LR-Full attains almost the same SNIPS-R@20 as the MLP (13.05% vs. 13.09% on KuaiRand; 12.74% vs. 12.81% on OBD) while keeping $\mathrm{Var}(w)$ much closer to LR-Restrict, placing it closest to the empirical Pareto frontier on both datasets.
On both KuaiRand and OBD, SNIPS-R@20 increases steadily as the clipping threshold $c$ grows from 5 to 20 (e.g., 12.60% to 13.05% on KuaiRand; 12.10% to 12.74% on OBD) but starts to saturate and slightly degrade beyond $c = 20$, while the weight variance grows sharply (e.g., from 1.00 up to 3.00 on KuaiRand and 2.80 on OBD at $c = 100$). This confirms that our default choice $c = 20$ is robust across datasets with randomized and logged propensities.

7.3. Context Window Size on the Open Bandit Dataset

On OBD, we further examine how the impression-level context window size $w$ affects performance for both PTE-GT and the SASRec-IPS baseline. Figure 10 reports SNIPS-R@20 and SNIPS-M@20 as functions of $w \in \{0, 3, 5, 8, 10, 15, 20, 30\}$. For PTE-GT, SNIPS-R@20 increases from 11.86% at $w = 0$ (no history) to a peak of 12.74% at $w = 10$, then decreases slightly to 12.68% and 12.60% at $w = 20$ and $w = 30$, respectively. SASRec-IPS exhibits a similar shape but remains consistently below PTE-GT (e.g., 10.85% to 11.48% to 11.40%). The SNIPS-M@20 curves follow the same pattern. Very short histories ($w = 0$ or $w = 5$) do not fully capture short-term preferences, while overly long histories ($w \ge 20$) introduce additional noise. A moderate window of $w = 10$ therefore offers the best trade-off on OBD and is used in all main experiments.
Statistical significance results are summarized in Table 7.

7.4. Qualitative Analysis of Interval-Aware Attention

To provide insight into our IGT backbone, we analyze its internal attention mechanism (Figure 11). We find clear evidence of head specialization. The model learns to distribute attention over distinct structural patterns (Figure 11a), such as ‘Immediate Forward’ edges versus ‘Far-Forward Shortcut’ edges. Furthermore, this structural preference strongly aligns with temporal preferences (Figure 11b). A t-distributed Stochastic Neighbor Embedding (t-SNE) projection of the interval embeddings shows that heads specializing in structural shortcuts also learn to focus on long time intervals, while sequential heads focus on short-to-medium intervals.
This spatio-temporal synergy is exemplified by Head ‘L1-H4’, which dedicates its mass to ‘Far-Forward Shortcut’ edges (59.2%) and concurrently dominates the long-interval region of the time manifold (orange cluster). In contrast, ‘L1-H1’ acts as a sequential processor (58.3% on ‘Immediate Forward’). This confirms our model learns to find unified patterns, such as “long-term, time-delayed dependencies,” rather than treating time and structure as independent features.

7.5. Comparison of Contrastive Learning Strategies

Finally, we validate the effectiveness of our proposed view-free temporal consistency regularizer TC by comparing it against a more standard graph augmentation-based contrastive learning method. We create a variant named PTE-GT (GraphAug), which replaces TC with a common graph CL strategy (i.e., creating two stochastic augmentations of the session graph, such as 20% node dropping and 20% edge perturbation, and then maximizing the consistency between the two augmented views). Unbiased evaluation results on KuaiRand show that PTE-GT (GraphAug) achieves a SNIPS-R@20 of 12.98%, which is slightly lower than our standard PTE-GT model’s 13.05%. However, in terms of computational cost, PTE-GT (GraphAug) introduces significant overhead: due to the need to construct and encode two separate, augmented graphs for each session, the training time per epoch increases by approximately 34%. In contrast, our proposed TC module, which operates only on the encoded representations, adds only about 8% training overhead. This comparison confirms that our designed view-free temporal consistency strategy is not only more computationally efficient but also more effective (at least compared to standard graph augmentation strategies), providing an effective and low-cost regularization for the model.

8. Case Study: Structure-Aware Evidence Flow to the Hit Item

To make the graph-specific reasoning of our model tangible, we visualize a single OTTO-style session using a Sankey diagram that aggregates evidence flows from upstream nodes to the top-1 hit item L. Edge width encodes a composite evidence weight
$$W_{ij} = \Big(\sum_{h=1}^{H} u_h\, \alpha_{ij}^{(h)}\Big) \cdot A_{ij} \cdot \exp\!\big(-\eta\, \Delta t_{ij}\big),$$
combining head-weighted graph attention $\alpha_{ij}^{(h)}$, edge attribution $A_{ij}$ (Integrated Gradients), and a temporal decay governed by $\eta$. For each candidate path $P$ to $L$, we summarize evidence by the geometric mean $\mathrm{PathW}(P)$ of $\{W_e\}_{e \in P}$ and normalize across the top evidence paths to obtain flow proportions.
The session is $A(0\,\mathrm{s}) \to B(18) \to C(41) \to D(65) \to E(71) \to F(96) \to G(131) \to H(145) \to L(158)$, with two non-sequential (shortcut) edges: $C \to F$ ($\Delta t = 55$ s) and $B \to E$ ($\Delta t = 53$ s). The four highest-evidence paths to $L$ are as follows:
  • P4: $B \to C \to F \to H \to L$ (30.2%)
  • P1: $A \to B \to C \to F \to G \to H \to L$ (28.5%)
  • P2: $A \to B \to E \to F \to G \to H \to L$ (26.6%)
  • P3: $A \to B \to C \to D \to E \to F \to G \to H \to L$ (14.7%)
These flows sum to 100% by construction. Notably, $C \to F$ and $B \to E$ carry the majority of the total evidence via P4/P1/P2 (85.3%), while the purely sequential route P3 contributes only 14.7%.
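The aggregation itself is straightforward; a minimal sketch (with illustrative inputs) of the geometric-mean path weight and the normalized flow proportions is:

```python
import numpy as np

def path_weight(edge_weights):
    """Geometric mean of composite edge weights W_e along a path (all W_e > 0)."""
    return float(np.exp(np.mean(np.log(edge_weights))))

def flow_proportions(paths):
    """paths: dict mapping path name -> list of edge weights [W_e, ...]."""
    raw = {name: path_weight(ws) for name, ws in paths.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}  # sums to 1 by construction
```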
Figure 12 visualizes this session.
The concentration of evidence on shortcut edges, despite longer temporal gaps, indicates that the model leverages graph connectivity beyond strict sequence order. This structure-aware behavior is precisely where the graph transformer outperforms sequential baselines.

9. Discussion and Conclusions

9.1. Discussion

Our findings suggest that explicitly addressing exposure bias is important in the training and evaluation of session-based recommender systems. Our unbiased evaluation on the KuaiRand and OBD datasets (Table 4) reveals a stark reality: models that perform well on standard (biased) metrics (e.g., on OTTO) may largely be fitting the biases introduced by the data collection policy, rather than the users’ true underlying preferences. The disparity in performance between standard evaluation (Table 3) and unbiased evaluation (Table 4) indicates that relying solely on observational logs for model assessment can be misleading, especially when long-tail behavior is of interest.
Our framework, PTE-GT, addresses this challenge through two orthogonal and complementary modules. First, the PA module is the core of the debiasing. By adopting the SNIPS estimator, the model re-weights its loss, increasing its focus on items that were clicked despite low exposure rates (i.e., low propensity scores). As the ablation study shows (Table 6), PA is the primary driver of improvements in long-tail recall (e.g., +59% in Long-Tail R@20) and calibration (e.g., a >48% reduction in ECE). This confirms our hypothesis that directly optimizing an unbiased risk estimate is the fundamental way to correct the model's systematic bias toward popular items.
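As an illustrative sketch of this objective (PyTorch-style Python; the variable names and the numerical floor are our assumptions, not the exact implementation), the SNIPS weighting amounts to a few lines on top of the per-example cross-entropy:

import torch

def snips_objective(per_example_ce, propensity, clip_c=20.0):
    # per_example_ce: (B,) cross-entropy terms; propensity: (B,) estimated
    # exposure probabilities p_hat_n. w_n = min(1 / p_hat_n, c) follows the
    # paper's notation; the 1e-6 floor is an illustrative safeguard.
    w = torch.clamp(1.0 / propensity.clamp_min(1e-6), max=clip_c)
    # Self-normalization: divide by the realized weight mass rather than the
    # batch size, which distinguishes SNIPS from vanilla IPS and reduces variance.
    return (w * per_example_ce).sum() / w.sum()

Low-propensity clicks thus receive larger weights, which is exactly the mechanism behind the long-tail gains in Table 6.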
Second, the view-free TC regularizer provides a lightweight yet effective means of enhancing representation robustness. Unlike expensive graph augmentation strategies (which, as our comparison with PTE-GT (GraphAug) showed, add approximately 34% training overhead), our prefix–suffix alignment strategy is computationally inexpensive (adding only about 8% overhead) while still delivering consistent performance gains (Table 6). It improves generalization by encouraging representations that are insensitive to minor temporal perturbations within a session, complementing the primary debiasing performed by the PA module.
However, our method also has limitations, which point to directions for future work. The accuracy of the propensity model is critical. Although our default deployment uses the calibrated LR-Full model, and we empirically compared it against both a more restricted logistic regression and a two-layer MLP in Table 2, more complex non-linear models provide slightly higher AUC at the cost of substantially larger weight variance, as visualized in Figure 9. Exploring this bias–variance trade-off, as well as investigating more advanced variance-reduction techniques beyond SNIPS clipping, would be a valuable research direction.
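The bias–variance trade-off mentioned above can be probed with a simple diagnostic, sketched below (Python/NumPy; the Beta-distributed propensities are synthetic stand-ins for the outputs of a fitted propensity model):

import numpy as np

def clipped_weight_variance(p_hat, clip_values):
    # Variance of clipped inverse-propensity weights for each threshold c:
    # a small c suppresses variance at the cost of bias; a large c does the opposite.
    inv = 1.0 / np.clip(p_hat, 1e-6, None)
    return {c: float(np.minimum(inv, c).var()) for c in clip_values}

rng = np.random.default_rng(0)
p_hat = rng.beta(0.5, 3.0, size=100_000)  # synthetic, popularity-skewed propensities
print(clipped_weight_variance(p_hat, [5, 10, 20, 50, 100]))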
In addition to classical IPS-based baselines, we also compared PTE-GT against a CausalRec-style front-door debiasing model and a doubly robust variant (DR-JL) built on top of SASRec. These causal/DR baselines consistently outperform plain IPS, but PTE-GT still achieves the best SNIPS-weighted accuracy on both KuaiRand and OBD, reinforcing the benefit of combining an interval-aware graph transformer with a carefully calibrated SNIPS objective and temporal consistency regularization.

9.2. Conclusions

In this paper, we proposed PTE-GT, a debiased graph transformer framework for session-based recommendation. Our model integrates two key training-time modules into an advanced interval-aware graph transformer backbone: a propensity-aware objective based on the SNIPS estimator that corrects for exposure bias, and a lightweight, view-free temporal consistency contrastive regularizer that enhances the robustness of session representations.
Through a comprehensive evaluation on three public session-based benchmarks and one public logged bandit dataset (OBD), we demonstrated that our approach not only outperforms baselines on standard recommendation metrics but, more importantly, achieves significant performance leaps under an unbiased evaluation protocol enabled by randomized exposure or logged propensities. Our model exhibits superior performance in terms of recommendation accuracy, calibration, and long-tail item discovery. This work provides a practical and effective pathway toward building more reliable and fair session-based recommender systems in bias-laden real-world environments.

Author Contributions

Conceptualization, Y.W., X.Q. and K.Z.; Methodology, X.Q. and K.Z.; Software, Y.W. and J.S.; Validation, Y.W. and J.S.; Formal Analysis, X.Q. and K.Z.; Investigation, Y.W.; Resources, X.Q. and K.Z.; Data Curation, Y.W. and J.S.; Writing—Original Draft Preparation, Y.W.; Writing—Review and Editing, Y.W., X.Q., J.S. and K.Z.; Visualization, Y.W. and J.S.; Supervision, X.Q. and K.Z.; Project Administration, Y.W.; Funding Acquisition, X.Q. and K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All public datasets used in this study are openly available: KuaiRand (available online: https://kuairand.com/ (accessed on 13 November 2025)); OTTO (available online: https://doi.org/10.34740/KAGGLE/DSV/4991874 (accessed on 13 November 2025)); YOOCHOOSE (available online: https://recsys.acm.org/recsys15/challenge/ (accessed on 13 November 2025)); and the Open Bandit Dataset (available online: https://research.zozo.com/data.html (accessed on 13 November 2025)).

Conflicts of Interest

The authors declare no conflicts of interest.

Notation

Notation | Description
S, G | A user session and its corresponding graph representation
x_i, t_i | An item interaction and its timestamp
T | The length (number of items) of a session
h_i, s | Node (item) representation and final session representation
d | Dimension of the embedding vectors
L, H | Number of layers and attention heads in the transformer
Δt | Time-interval feature between two events
ℓ_θ | The base cross-entropy loss function
p̂_n | The estimated propensity score (exposure probability)
w_n, c | SNIPS weight and its clipping threshold
J_sup | The propensity-aware SNIPS supervised objective
L_con | The temporal consistency contrastive loss
τ | Temperature parameter for the contrastive loss

References

  1. Wang, S.; Cao, L.; Wang, Y.; Sheng, Q.Z.; Orgun, M.A.; Lian, D. A Survey on Session-Based Recommender Systems. ACM Comput. Surv. 2021, 54, 1–38.
  2. Salampasis, M.; Katsalis, A.; Siomos, T.; Delianidi, M.; Tektonidis, D.; Christantonis, K.; Kaplanoglou, P.; Karaveli, I.; Bourlis, C.; Diamantaras, K. A Flexible Session-Based Recommender System for e-Commerce. Appl. Sci. 2023, 13, 3347.
  3. Wu, Y.; Yusof, Y. Emerging Trends in Real-Time Recommendation Systems: A Deep Dive into Multi-Behavior Streaming Processing and Recommendation for e-Commerce Platforms. J. Internet Serv. Inf. Secur. 2024, 14, 45–66.
  4. Casino, F.; Patsakis, C. An Efficient Blockchain-Based Privacy-Preserving Collaborative Filtering Architecture. IEEE Trans. Eng. Manag. 2019, 67, 1501–1513.
  5. Kang, W.C.; McAuley, J. Self-Attentive Sequential Recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 197–206.
  6. Qiu, R.; Li, J.; Huang, Z.; Yin, H. Rethinking the Item Order in Session-Based Recommendation with Graph Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 579–588.
  7. Li, A.; Zhu, J.; Li, Z.; Cheng, H. Transition Information Enhanced Disentangled Graph Neural Networks for Session-Based Recommendation. Expert Syst. Appl. 2022, 210, 118336.
  8. Pang, Y.; Wu, L.; Shen, Q.; Zhang, Y.; Wei, Z.; Xu, F.; Chang, E.; Long, B.; Pei, J. Heterogeneous Global Graph Neural Networks for Personalized Session-Based Recommendation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Tempe, AZ, USA, 21–25 February 2022; pp. 775–783.
  9. Min, E.; Chen, R.; Bian, Y.; Xu, T.; Zhao, K.; Huang, W.; Zhao, P.; Huang, J.; Ananiadou, S.; Rong, Y. Transformer for Graphs: An Overview from Architecture Perspective. arXiv 2022, arXiv:2202.08455.
  10. Yu, F.; Zhu, Y.; Liu, Q.; Wu, S.; Wang, L.; Tan, T. TAGNN: Target Attentive Graph Neural Networks for Session-Based Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 1921–1924.
  11. Xia, X.; Yin, H.; Yu, J.; Wang, Q.; Cui, L.; Zhang, X. Self-Supervised Hypergraph Convolutional Networks for Session-Based Recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 4503–4511.
  12. Chen, T.; Wong, R.C.W. Handling Information Loss of Graph Neural Networks for Session-Based Recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 23–27 August 2020; pp. 1172–1180.
  13. Greenland, S. Multiple-Bias Modelling for Analysis of Observational Data. J. R. Stat. Soc. Ser. A Stat. Soc. 2005, 168, 267–306.
  14. Bero, L.; Chartres, N.; Diong, J.; Fabbri, A.; Ghersi, D.; Lam, J.; Lau, A.; McDonald, S.; Mintzes, B.; Sutton, P.; et al. The Risk of Bias in Observational Studies of Exposures (ROBINS-E) Tool: Concerns Arising from Application to Observational Studies of Exposures. Syst. Rev. 2018, 7, 242.
  15. Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating Exposure Bias in Large Language Model Distillation: An Imitation Learning Approach. Neural Comput. Appl. 2025, 37, 12013–12029.
  16. Wang, X.; Bendersky, M.; Metzler, D.; Najork, M. Learning to Rank with Selection Bias in Personal Search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 115–124.
  17. Carnovalini, F.; Rodà, A.; Wiggins, G.A. Popularity Bias in Recommender Systems: The Search for Fairness in the Long Tail. Information 2025, 16, 151.
  18. Oosterhuis, H.; de Rijke, M. Robust Generalization and Safe Query-Specialization in Counterfactual Learning to Rank. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 158–170.
  19. Allan, V.; Ramagopalan, S.V.; Mardekian, J.; Jenkins, A.; Li, X.; Pan, X.; Luo, X. Propensity Score Matching and Inverse Probability of Treatment Weighting to Address Confounding by Indication in Comparative Effectiveness Research of Oral Anticoagulants. J. Comp. Eff. Res. 2020, 9, 603–614.
  20. Gao, C.; Li, S.; Zhang, Y.; Chen, J.; Li, B.; Lei, W.; Jiang, P.; He, X. KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 3953–3957.
  21. Normann, P.; Baumeister, S.; Wilm, T. OTTO Recommender Systems Dataset. Kaggle, 2023. Available online: https://www.kaggle.com/datasets/otto/recsys-dataset (accessed on 21 December 2025).
  22. Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; Tikk, D. Session-Based Recommendations with Recurrent Neural Networks. arXiv 2015, arXiv:1511.06939.
  23. Wu, S.; Tang, Y.; Zhu, Y.; Wang, L.; Xie, X.; Tan, T. Session-Based Recommendation with Graph Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 346–353.
  24. Wilm, T.; Normann, P.; Baumeister, S.; Kobow, P.V. Scaling Session-Based Transformer Recommendations Using Optimized Negative Sampling and Loss Functions. In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023; pp. 1023–1026.
  25. Wang, H.; Zeng, Y.; Chen, J.; Han, N.; Chen, H. Interval-Enhanced Graph Transformer Solution for Session-Based Recommendation. Expert Syst. Appl. 2023, 213, 118970.
  26. Swaminathan, A.; Joachims, T. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: New York, NY, USA, 2015; pp. 814–823.
  27. Swaminathan, A.; Joachims, T. The Self-Normalized Estimator for Counterfactual Learning. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
  28. Joachims, T.; Swaminathan, A.; Schnabel, T. Unbiased Learning-to-Rank with Biased Feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 781–789.
  29. Saito, Y.; Aihara, S.; Matsutani, M.; Narita, Y. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. arXiv 2020, arXiv:2008.07146.
  30. OBP Documentation: Open Bandit Pipeline. Available online: https://zr-obp.readthedocs.io/en/latest/ (accessed on 10 November 2025).
  31. Wu, J.; Wang, X.; Feng, F.; He, X.; Chen, L.; Lian, J.; Xie, X. Self-Supervised Graph Learning for Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 726–735.
  32. Yu, J.; Yin, H.; Xia, X.; Chen, T.; Cui, L.; Nguyen, Q.V.H. Are Graph Augmentations Necessary? Simple Graph Contrastive Learning for Recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1294–1303.
  33. Chong, L.; Liu, X.; Zheng, R.; Zhang, L.; Liang, X.; Li, J.; Wu, L.; Zhang, M.; Lin, L. CT4Rec: Simple yet Effective Consistency Training for Sequential Recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 3901–3913.
  34. Babaev, D.; Ovsov, N.; Kireev, I.; Ivanova, M.; Gusev, G.; Nazarov, I.; Tuzhilin, A. CoLES: Contrastive Learning for Event Sequences with Self-Supervision. In Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA, 12–17 June 2022; pp. 1190–1199.
  35. Zhu, Z.; Chen, J.; Yang, C.; Chen, J.; Wu, T.; Wang, Z.; Li, J.; Ofek, E.; He, X.; Bian, J. CausalRec: Causal Inference for Visual Debiasing in Visually-Aware Recommendation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 2356–2364.
  36. Jiang, N.; Li, L. Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 652–661.
  37. Li, H.; Liu, P.; Zhao, Y.; Wang, X.; Wang, J.; Su, H.; Zhu, J. Multiple Robust Learning for Recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 4300–4308.
  38. Luo, H.; Zhuang, F.; Xie, R.; Zhu, H.; Wang, D.; An, Z.; Xu, Y. A Survey on Causal Inference for Recommendation. Innovation 2024, 5, 100590.
  39. Gao, C.; Zheng, Y.; Wang, W.; Feng, F.; He, X.; Li, Y. Causal Inference in Recommender Systems: A Survey and Future Directions. ACM Trans. Inf. Syst. 2024, 42, 88:1–88:32.
  40. ZOZO Research: Open Bandit Dataset. Available online: https://research.zozo.com/data.html (accessed on 10 November 2025).
  41. Tan, Y.K.; Xu, X.; Liu, Y. Improved Recurrent Neural Networks for Session-Based Recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 17–22.
Figure 1. The end-to-end framework illustrating session graph construction, the interval-aware graph transformer, propensity-aware objective, and temporal consistency module.
Figure 2. The propensity-aware learning flow, showing propensity estimation from randomized logs and the application of SNIPS weighting.
Figure 3. The view-free temporal consistency mechanism, showing the prefix–suffix split, temporal decay weighting, and contrastive alignment.
Figure 4. Comprehensive calibration analysis on KuaiRand and OTTO. (a,c) Reliability diagrams comparing IGT-Base and PTE-GT; the diagonal (dashed line) represents perfect calibration. (b,d) Histograms of predicted confidence for ground-truth items, illustrating that PTE-GT avoids the overconfidence of the baseline.
Figure 5. Comprehensive analysis of long-tail performance across item popularity deciles (1 = most popular, 10 = long-tail). (Top row) shows Recall@20. (Bottom row) shows MRR@20.
Figure 6. SNIPS-weighted long-tail performance by item popularity decile on KuaiRand and OBD. The figure reports SNIPS-R@20 and SNIPS-M@20 across popularity deciles.
Figure 7. Component-wise ablation on KuaiRand and OTTO. Panel (a) shows mean SNIPS-R@20 on KuaiRand, panel (b) shows SNIPS-R@20 on the bottom 60% least popular items on KuaiRand (long-tail), and panel (c) shows standard R@20 on OTTO. Error bars indicate the standard deviation over five runs.
Figure 8. Sensitivity to the SNIPS clipping threshold c on KuaiRand-27K. The bars (left axis) show SNIPS-R@20, peaking at c = 20 . The line (right axis, log scale) shows the weight variance, which increases exponentially after the optimal point, illustrating the bias–variance trade-off.
Figure 9. Bias–variance and propensity model analysis on KuaiRand and OBD. Panel (a) shows SNIPS-R@20 as a function of the clipping threshold c for KuaiRand and OBD using the LR-Full propensity model. Panel (b) plots the corresponding variance of the SNIPS weights on a logarithmic scale. Panel (c) compares ROC-AUC versus ECE for three propensity models (LR-Restrict, LR-Full, MLP-2layer) on both datasets. Panel (d) shows SNIPS-R@20 versus Var ( w ) .
Figure 10. Effect of the context window size w on OBD. Panels (a,b) show SNIPS-R@20 and SNIPS-M@20, respectively, for PTE-GT and SASRec-IPS as w varies from 0 to 30 impressions. Points denote means over three random seeds and error bars denote one standard deviation.
Figure 11. Qualitative analysis of the interval-aware attention mechanism. (a) Attention mass distribution over four structural edge types, revealing head specialization. (b) t-SNE projection of Fourier interval embeddings ( Δ t ). Colors show the dominant attention head, demonstrating a clear alignment between structural preference (a) and temporal preference (b).
Figure 12. Case: Session flow Sankey with evidence weights. Evidence concentrates on time-salient shortcut edges ( C     F , B     E ), forming dominant paths to the hit item L (P4: 30.2%, P1: 28.5%, P2: 26.6%), while the purely sequential path (P3: 14.7%) is comparatively weaker. This highlights non-sequential, structure-aware reasoning afforded by the graph transformer.
Table 1. Dataset descriptive statistics. Note: OBD consists of impression-level bandit logs with logged propensities; session-length metrics are not directly applicable.

Dataset (Split) | Sequence Unit | Sequences | Items | Interactions | Avg. Length | Randomized Propensities
KuaiRand-27K | user history | 27,285 | 32,038,725 | 322,278,385 | – | Yes
Open Bandit Dataset | impression log | 26 M | 80 | – | – | Yes (logged bandit)
OTTO (train) | session | 12,899,779 | 1,855,603 | 216,716,096 | 16.80 | No
YOOCHOOSE-1/64 | session | 369,859 | 16,766 | 557,248 | 6.16 | No
Table 2. Comparison of propensity models on KuaiRand and OBD. All SNIPS-R@20 and Var(w) values use clipping threshold c = 20.

Dataset | Propensity Model | AUC | ECE (%) | Brier (%) | SNIPS-R@20 (%) | Var(w)
KuaiRand | LR-Restrict | 0.74 | 2.80 | 7.50 | 12.47 | 0.83
KuaiRand | LR-Full (ours) | 0.78 | 1.90 | 7.12 | 13.05 | 1.00
KuaiRand | MLP-2layer | 0.80 | 1.60 | 7.05 | 13.09 | 1.90
OBD | LR-Restrict | 0.76 | 2.50 | 6.50 | 12.00 | 0.78
OBD | LR-Full (ours) | 0.80 | 1.70 | 6.20 | 12.74 | 1.00
OBD | MLP-2layer | 0.82 | 1.50 | 6.15 | 12.81 | 1.85
Table 3. Comprehensive performance and complexity comparison on the OTTO and YOOCHOOSE-1/64 datasets (%). All results are averaged over five random seeds.

Model | Params (M) | OTTO (Clicks): R@10 | N@10 | M@10 | R@20 | N@20 | M@20 | YOOCHOOSE-1/64: R@10 | N@10 | M@10 | R@20 | N@20 | M@20

Baselines (Reproduced/Literature)
GRU4Rec+ | 5.2 | 27.12 | 14.03 | 13.51 | 44.31 | 27.84 | 20.53 | 38.17 | 25.02 | 15.11 | 60.63 | 34.16 | 22.89
SASRec | 6.8 | 19.04 | 11.52 | 11.03 | 30.71 | 20.13 | 18.02 | 45.91 | 32.14 | 19.06 | 69.17 | 40.82 | 29.88
SR-GNN | 7.1 | 28.03 | 14.81 | 14.12 | 45.83 | 28.91 | 21.14 | 48.02 | 34.89 | 20.18 | 70.56 | 42.11 | 30.94
TRON | 7.0 | 29.01 | 15.53 | 14.92 | 47.24 | 30.03 | 21.91 | 48.53 | 35.11 | 20.52 | 70.83 | 42.51 | 31.02

Ours
IGT-Base | 7.5 | 29.54 | 15.92 | 15.21 | 47.91 | 30.52 | 22.34 | 48.71 | 35.26 | 20.61 | 70.97 | 42.63 | 31.07
PTE-GT (Full) | 7.7 | 30.53 | 16.64 | 15.82 | 49.15 | 31.54 | 23.01 | 49.84 | 36.23 | 21.42 | 71.86 | 43.82 | 32.05
Improv. (vs. TRON) | – | +5.24% | +7.15% | +6.03% | +4.04% | +5.03% | +5.02% | +2.70% | +3.19% | +4.39% | +1.45% | +3.08% | +3.32%
Table 4. Unbiased evaluation on KuaiRand and OBD (SNIPS-weighted, %). All metrics are averaged over five random seeds.

Model | KuaiRand-27K: SNIPS-R@20 (%) | SNIPS-M@20 (%) | OBD (All): SNIPS-R@20 (%) | SNIPS-M@20 (%)
SASRec | 10.15 | 4.32 | 9.92 | 4.18
IGT-Base | 10.88 | 4.65 | 10.63 | 4.44
SASRec-IPS | 11.75 | 5.02 | 11.48 | 4.87
CausalRec-adapt | 12.03 | 5.21 | 11.96 | 5.04
DR-JL | 12.54 | 5.32 | 12.31 | 5.17
PTE-GT (Full Model) | 13.05 | 5.58 | 12.74 | 5.36
Table 5. Calibration performance on OTTO and KuaiRand (ECE (%), Brier (%)).

Model | OTTO (Standard): ECE (%) | Brier (%) | KuaiRand (Unbiased): ECE (%) | Brier (%)
IGT-Base | 5.12 | 7.82 | 6.45 | 9.18
PTE-GT | 2.85 | 5.11 | 3.11 | 6.24
Table 6. Ablation study of PTE-GT components. IGT-Base is our backbone, PA applies the SNIPS objective, and TC adds temporal consistency regularization.

Model Variant | KuaiRand-27K (Unbiased): SNIPS-R@20 (%) | ECE (%) | Long-Tail R@20 (%) | OTTO (Standard): R@20 (%) | ECE (%)
(1) IGT-Base | 10.88 | 6.45 | 4.93 | 47.92 | 5.12
(2) w/o PA (IGT-Base + TC) | 11.05 | 6.23 | 5.11 | 48.31 | 4.95
(3) w/o TC (IGT-Base + PA) | 12.65 | 3.33 | 7.81 | 48.82 | 3.12
(4) PTE-GT (Full Model) | 13.05 | 3.11 | 8.15 | 49.13 | 2.85
Table 7. Statistical significance of PTE-GT compared to strong baselines (R@20 or SNIPS-R@20, five random seeds). Mean differences and confidence intervals are reported in percentage points (pp).

Dataset | Metric | Comparison | Mean Diff (pp) | 95% CI (pp) | p-Value
OTTO | R@20 | PTE-GT vs. TRON | +1.91 | [+1.70, +2.10] | $<10^{-4}$
OTTO | R@20 | PTE-GT vs. IGT-Base | +1.24 | [+1.05, +1.43] | 0.0003
YOOCHOOSE | R@20 | PTE-GT vs. SR-GNN | +1.30 | [+0.65, +1.95] | 0.002
YOOCHOOSE | R@20 | PTE-GT vs. IGT-Base | +0.89 | [+0.05, +1.73] | 0.041
KuaiRand | SNIPS-R@20 | PTE-GT vs. IGT-Base | +2.17 | [+1.96, +2.38] | $<10^{-4}$
OBD | SNIPS-R@20 | PTE-GT vs. SASRec-IPS | +1.26 | [+0.72, +1.80] | 0.001
OBD | SNIPS-R@20 | PTE-GT vs. DR-JL | +0.64 | [+0.15, +1.13] | 0.033