Electronics
  • Article
  • Open Access

9 December 2025

SAGERec: Semantic-Aware Global Graph-Enhanced Representation Learning for Sequential Recommendation

Department of Engineering, King’s College London, Strand, London WC2R 2LS, UK
* Author to whom correspondence should be addressed.

Abstract

Sequential recommendation aims to model evolving user preferences based on historical interactions. Transformer-based architectures have achieved strong performance by focusing on user-level sequential patterns, yet global item–item relationships are often underrepresented, limiting the ability to capture broader contextual signals. In many real-world scenarios, items contain rich textual attributes such as descriptions and categories, but these semantic features are seldom exploited in existing sequential models. To address this gap, a Semantic-Aware Global Graph-Enhanced Sequential Recommendation framework (SAGERec) is developed, in which globally derived semantic structures are incorporated to enrich item representations before sequence modeling. Large language models (LLMs) are used to generate semantically grounded item embeddings, from which a global item–item graph is constructed to capture content-level relations that extend beyond behavioral co-occurrence. These semantic relations are further refined through an adaptive edge-weight learning mechanism, enabling the graph structure to align with evolving item representations during training. The adaptively enhanced item embeddings are subsequently integrated into a lightweight Transformer-based sequential encoder for next-item prediction. Extensive experiments on three benchmark datasets demonstrate that the proposed framework consistently outperforms competitive baselines, indicating that integrating LLM-derived semantics with adaptive graph refinement leads to more expressive sequential representations.

1. Introduction

Traditional recommender systems, such as collaborative filtering (CF) and matrix factorization (MF) models, have achieved remarkable success in personalized content delivery by learning static user–item interaction patterns from historical data [1]. These approaches typically assume that user preferences remain consistent over time and rely on aggregated interactions to estimate user–item affinity. While effective in modeling long-term user interests, such static representations inherently overlook the temporal dynamics of user behavior—how preferences evolve as users are exposed to new items, contexts, and trends. As a result, conventional recommenders often fail to adapt to rapidly changing environments, especially in domains where user intent is highly time-sensitive. Sequential recommendation has emerged as a powerful paradigm to address this limitation by explicitly modeling the temporal order of user interactions [2,3,4]. Instead of treating past interactions as unordered sets, sequential recommenders consider user behavior as an ordered sequence, capturing the transitions and dependencies between consecutive actions. This paradigm enables the system to infer short-term interests and contextual patterns, which are particularly crucial in dynamic application scenarios such as e-commerce, video streaming, and news platforms [5]. Nevertheless, sequential recommendation remains challenging because the model needs to capture users’ dynamic and evolving interests while maintaining robustness under data sparsity and behavioral diversity.
In the early development of sequential recommendation, deep learning models based on recurrent and convolutional architectures were widely explored to capture temporal dependencies in user behaviors. Recurrent neural networks (RNNs), exemplified by GRU4Rec [5], model user interactions through hidden state transitions, effectively capturing short-term sequential patterns. Similarly, convolutional neural networks (CNNs), such as Caser [6], encode sequential information by applying convolutional filters over item embeddings, extracting local transition features within fixed receptive fields. While these architectures successfully introduced temporal awareness into recommender systems, they are inherently limited in modeling long-range dependencies due to their sequential processing nature and constrained contextual scope. To overcome these limitations, transformer-based models employing self-attention mechanisms have emerged as a new paradigm for sequential recommendation. By enabling each item in a sequence to directly attend to all previous interactions, self-attention networks can capture both local and global dependencies without the need for explicit recurrence or convolution. In recent years, models such as SASRec [2] and BERT4Rec [7] have demonstrated remarkable capabilities in modeling complex user behavior dynamics and contextual relationships between items. Compared with earlier RNN- or CNN-based approaches, transformer architectures offer superior flexibility, scalability, and parallelization during training and inference, leading to significant performance gains across a variety of recommendation benchmarks.
Although transformer-based models have significantly advanced the field of sequential recommendation, they still suffer from several inherent limitations that restrict their modeling capacity [8]. These models capture item dependencies solely based on the order of interactions within individual sequences, without leveraging global item–item relationships that exist across the system [9]. This limitation restricts the richness of item representations and weakens the model’s ability to capture structured knowledge about how items relate to each other, which in turn degrades recommendation quality. The impact becomes particularly pronounced when user behavior sequences are short or contain limited contextual diversity [10]. In such cases, the model is forced to rely on sparse or noisy patterns, making it difficult to identify meaningful user preferences from the observed data alone. Therefore, it is necessary to move beyond sequential co-occurrence and incorporate global structural knowledge into the modeling process. One natural way to encode such global relationships is through a graph structure, where each edge between two items can represent a specific relationship. This enables the model to explicitly capture how items are organized and connected across the entire catalog, providing complementary information to the sequential signal. But the question is, where do such global item–item relationships come from? Fortunately, in many real-world systems, items are accompanied by rich textual metadata, such as titles, descriptions, and category labels [11]. These semantic signals often reveal functional or conceptual similarities between items that are not apparent from user interaction data alone. However, existing models often treat such information as auxiliary, rather than integrating it structurally.
In summary, current sequential recommendation models suffer from three key limitations:
  • They rely solely on local item transitions, ignoring global item–item relationships.
  • They depend entirely on user interaction sequences, which are often sparse or noisy.
  • They underutilize rich item-side textual metadata.
To address these issues, this paper aims to enhance sequential recommendation by explicitly modeling global item–item relationships derived from semantic information and integrating them into the sequence learning process. We propose Semantic-Aware Global Graph-Enhanced Sequential Recommendation (SAGERec), a novel framework that combines semantic-aware graph modeling with streamlined self-attentive sequence modeling. Specifically, we first leverage a large language model (LLM) to encode item-level metadata into high-quality semantic embeddings. Based on these embeddings, we construct a global item graph that captures semantic similarities between items. To ensure adaptability, we introduce an adaptive graph learning module that learns to refine the graph structure dynamically during training. The enhanced item representations are then integrated into a transformer-based sequential model. Extensive experiments on benchmark datasets demonstrate that our model outperforms strong baselines across multiple metrics. The main contributions of this work can be summarized as follows:
  • We propose a novel framework for sequential recommendation named SAGERec, which unifies a semantic-aware global graph and transformer-based sequence modeling to jointly capture global item relationships and dynamic user preferences.
  • We introduce a learnable edge-weight module that dynamically adjusts semantic connections between items, enhancing the flexibility and generalization of the graph structure.
  • Extensive experiments on multiple real-world datasets verify the superiority of SAGERec over state-of-the-art baselines and provide insights into the benefits of incorporating semantic-aware graph enhancement.

2. Literature Review

Sequential recommendation has emerged as a critical task in personalized recommendation systems, aiming to predict a user’s next interaction based on their historical behavior sequence [2,7,12,13]. Unlike traditional collaborative filtering methods that treat user–item interactions as unordered pairs, sequential recommendation explicitly models the temporal order, thereby capturing short-term preferences and behavioral dynamics more effectively [6]. Early efforts relied on Markov chains or factorization-based models to model item transitions [14]. However, these approaches were limited in their ability to capture long-range dependencies and non-linear patterns. The introduction of deep learning significantly advanced the field. GRU4Rec pioneered the use of Gated Recurrent Units (GRUs) to model sequential interactions, achieving improved performance by learning hidden states across time steps [5]. Later, NARM [15] combined GRUs with an attention mechanism to explicitly model users’ intent, enhancing the interpretability and expressiveness of the sequential model. Meanwhile, transformer-based architectures have set new benchmarks in this field. SASRec [2] employs self-attention mechanisms to capture dependencies across long sequences. BERT4Rec [7] extends this idea using bidirectional masked language modeling to learn item representations more effectively. TiSASRec [3] further integrates time intervals to model temporal gaps explicitly, offering more accurate sequence modeling in time-sensitive scenarios. However, these models primarily focus on item transitions within sequences and fail to capture the rich structural information underlying user–item interactions.
Graph-based collaborative filtering methods leverage the expressive power of graph neural networks to model high-order interaction signals. Early approaches such as GCMC [16], PinSage [17], and NGCF [18] propagate user–item interaction information through graph convolutions, enabling models to capture multi-hop dependencies that go beyond pairwise interactions. Subsequent work focuses on simplifying these architectures. LightGCN [19] removes feature transformation and non-linearities to retain pure collaborative signals, while LR-GCCF [20] introduces linear residual propagation to stabilize deep aggregation. Self-supervised variants such as SGL [21] and HCCF [22] create contrastive graph views through edge dropout, random perturbations, or hypergraph augmentations to improve robustness under sparsity. Despite their strengths, these graph-based models operate solely on the interaction graph and do not consider item semantics or contextual attributes. The resulting graph structure is typically fixed throughout training and may fail to represent relationships that do not appear in the historical interaction data. Moreover, these approaches do not explicitly model sequential patterns, making them less suitable for next-item prediction tasks where temporal order carries essential behavioral information.
Hybrid architectures combine graph modeling with sequential encoders to enrich item transition patterns using structural information. Several representative frameworks explore different ways of integrating graph signals into sequential modeling. GC-SAN [23] augments a self-attention encoder with an item-transition graph derived from session behaviors. The graph captures localized co-occurrences while the Transformer layers focus on temporal dependencies. However, the graph is limited to short-range transitions and does not encode global semantic relations. LESSR [24] constructs a session graph and introduces specialized shortcut connections to mitigate over-squashing during propagation. This design helps preserve long-range dependencies within sessions, but the graph remains session-specific and is not able to leverage cross-session global relationships. GCE-GNN [25] aggregates item co-occurrence statistics across the entire corpus, generating a global item graph to capture higher-order connectivity. While this improves over purely local-session methods, the constructed graph is still entirely behavior-driven and cannot reflect semantic similarities that do not manifest in interaction logs. Overall, existing hybrid models typically rely on transition graphs derived from behavioral patterns—either within sessions or across the whole dataset. These graphs are usually static and cannot adapt during training, and they seldom incorporate textual or semantic information that could offer complementary signals. These limitations indicate the need for architectures capable of leveraging richer item relationships (beyond co-occurrence) and supporting more flexible edge modeling while maintaining strong sequential modeling capabilities.
Recent efforts have explored multi-modal fusion [26], knowledge-aware recommendation [27,28], and language model-based item representations [29], enriching item semantics with textual content or external knowledge. Meanwhile, contrastive and self-supervised learning have been applied to sequential models to enhance robustness, as seen in S3-Rec [30], DuoRec [31], and SALARec [32]. Some hybrid models also attempt to combine global item–item graphs with local sequence modeling [33], but such fusion strategies often remain shallow and static, failing to adapt the graph structure in a personalized or dynamic way.
In parallel with graph-based and hybrid sequential models, recent work has explored leveraging LLMs directly as recommenders by converting recommendation tasks into natural-language generation or retrieval problems [34]. For example, MLLM4Rec [35] integrates multimodal inputs (e.g., image descriptions) into LLMs via prompt construction and instruction tuning. Other approaches, such as LLM-Rec [36], formulate recommendation as a natural-language reasoning task by converting user behavior sequences into text prompts and letting LLMs infer user preference patterns through in-context learning. TokenRec [37] further tokenizes user–item interactions and represents recommendation as a next-token prediction problem. While these LLM-based approaches demonstrate strong capacity, they also face several practical limitations. First, generative inference is computationally expensive and does not scale well to large item catalogs. Second, LLMs are susceptible to hallucination [38]. Third, because these methods operate primarily through text-conditioned reasoning, they do not explicitly leverage collaborative signals or model global item–item structures. As a result, they may struggle to accurately capture fine-grained semantic proximity (e.g., substitutes versus complements) or exploit neighborhood information for long-tail items, which limits their ability to serve as a full replacement for graph-enhanced collaborative filtering.
Motivated by these gaps, we propose a novel semantic-aware and dynamically adaptive graph-enhanced sequential recommendation model, which constructs a global item–item graph from pretrained semantic embeddings, and refines this structure via edge-wise learning based on current user behaviors. By integrating this global graph into a transformer-based sequence encoder and jointly optimizing with self-supervised objectives, our method effectively captures both short-term user interests and global item semantics in a unified framework.

3. Problem Formulation and Notation

In this section, we formally define the sequential recommendation task and summarize the notations used throughout this paper. Let $\mathcal{U} = \{u_1, u_2, \ldots, u_M\}$ denote the set of users and $\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$ denote the set of items in the system. Each user $u \in \mathcal{U}$ is associated with an interaction sequence $S_u = [v_1^u, v_2^u, \ldots, v_{T_u}^u]$, where each $v_t^u \in \mathcal{V}$ corresponds to an item that the user has interacted with at timestep $t$, and $T_u$ is the sequence length. The objective of sequential recommendation is to predict the next item $v_{T_u+1}^u$ that the user will engage with, conditioned on their historical interactions $S_u$.
Formally, the model learns a parameterized scoring function:
$$f_\theta : S_u \mapsto \mathbf{r}_u \in \mathbb{R}^{|\mathcal{V}|},$$
where $\theta$ represents the learnable parameters of the model, and $\mathbf{r}_u$ is the ranking score vector over all candidate items. Higher scores in $\mathbf{r}_u$ indicate stronger predicted relevance to the user’s next interaction. The top-K items with the highest scores are returned as the recommendation list for user $u$. To facilitate clarity and consistency, Table 1 summarizes the main mathematical notations adopted in the proposed framework.
Table 1. List of key notations used throughout the paper.

4. Methodology

In this paper, we propose a novel sequential recommendation framework named SAGERec that integrates global item–item graph learning with a self-attention-based network, as illustrated in Figure 1. The central motivation behind our approach is to jointly capture two complementary aspects of user–item interactions: the global semantic correlations among items and the sequential dependencies inherent in user behavioral patterns. Traditional sequential recommenders often rely solely on local temporal transitions, which limits their ability to recognize global contextual relevance between items that share similar content, brand, or functional semantics but may never co-occur within the same user sequence. To address this limitation, our framework incorporates a global semantic graph that establishes item–item relationships beyond co-interaction statistics, thereby introducing a new dimension of semantic awareness into the recommendation process. Specifically, we first utilize pretrained LLMs to encode structured item metadata into dense semantic embeddings. These embeddings capture fine-grained textual semantics and serve as the foundation for constructing a semantic-aware global graph, where edges represent the top-K most semantically similar relationships among items. Subsequently, a graph convolutional operation is applied to the item representations to aggregate and propagate information across the semantic graph. This process enriches each item’s embedding by incorporating signals from its semantically relevant neighbors. To enhance adaptability and mitigate noise from semantic similarity, we introduce a learnable edge-weight mechanism that dynamically refines the graph connections during training. Finally, the refined item embeddings are fed into a stack of self-attention-based sequential modeling blocks, which effectively capture long-range dependencies and temporal patterns in user interaction histories. 
The following sections provide a detailed description of each component.
Figure 1. The overall framework of our proposed model.

4.1. Embedding Layer

In the proposed framework, each item $v_i \in \mathcal{V}$ is represented by a learnable dense embedding vector $\mathbf{e}_i \in \mathbb{R}^{d}$, where $d$ denotes the latent dimensionality of the embedding space and $\mathcal{V}$ is the set of all unique items. All item representations are organized into a unified embedding matrix
$$\mathbf{E}^{(0)} = [\mathbf{e}_1; \mathbf{e}_2; \ldots; \mathbf{e}_{|\mathcal{V}|}] \in \mathbb{R}^{|\mathcal{V}| \times d},$$
where each row corresponds to the initial embedding of one item, and $|\mathcal{V}|$ denotes the total number of items in the catalog. At the beginning of training, the matrix $\mathbf{E}^{(0)}$ is randomly initialized and subsequently optimized in an end-to-end manner through gradient backpropagation. Formally, for a given sequence of user interactions $S_u = [v_1, v_2, \ldots, v_T]$, the embedding layer maps each item index to its corresponding vector representation:
$$\mathbf{X}_u^{(0)} = [\mathbf{e}_{v_1}, \mathbf{e}_{v_2}, \ldots, \mathbf{e}_{v_T}] \in \mathbb{R}^{T \times d},$$
where $T$ denotes the sequence length. The resulting embedding matrix $\mathbf{X}_u^{(0)}$ provides the input to both the semantic graph enhancement module and the sequential encoder. Through this embedding initialization and mapping process, the model establishes a continuous latent space that facilitates subsequent information propagation across both semantic and temporal dimensions.

4.2. Semantic Graph Construction

To capture high-level semantic relationships among items beyond user–item co-occurrence patterns, we leverage pretrained LLMs to convert textual product metadata into dense semantic embeddings. These embeddings form the foundation for constructing a global item–item graph that captures semantic proximity in the latent space. For each item $v_i \in \mathcal{V}$, we concatenate its key–value attributes, such as title, description, category, and brand, into a unified natural-language prompt that serves as input to the pretrained LLM encoder. Formally, given a set of attribute–value pairs $A_i = \{(k_1, a_i^{(1)}), (k_2, a_i^{(2)}), \ldots, (k_m, a_i^{(m)})\}$, the prompt is defined as:
$$\text{Prompt}(v_i) = \text{``}k_1\text{: } a_i^{(1)}\text{. } k_2\text{: } a_i^{(2)}\text{. } \ldots\text{''},$$
where each attribute name $k_j$ is paired with its textual content $a_i^{(j)}$. This natural-language prompt encapsulates multiple heterogeneous attributes into a coherent textual form, enabling the LLM to derive a semantically rich embedding that captures both descriptive and contextual nuances of the item.
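The prompt template can be illustrated with a few lines of Python. This is a minimal sketch rather than the authors' code: the function name `build_prompt` and the choice to skip attributes with empty text are assumptions made for the example.

```python
def build_prompt(attributes):
    """Join (name, text) pairs into the form 'k1: a1. k2: a2.'.

    `attributes` is an ordered list of (attribute name, text) pairs.
    Skipping pairs whose text is empty is an assumption of this sketch.
    """
    parts = []
    for name, value in attributes:
        value = str(value).strip()
        if value:
            parts.append(f"{name}: {value}.")
    return " ".join(parts)


# Toy item using title and genre metadata.
item = [("Title", "The Matrix (1999)"), ("Genres", "Action; Sci-Fi")]
print(build_prompt(item))
# Title: The Matrix (1999). Genres: Action; Sci-Fi.
```

The resulting string is what would be passed to the LLM encoder; the encoder call itself is not shown because the text does not specify its interface.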
The pretrained LLM encodes each prompt into a continuous semantic embedding via its encoder function $f_{\text{LLM}}(\cdot)$:
$$\mathbf{e}_i^{s} = f_{\text{LLM}}\big(\text{Prompt}(v_i)\big), \quad \mathbf{e}_i^{s} \in \mathbb{R}^{d},$$
where $d$ denotes the embedding dimension.
The semantic embeddings for all items are then stacked into a matrix:
$$\mathbf{E}^{s} = \big[(\mathbf{e}_1^{s})^\top; (\mathbf{e}_2^{s})^\top; \ldots; (\mathbf{e}_{|\mathcal{V}|}^{s})^\top\big] \in \mathbb{R}^{|\mathcal{V}| \times d},$$
which serves as the semantic embedding space used to construct the global item–item graph.
To model semantic proximity between items, we compute the pairwise similarity between their LLM-derived semantic embeddings using the cosine similarity function:
$$S^{s}_{ij} = \frac{(\mathbf{e}_i^{s})^\top \mathbf{e}_j^{s}}{\|\mathbf{e}_i^{s}\| \, \|\mathbf{e}_j^{s}\|},$$
where $S^{s}_{ij}$ measures the semantic correlation between items $v_i$ and $v_j$, and $\|\mathbf{e}_i^{s}\|$ denotes the Euclidean norm of the semantic embedding vector $\mathbf{e}_i^{s}$. A higher similarity score indicates that two items share stronger semantic relevance, such as similar textual descriptions or categories. To obtain a sparse yet informative semantic structure, we retain for each item its top-$k$ most similar neighbors with similarity scores above a predefined threshold $\tau$. Formally, an undirected edge $(i, j)$ is created if and only if:
$$S^{s}_{ij} \geq \tau, \quad j \in \text{Top-}k(i),$$
where $\text{Top-}k(i)$ denotes the set of items with the $k$ highest similarity values to item $i$. The resulting edge set is defined as:
$$\mathcal{E}^{(0)} = \{(i, j) \mid S^{s}_{ij} \geq \tau,\; j \in \text{Top-}k(i)\}.$$
By combining the node set $\mathcal{V}$ and edge set $\mathcal{E}^{(0)}$, we construct the initial semantic graph:
$$\mathcal{G}^{(0)} = (\mathcal{V}, \mathcal{E}^{(0)}),$$
which encodes the global semantic structure among items derived from pretrained language representations. This semantic graph acts as a complementary information source to the user–item interaction graph. To clarify the semantic graph construction process, consider a real example from the ML-1M dataset. Suppose an item corresponds to the movie “The Matrix (1999)” with genres Action, Sci-Fi. We first convert its metadata into a natural-language prompt, such as: “Title: The Matrix (1999). Genres: Action; Sci-Fi.” Feeding this prompt into the pretrained LLM encoder yields a semantic embedding $\mathbf{e}_i^{s}$ that captures the movie’s thematic elements. When computing cosine similarity with all other movies, films with similar themes, such as “The Matrix Reloaded (2003)” or “Terminator 2: Judgment Day (1991)”, receive higher similarity scores. These movies therefore appear in the item’s top-$k$ semantic neighbors and form edges in the global graph. This example illustrates how raw metadata (title and genres) is transformed into LLM embeddings and then into meaningful semantic connections in the graph. In the next stage, we employ graph convolution to propagate and aggregate information across the semantic graph, thereby enriching each item embedding with contextual knowledge from its semantically related neighbors.
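The top-k and threshold rule can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the helper names, the two-dimensional toy embeddings, the exclusion of an item from its own neighbor list, and the choice to store each undirected edge once as a pair (i, j) with i < j are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def build_semantic_edges(embeddings, k, tau):
    """Return undirected edges (i, j), i < j, where j is among the k most
    similar items to i (or vice versa) and the similarity is at least tau."""
    n = len(embeddings)
    edges = set()
    for i in range(n):
        sims = [(cosine(embeddings[i], embeddings[j]), j) for j in range(n) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))  # most similar first
        for sim, j in sims[:k]:
            if sim >= tau:
                edges.add((min(i, j), max(i, j)))
    return edges


# Toy semantic embeddings: items 0 and 1 are close, items 2 and 3 are close.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(sorted(build_semantic_edges(toy, k=1, tau=0.5)))
# [(0, 1), (2, 3)]
```

The double loop computes all pairwise similarities, which is quadratic in the number of items; for a large catalog an approximate nearest-neighbor index would be a natural substitute.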

4.3. Adaptive Graph Learning

To effectively integrate semantic neighborhood information, we perform graph convolution over the constructed semantic graph $\mathcal{G}^{(0)} = (\mathcal{V}, \mathcal{E}^{(0)})$. This process allows each item to aggregate contextual signals from its semantically related neighbors, thereby refining its embedding representation in a structure-aware manner. Inspired by the findings of LightGCN [19], we remove unnecessary nonlinear transformations and intermediate weight matrices in order to preserve the pure collaborative signal and avoid information distortion during propagation. Moreover, a residual connection is introduced to maintain embedding stability and mitigate the over-smoothing effect that typically occurs in deeper GCN layers.
Formally, the propagation at the $(l+1)$-th layer is expressed as:
$$\mathbf{E}^{(l+1)} = \mathbf{E}^{(l)} + \tilde{\mathbf{A}} \mathbf{E}^{(l)},$$
where $\mathbf{E}^{(l)} \in \mathbb{R}^{|\mathcal{V}| \times d}$ denotes the item embedding matrix at layer $l$, and $\tilde{\mathbf{A}}$ is the symmetrically normalized adjacency matrix defined as:
$$\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}},$$
where $\mathbf{A}$ is the adjacency matrix derived from $\mathcal{E}^{(0)}$, and $\mathbf{D}$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. In our default configuration, we do not include self-loops in the semantic adjacency matrix, as each item already retains its own representation through the residual connections in the GCN layer. For completeness, we further tested a variant where self-loops were added to the graph (i.e., $\mathbf{A} \leftarrow \mathbf{A} + \mathbf{I}$). Across all datasets and metrics, the differences remained negligible (less than 0.2%), indicating that self-loops do not provide additional benefits under our architecture. This is likely because the adaptive edge-learning module and residual pathways already preserve each item’s intrinsic information effectively. The residual connection preserves each item’s identity representation, mitigating the risk of representation collapse while allowing multi-hop semantic propagation.
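A single propagation layer with symmetric normalization and a residual term can be sketched as below. The function names and the three-node toy graph are hypothetical, and matrices are plain nested lists to keep the example dependency-free; treating an isolated node's normalization factor as zero is also an assumption of this sketch.

```python
import math


def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2} for a square adjacency matrix."""
    degrees = [sum(row) for row in A]
    # Isolated nodes (degree 0) get a zero factor, so they receive no messages.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def propagate(E, A_norm):
    """One layer: E_next = E + A_norm @ E (residual plus neighbor aggregation)."""
    n, d = len(E), len(E[0])
    out = []
    for i in range(n):
        row = []
        for c in range(d):
            aggregated = sum(A_norm[i][j] * E[j][c] for j in range(n))
            row.append(E[i][c] + aggregated)
        out.append(row)
    return out


# Toy path graph 0 - 1 - 2 without self-loops.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
E0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
E1 = propagate(E0, normalize_adjacency(A))
print(E1)
```

Because there is no weight matrix or nonlinearity, stacking layers only repeats this linear update, which is the simplification the text attributes to LightGCN.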
Although the semantic graph $\mathcal{G}^{(0)}$ captures global similarity derived from pretrained language representations, it remains static and cannot adapt to the dynamic optimization process of sequential recommendation. To introduce task-specific flexibility, we design an adaptive edge-learning mechanism that adjusts edge strengths based on the evolving item embeddings during training. For each edge $(i, j) \in \mathcal{E}^{(0)}$, its learnable weight $w_{ij}$ is computed using a lightweight multi-layer perceptron (MLP):
$$w_{ij} = \sigma\big(\mathbf{W}_2 \cdot \text{ReLU}(\mathbf{W}_1 [\mathbf{e}_i \,\|\, \mathbf{e}_j] + \mathbf{b}_1) + b_2\big),$$
where $[\mathbf{e}_i \,\|\, \mathbf{e}_j]$ is the concatenation of the two item embeddings, $\mathbf{W}_1 \in \mathbb{R}^{h \times 2d}$, $\mathbf{W}_2 \in \mathbb{R}^{1 \times h}$, $\mathbf{b}_1 \in \mathbb{R}^{h}$, and $b_2 \in \mathbb{R}$ are trainable parameters, and $\sigma(\cdot)$ is the sigmoid activation function. The learned scalar $w_{ij} \in [0, 1]$ quantifies the importance of edge $(i, j)$ and is used to reweight the adjacency matrix.
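The forward pass of the edge-weight MLP can be written out directly. The sketch below assumes toy dimensions (d = 2, h = 2) and hand-picked parameter values purely for illustration; in training these parameters would be learned, which is not shown here.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def edge_weight(e_i, e_j, W1, b1, W2, b2):
    """Compute sigmoid(W2 . ReLU(W1 [e_i || e_j] + b1) + b2).

    W1 is a list of h rows, each of length 2d; b1 and W2 have length h;
    b2 is a scalar.
    """
    x = list(e_i) + list(e_j)  # concatenation of the two embeddings
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    logit = sum(w * h for w, h in zip(W2, hidden)) + b2
    return sigmoid(logit)


# Hypothetical parameters for d = 2, h = 2.
W1 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b1 = [0.0, 0.0]
W2 = [1.0, -1.0]
b2 = 0.0
print(edge_weight([1.0, 2.0], [3.0, 4.0], W1, b1, W2, b2))
```

Note that concatenation is order-sensitive, so this function can in principle return different values for (i, j) and (j, i); how the source handles that for undirected edges is not stated.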
To improve generalization and reduce the risk of overfitting to noisy or weak semantic edges, we apply edge dropout during training. This operation randomly removes a portion of edges at each iteration, which encourages the model to rely on diverse semantic paths instead of depending too heavily on a small set of connections. We also tested adding dropout inside the edge MLP. This modification did not provide additional gains beyond the existing edge dropout and $\ell_2$ regularization, so we adopt the simpler design in the final model. To summarize, the semantic graph is fixed after we build it from the LLM-based similarities, and we do not recompute similarities during training. Edge dropout does not change the graph itself. It only drops a random portion of edges for that iteration, and the next iteration samples a new random set, so the underlying structure always stays the same. This dropout simply works as stochastic regularization. We use a fixed graph because rebuilding it during training would be extremely expensive and would introduce instability, while the adaptive edge-weight module already allows the model to adjust connection strengths in a way that is responsive to the task.
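The distinction between the fixed graph and the per-iteration edge sample can be made concrete with a short sketch. The function name `edge_dropout`, the drop rate of 0.5, and the toy edge list are assumptions for illustration only.

```python
import random


def edge_dropout(edges, drop_rate, rng):
    """Return the edges kept for one training iteration.

    Each edge is dropped independently with probability `drop_rate`.
    The input list is never modified, so the base graph stays fixed.
    """
    return [edge for edge in edges if rng.random() >= drop_rate]


base_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
rng = random.Random(0)  # seeded only to make the sketch reproducible

for step in range(3):
    kept = edge_dropout(base_edges, drop_rate=0.5, rng=rng)
    print(step, kept)

# The base graph is unchanged after sampling.
print(len(base_edges))
# 4
```

At evaluation time one would typically skip the sampling and use all edges, which corresponds to calling the function with a drop rate of zero.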

4.4. Sequential Modeling Block

After obtaining graph-enhanced item representations, we employ a self-attention-based encoder to model users’ dynamic behavioral patterns. The goal of this module is to capture both short-term transitions and long-range dependencies between interacted items, thereby producing context-aware representations for accurate next-item prediction. Unlike traditional recurrent models that process items sequentially, the self-attention mechanism allows each item to attend to all previous interactions simultaneously, enabling a more flexible and global dependency modeling.
For each user $u$, let the interaction sequence be denoted as $S_u = [v_1, v_2, \ldots, v_T]$, where $v_t$ is the item interacted with at timestep $t$. Each item $v_t$ is represented by its graph-enhanced embedding $\mathbf{z}_{v_t} \in \mathbb{R}^{d}$, which has already incorporated semantic and structural information through the adaptive graph learning stage. To encode temporal order, we introduce a learnable positional embedding $\mathbf{p}_t$ for each timestep, ensuring that the model can differentiate between items appearing in different positions of the sequence. The input embedding for timestep $t$ is given as:
$$\mathbf{x}_t = \mathbf{z}_{v_t} + \mathbf{p}_t.$$
All item embeddings in the sequence are then stacked into a matrix:
$$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T] \in \mathbb{R}^{T \times d},$$
which serves as the input to the sequential encoder. Within each self-attention layer $l$, contextualized representations are computed through scaled dot-product attention [39]:
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}} + \mathbf{M}\right)\mathbf{V},$$
where $\mathbf{M} \in \{0, -\infty\}^{T \times T}$ is a lower-triangular causal mask (0 on and below the diagonal, $-\infty$ above it) that ensures autoregressive modeling by blocking information flow from future positions. The query, key, and value matrices are derived as:
$$\mathbf{Q} = \mathbf{X}\mathbf{W}^{Q}, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^{K}, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^{V},$$
where $\mathbf{W}^{Q}, \mathbf{W}^{K}, \mathbf{W}^{V} \in \mathbb{R}^{d \times d}$ are trainable projection matrices.
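The effect of the causal mask can be seen in a single-head sketch. For brevity, the example assumes identity projections (so Q, K, and V are all equal to the input X), which is a simplification and not the configuration described above; the function name is hypothetical.

```python
import math


def causal_attention(Q, K, V):
    """Scaled dot-product attention where position t attends only to s <= t.

    Restricting the softmax to s <= t is equivalent to adding -inf to the
    scores of future positions before the softmax.
    """
    d = len(Q[0])
    outputs = []
    for t in range(len(Q)):
        scores = [
            sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * V[s][c] for s, w in enumerate(weights)) for c in range(len(V[0]))]
        )
    return outputs


# Toy sequence of three positions with d = 2; Q = K = V = X (identity projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(X, X, X)
print(out[0])
# [1.0, 0.0]
```

The first position can only attend to itself, so its output equals its own value vector; changing later rows of X leaves earlier outputs untouched.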
To enhance the model’s representation power, we apply multi-head attention, which allows the encoder to jointly attend to information from different representation subspaces. Each attention head independently learns a unique set of projections and outputs contextual representations that are later concatenated and linearly transformed. Formally, for H attention heads:
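$$\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\,\mathbf{W}^{O},$$
where each head is computed as $\text{head}_h = \text{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h)$, and $\mathbf{W}^{O} \in \mathbb{R}^{Hd \times d}$ is the output projection matrix.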
Each attention block is followed by a position-wise feed-forward network (FFN) and a residual connection with pre-layer normalization, which helps stabilize training and facilitate gradient flow across layers:
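$$\mathbf{H}^{(l)} = \mathbf{X}^{(l)} + \text{MultiHead}\big(\text{LayerNorm}(\mathbf{X}^{(l)})\big),$$
$$\mathbf{X}^{(l+1)} = \mathbf{H}^{(l)} + \text{FFN}\big(\text{LayerNorm}(\mathbf{H}^{(l)})\big),$$
where $\text{FFN}(\cdot)$ denotes a two-layer fully connected network with ReLU activation:
$$\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\,\mathbf{W}_2 + \mathbf{b}_2.$$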
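The second sub-layer (pre-layer normalization, feed-forward network, residual) can be sketched for a single position vector. The sketch assumes a layer normalization without learnable gain and bias and uses hand-picked identity weights; both are simplifications made for illustration, and the function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, with matrices stored as lists of rows."""
    hidden = [
        max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(hidden[j] * W2[j][c] for j in range(len(hidden))) + b2[c]
        for c in range(len(b2))
    ]


def pre_ln_ffn_sublayer(h, W1, b1, W2, b2):
    """Residual update: h + FFN(LayerNorm(h))."""
    update = ffn(layer_norm(h), W1, b1, W2, b2)
    return [a + b for a, b in zip(h, update)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
print(pre_ln_ffn_sublayer([1.0, -1.0], identity, zeros, identity, zeros))
```

With pre-normalization the residual path carries the unnormalized input forward unchanged, which is the property usually credited with stabilizing deeper stacks.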
After stacking $L$ self-attention layers, we obtain the final hidden state sequence $X^{(L)} = [x_1^{(L)}, \dots, x_T^{(L)}]$. The output representation at the last position, $z_T = x_T^{(L)} \in \mathbb{R}^{d}$, serves as the user’s contextual embedding summarizing their most recent preferences and long-term interests. Finally, the relevance score between the user’s final hidden state and each candidate item $v_i$ is computed via the inner product:
$$r_{i,T} = z_T^{\top} e_i,$$
where $e_i \in \mathbb{R}^{d}$ is the learnable embedding of item $v_i$. Items are then ranked in descending order of $r_{i,T}$, and the top-K items are recommended to the user. Through this self-attention-based sequential modeling block, the model effectively captures complex transition patterns and context-dependent user preferences, enabling precise and interpretable next-item prediction.
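The attention and scoring steps above can be illustrated with a minimal single-head sketch in plain Python. It assumes identity projections (so $Q = K = V = X$) and omits multi-head splitting, layer normalization, and the FFN. The helper names `causal_attention` and `rank_items` are hypothetical and are not taken from the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask (single head).

    q, k, v are T x d lists of lists. Position t may only attend to
    positions s <= t; masked scores are set to -inf before the softmax.
    """
    T, d = len(q), len(q[0])
    weights = []
    for t in range(T):
        scores = []
        for s in range(T):
            if s > t:
                scores.append(float("-inf"))  # mask entry M[t][s] = -inf
            else:
                dot = sum(q[t][j] * k[s][j] for j in range(d))
                scores.append(dot / math.sqrt(d))  # mask entry M[t][s] = 0
        weights.append(softmax(scores))
    out = [
        [sum(weights[t][s] * v[s][j] for s in range(T)) for j in range(len(v[0]))]
        for t in range(T)
    ]
    return out, weights


def rank_items(z, item_embeddings, top_k):
    """Rank items by inner product r_i = z . e_i, highest score first."""
    scores = [sum(a * b for a, b in zip(z, e)) for e in item_embeddings]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:top_k]


# Toy sequence of T = 3 item embeddings with d = 2 (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attn = causal_attention(X, X, X)
z_T = contextual[-1]  # representation at the last position
top2 = rank_items(z_T, X, top_k=2)
```

Because of the mask, the first position can only attend to itself, so its attention row places all weight on position 1 regardless of the embedding values.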

4.5. Training Strategy

The proposed model is trained under a next-item prediction objective, where the goal is to estimate the likelihood of the next item that a user will interact with, given their historical sequence. We formulate this as a binary classification task, optimized using binary cross-entropy (BCE) loss.
Each user interaction sequence is truncated or left-padded to a fixed maximum length $L$, ensuring uniform input dimensionality across the dataset. Sequences shorter than $L$ are padded with index 0, which is masked during both training and inference to avoid gradient propagation through invalid positions. For each user $u$, a training instance is constructed by using the prefix sequence $[v_1, v_2, \dots, v_{T-1}]$ as the model input, and the next interacted item $v_T^{+}$ as the positive label. To encourage discriminative learning, we adopt negative sampling, where items that the user has not interacted with are randomly selected as negative samples $v_T^{-}$. The model is trained to assign higher prediction scores to positive items than to negative ones.
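The preprocessing just described can be sketched as follows. This is a minimal illustration under stated assumptions: item ids run from 1 to `num_items` with 0 reserved for padding, one negative is drawn per instance, and the function names are hypothetical rather than the authors' code.

```python
import random


def pad_or_truncate(sequence, max_len, pad_id=0):
    """Keep the most recent max_len items and left-pad with pad_id."""
    recent = sequence[-max_len:]
    return [pad_id] * (max_len - len(recent)) + recent


def sample_negative(history, num_items, rng):
    """Uniformly draw an item id in 1..num_items the user never interacted with."""
    seen = set(history)
    if len(seen) >= num_items:
        raise ValueError("no unseen item available for negative sampling")
    while True:
        candidate = rng.randint(1, num_items)
        if candidate not in seen:
            return candidate


def build_instance(history, max_len, num_items, rng):
    """Use the prefix as input, the last item as positive, and one sampled negative."""
    prefix, positive = history[:-1], history[-1]
    inputs = pad_or_truncate(prefix, max_len)
    mask = [item != 0 for item in inputs]  # False on padding positions
    negative = sample_negative(history, num_items, rng)
    return inputs, mask, positive, negative


rng = random.Random(0)
inputs, mask, pos, neg = build_instance([3, 7, 9, 4], max_len=5, num_items=20, rng=rng)
```

Left-padding keeps the most recent interaction at the final position, which is the position whose hidden state is used for scoring.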
Formally, for each target position T, the loss for a single training instance is defined as [2]:
$$\mathcal{L}_{\mathrm{BCE}} = -\log \sigma\big(\hat{y}_{v_T^{+}}\big) - \log\big(1 - \sigma(\hat{y}_{v_T^{-}})\big),$$
where $\hat{y}_v$ denotes the predicted relevance score for item $v$, and $\sigma(\cdot)$ is the sigmoid activation function. This objective encourages the model to output higher scores for items that are more likely to appear next in a user’s sequence.
To ensure training stability, the loss is computed only on valid sequence positions, excluding padding tokens. We also apply $\ell_2$-regularization on the learnable embedding parameters to prevent overfitting and improve generalization. The overall optimization objective can thus be expressed as:
$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \lambda \lVert \Theta \rVert_2^2,$$
where $\lambda$ is a regularization coefficient and $\Theta$ represents all trainable parameters of the model.
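A small numerical sketch of this objective is given below. It assumes that per-position losses are averaged over the valid (non-padding) positions, which the text does not specify, and it treats the parameters as a flat list of floats. The function names are illustrative only.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bce_loss(pos_score, neg_score):
    """-log sigma(pos) - log(1 - sigma(neg)), using 1 - sigma(x) = sigma(-x)."""
    return -log_sigmoid(pos_score) - log_sigmoid(-neg_score)


def total_loss(pos_scores, neg_scores, mask, params, lam):
    """Mean BCE over valid positions plus lam * ||params||_2^2.

    Averaging over valid positions is an assumption of this sketch.
    """
    valid = [bce_loss(p, n) for p, n, m in zip(pos_scores, neg_scores, mask) if m]
    data_term = sum(valid) / len(valid) if valid else 0.0
    reg_term = lam * sum(w * w for w in params)
    return data_term + reg_term
```

When both scores are zero the model is maximally uncertain and the per-instance loss equals $2 \log 2$; the loss shrinks as the positive score rises and the negative score falls.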

5. Experiments

This section presents comprehensive experiments conducted to evaluate the effectiveness, robustness, and generality of the proposed SAGERec framework. We begin by describing the datasets and evaluation metrics used in our study, followed by detailed implementation settings and baseline comparisons. Subsequently, we report the overall performance across multiple datasets, analyze the contribution of each model component through ablation studies, and further investigate how the proposed semantic graph enhancement affects different encoder architectures. Through these analyses, we aim to provide a complete understanding of how each design choice contributes to the overall performance and to demonstrate the versatility of the proposed framework across diverse recommendation scenarios.

5.1. Dataset Description

We evaluate our proposed model on three benchmark datasets: Amazon-Beauty, Amazon-Toys, and MovieLens-1M (ML-1M). The Amazon-Beauty and Amazon-Toys subsets are extracted from the public Amazon Review dataset, which contains user-generated product reviews and purchase histories across various categories. ML-1M is a widely used benchmark for movie recommendation, containing approximately one million explicit user ratings collected from the MovieLens platform. Following common practice, we treat ratings of 4 and above as positive implicit feedback. Compared with the Amazon datasets, ML-1M exhibits denser user–item interactions and longer behavioral sequences, enabling the evaluation of model performance under high-frequency engagement scenarios. For all datasets, we retain users with at least five interactions to ensure sufficient historical context for sequential modeling. User interactions are chronologically sorted to construct ordered sequences. Each sequence is split using a leave-one-out strategy, where the most recent item is used for testing, the second most recent for validation, and the remaining items for training. This setting simulates realistic next-item prediction scenarios, where a model predicts a user’s upcoming action based on past behaviors. For the Amazon datasets, product metadata such as titles, descriptions, brands, and categories are merged to enrich the semantic information of item representations. For the ML-1M dataset, we extract each movie’s title and genre as auxiliary semantic features, allowing the model to capture content-aware relationships between items. The detailed statistics of the processed datasets, including user, item, and interaction counts, sparsity, and sequence-related attributes, are summarized in Table 2.
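The filtering and leave-one-out protocol described above can be sketched as follows. The input format of `(user, item, timestamp)` tuples and the function name `leave_one_out` are assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_out(interactions, min_interactions=5):
    """Split each user's chronologically sorted items into train/valid/test.

    interactions: iterable of (user, item, timestamp) tuples.
    Users with fewer than min_interactions events are dropped.
    """
    by_user = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))

    splits = {}
    for user, events in by_user.items():
        if len(events) < min_interactions:
            continue
        # Stable sort by timestamp only, so ties keep their input order.
        ordered = [item for _, item in sorted(events, key=lambda e: e[0])]
        splits[user] = {
            "train": ordered[:-2],  # all but the two most recent items
            "valid": ordered[-2],   # second most recent item
            "test": ordered[-1],    # most recent item
        }
    return splits


logs = [
    ("u1", 10, 5), ("u1", 11, 1), ("u1", 12, 3), ("u1", 13, 2), ("u1", 14, 4),
    ("u2", 10, 1), ("u2", 11, 2), ("u2", 12, 3),
]
splits = leave_one_out(logs)
```

In this toy log, user `u2` has only three interactions and is removed by the five-interaction filter.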
Table 2. Statistics of the datasets after preprocessing, where ‘#’ denotes the number of entries.

5.2. Evaluation Metrics

We evaluate the recommendation performance of all models using two widely adopted ranking metrics, Recall and NDCG (Normalized Discounted Cumulative Gain), computed at cutoff values 10 and 20. For each test case, the model produces a relevance score for every item and the top-K items form the recommendation list. Recall@K measures whether the ground-truth next item appears in the top-K list, reflecting the model’s ability to retrieve the correct next item. NDCG@K further accounts for the ranking position of the ground-truth item by assigning higher credit when it appears closer to the top of the list. For a given user u, Recall@K is defined as
$$\mathrm{Recall@}K = \frac{|R_u^{K} \cap T_u|}{|T_u|},$$
where $R_u^{K}$ denotes the set of top-K items recommended to user $u$, and $T_u$ represents the ground-truth items that user $u$ actually interacted with in the test set.
NDCG@K evaluates not only whether relevant items appear in the recommendation list but also their ranking positions. It assigns higher weights to relevant items ranked closer to the top, thereby capturing ranking quality more effectively. For user $u$, it is formulated as
$$\mathrm{NDCG@}K = \frac{1}{\mathrm{IDCG@}K} \sum_{i=1}^{K} \frac{\mathbb{I}(r_{u,i} \in T_u)}{\log_2(i+1)},$$
where $\mathbb{I}(\cdot)$ is the indicator function that equals 1 if the item ranked at position $i$ belongs to $T_u$, and $r_{u,i}$ denotes the $i$-th recommended item for user $u$. $\mathrm{IDCG@}K$ is the ideal discounted cumulative gain, i.e., the maximum possible DCG@K when all relevant items are ranked at the top positions. Finally, the average Recall@K and NDCG@K are computed across all users in the test set to assess the overall recommendation performance. Higher values of both metrics indicate stronger ranking capability and recommendation accuracy.
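Both metrics can be computed per user with a few lines of Python. The sketch below follows the two definitions directly and assumes a non-empty ground-truth set. Under the leave-one-out protocol $|T_u| = 1$, so $\mathrm{IDCG@}K = 1$ and NDCG@K reduces to $1/\log_2(\mathrm{rank}+1)$ when the target is in the top K.

```python
import math


def recall_at_k(ranked, relevant, k):
    """|top-k intersect relevant| / |relevant| for one user."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """DCG@k with binary gains, normalized by the ideal DCG@k."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg


# Ground-truth item 9 is ranked third in this toy list.
ranked_items = [5, 3, 9, 1]
r3 = recall_at_k(ranked_items, {9}, k=3)
n3 = ndcg_at_k(ranked_items, {9}, k=3)
```

Here the target sits at rank 3, so Recall@3 is 1 and NDCG@3 is $1/\log_2 4 = 0.5$; at a cutoff of 2 both metrics are 0.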

5.3. Experimental Settings

We compare our proposed model with a diverse set of representative baselines that span various model architectures, including traditional collaborative filtering, sequential neural models, recent self-supervised or graph-enhanced frameworks, and LLM-enhanced sequential recommenders. The compared methods are summarized as follows:
  • BPR [1]: A classic matrix factorization model optimized via Bayesian personalized ranking loss, which learns user–item latent factors from implicit feedback.
  • GRU4Rec [5]: A recurrent neural network-based sequential recommender that models user behavior through gated recurrent units to capture temporal dependencies.
  • Caser [6]: A convolutional neural network-based model that learns sequential patterns by applying horizontal and vertical convolutional filters over recent interaction embeddings.
  • SASRec [2]: A transformer-based sequential recommender that employs self-attention to capture long-range dependencies within user interaction sequences.
  • BERT4Rec [7]: An extension of SASRec using bidirectional self-attention and a masked-item prediction objective, enabling context-aware sequence modeling in both forward and backward directions.
  • S3-Rec [30]: A self-supervised learning framework that jointly optimizes multiple pretext tasks to align user, item, and attribute representations across hierarchical semantic spaces.
  • CL4SRec [12]: A contrastive self-supervised sequential recommender that enhances user representations through sequence augmentation and positive–negative sample alignment.
  • DSSRec [40]: A disentangled self-supervised sequential recommendation model that separates latent factors into independent dimensions to capture diverse user intents.
  • SINE [41]: A graph-based model that constructs sparse item–item relationships for sequential recommendation, improving representation efficiency and scalability.
  • TransRec [29]: A transition-based LLM recommender that bridges item and language spaces through multi-facet identifiers and position-free constrained generation, enabling more accurate grounding of generated tokens to in-corpus items.
For our model, we set the number of self-attention blocks to 2, the hidden dimension to 50, the batch size to 128, the learning rate to 0.001, and the L2 regularization coefficient on the embedding parameters to $1 \times 10^{-5}$. Following established practice in sequential recommendation [2,12], we set shorter maximum sequence lengths (50) for the sparser Beauty and Toys datasets and a longer length (165) for the denser ML-1M dataset. This principle is based on the observation that sparse datasets contain many short or noisy interaction histories, where a long maximum length introduces excessive padding and unstable learning, whereas denser datasets benefit from retaining longer behavioral patterns. Although the exact values vary across studies, our settings follow this commonly adopted logic. A more systematic investigation of sequence length choices and methods for better modeling very long user histories will be explored in future work. We set the hidden size of the adaptive edge learning MLP to $h = d$, matching the dimensionality of the item embeddings for simplicity and stable performance. All parameters in the model are initialized using the Xavier uniform scheme. We adopt one graph convolutional layer to propagate item relationships, with a node dropout rate of 0.5 and an edge dropout rate of 0.2 to alleviate overfitting. Model parameters are optimized using the Adam optimizer, with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and an $\epsilon$ value of $1 \times 10^{-8}$. The learning rate is decayed by a factor of 0.1 if the validation loss does not improve for five consecutive epochs. We adopt a negative sampling strategy similar to established sequential recommendation work, such as SASRec [2]. For each user–target pair, we randomly and uniformly sample 100 negative items that the user has never interacted with. This approach provides a sufficiently strong contrastive signal while avoiding the heavy computation required to evaluate all user–item pairs.
Early stopping is applied based on the best NDCG@10 performance on the validation set. For constructing the semantic graph, we select the top-K most similar neighbors for each item. Empirical validation across all datasets shows that K = 3 provides the best balance between capturing meaningful semantic relations and avoiding the noise and computational overhead introduced by larger K. Thus, we adopt K = 3 as the default in all experiments. To encode item-level semantic information, we employ the Sentence-BERT MiniLM-L12-v2 model [42] in a frozen manner. This lightweight variant provides an effective balance between semantic quality and computational efficiency. MiniLM-L12-v2 supports a maximum input length of 128 tokens; therefore, prompts exceeding this limit are automatically truncated. Note that the LLM-based semantic embeddings and the resulting global semantic graph are only used by SAGERec. For text-based methods such as TransRec, we provide the required item titles and attributes in accordance with their original implementations. All hyperparameters of sequence-based baselines (e.g., SASRec, BERT4Rec, CL4SRec) follow the recommended settings in their original papers, and we use publicly released code for all models whenever available. This ensures a fair comparison where each method operates under its intended input modality and standard configuration.
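The top-K neighbor selection used to build the semantic graph can be sketched as below. The sketch assumes that similarity is measured by cosine similarity between item text embeddings and uses brute-force pairwise comparison, which is only practical for small catalogs. The embedding values are toy placeholders, not outputs of the sentence encoder, and `build_topk_graph` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_topk_graph(embeddings, k=3):
    """Map each item index to its k most similar other items as (neighbor, similarity)."""
    graph = {}
    for i, e_i in enumerate(embeddings):
        sims = [(cosine(e_i, e_j), j) for j, e_j in enumerate(embeddings) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by index
        graph[i] = [(j, s) for s, j in sims[:k]]
    return graph


# Five toy 2-d "text embeddings" (placeholders for frozen encoder outputs).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
toy_graph = build_topk_graph(toy_embeddings, k=2)
```

Each item keeps exactly `k` outgoing edges, so the resulting graph has $|V| \cdot k$ directed edges, and the stored similarities can serve as initial edge weights.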
All experiments are conducted on a single NVIDIA A100 GPU. On the Amazon Beauty dataset, SAGERec requires approximately 4.4 s per epoch and converges in about 150 epochs, resulting in a total training time of roughly 11 min under our hyperparameter settings. For comparison, the contrastive learning-based baseline CL4SRec requires around 4.7 s per epoch on the same hardware but typically needs more than 200 epochs to converge due to its augmentation and contrastive optimization. As a result, its total training time is longer. Overall, these results indicate that SAGERec is computationally efficient compared with other advanced models and is practical for real-world deployment.

5.4. Overall Performance

Table 3 presents the Top-10 recommendation performance comparison on the Amazon Beauty, Amazon Toys, and ML-1M datasets. Furthermore, Table 4 summarizes the Top-20 recommendation performance comparison to demonstrate the consistency of our model. To ensure statistical reliability, all results are averaged over five independent runs with different random seeds. We further conduct paired t-tests comparing SAGERec with the second-best method, and mark improvements with * when the difference is significant at $p < 0.01$. Several consistent observations can be made. First, traditional methods such as BPR exhibit the weakest performance because they cannot model temporal dependencies or sequential dynamics. Early neural approaches, including GRU4Rec and Caser, achieve limited improvements, indicating the challenge of capturing long-range dependencies solely through recurrent or convolutional structures. Transformer-based models such as SASRec and BERT4Rec substantially outperform these earlier methods, demonstrating the effectiveness of self-attention in modeling user behavioral patterns over long sequences.
Table 3. Top-10 recommendation performance on Beauty, Toys, and ML-1M datasets. Bold indicates the best result; underline marks the second best. A superscript “*” denotes statistically significant improvement over the strongest baseline (two-sided paired t-test, $p < 0.01$).
Table 4. Top-20 recommendation performance on Beauty, Toys, and ML-1M datasets. Bold indicates the best result; underline marks the second best. A superscript “*” denotes statistically significant improvement over the strongest baseline (two-sided paired t-test, $p < 0.01$).
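The paired t-test over seeds mentioned above can be illustrated with a short sketch that computes the test statistic from per-seed scores of two models. The numbers in the example are placeholders chosen for easy arithmetic and are not results from this paper. The function name is hypothetical, and the p-value lookup is left out.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)) with d the per-seed differences.
    Raises ZeroDivisionError if all differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Placeholder per-seed metric values for two models (not from the paper).
model_a = [0.5, 0.6, 0.7]
model_b = [0.4, 0.5, 0.5]
t_value, dof = paired_t_statistic(model_a, model_b)
```

With five runs per model there are four degrees of freedom, for which the two-sided critical value at the 0.01 level is about 4.604; the difference is significant when $|t|$ exceeds that threshold.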
Recent self-supervised and intent-aware models further enhance sequential representations. CL4SRec, DSSRec, and SINE all achieve competitive results by incorporating contrastive objectives or disentangled latent factors, which improve robustness and interpretability. Among these, CL4SRec remains one of the strongest baselines. Across all datasets, our proposed model achieves the best overall performance on nearly every metric. On the Amazon Beauty dataset, it surpasses the strongest baseline (CL4SRec) by 3.2% and 2.6% in NDCG@10 and NDCG@20, respectively, and by 1.4% and 3.8% in Recall@10 and Recall@20. On the Amazon Toys dataset, our model achieves relative improvements of 5.4% in NDCG@10 and 2.7% in NDCG@20, demonstrating its ability to generalize across distinct e-commerce domains. The most significant gains appear on the ML-1M dataset, where our model improves Recall@10 and Recall@20 by 19.5% and 15.2%, respectively, and NDCG@10 and NDCG@20 by 9.2% and 8.3%. We attribute these consistent gains to the integration of a semantic global graph constructed from SBERT-encoded product metadata and learned through a GCN layer. This graph enables the model to propagate semantic relationships between related items—such as similar descriptions, brands, or genres—beyond co-occurrence patterns in user behavior.
The improvement is particularly pronounced on ML-1M, which may be attributed to a stronger alignment between the semantic space derived from LLM-generated embeddings and the behavioral co-occurrence patterns in the movie domain. In this dataset, users who watch movies of similar genres or narrative styles often exhibit correlated preferences, and such relationships are likely to be captured more effectively by LLM-based semantic similarities. Consequently, the global graph provides more behaviorally coherent neighborhoods, allowing GCN propagation to refine item embeddings in a way that complements sequential modeling. In contrast, although the Amazon datasets contain richer textual metadata, their semantic and behavioral spaces are less aligned. Product descriptions and categories are highly diverse and sometimes emphasize non-behavioral aspects, such as marketing or stylistic expressions, which weakens the consistency between semantic proximity and actual user interactions. As a result, the graph aggregation may introduce more noise, yielding relatively smaller gains.
We additionally compare SAGERec with TransRec, a recent LLM-based sequential recommendation model that formulates next-item prediction as a constrained text generation task using multi-facet item identifiers. As shown in Tables 3 and 4, TransRec achieves competitive performance and consistently outperforms earlier Transformer-based models such as SASRec and BERT4Rec, demonstrating the benefit of leveraging richer textual semantics through a large language model. However, TransRec still falls short of state-of-the-art self-supervised methods such as CL4SRec and remains noticeably behind SAGERec across all datasets and evaluation metrics. TransRec relies entirely on a pretrained LLM to interpret item semantics through natural language generation, and this LLM-centric approach neither explicitly captures global item–item relationships nor adapts these relationships during training. In contrast, SAGERec augments LLM-derived semantic embeddings with a global item graph and an adaptive edge-learning module that refines structural relations based on task-specific signals. This complementary use of semantic information and graph structure leads to more robust item representations, particularly on sparse datasets, and results in consistently stronger performance.

5.5. Ablation Study

To comprehensively evaluate the contribution of each component in the proposed framework, we conduct a series of ablation experiments on three benchmark datasets: Amazon Beauty, Amazon Toys, and ML-1M. The following variants are considered: (A) Full Model, representing the complete SAGERec framework; (B) w/o Edge Learning, which removes the adaptive edge-weight module and directly uses the initial cosine similarity as fixed edge weights; (C) w/o Graph, where the semantic graph and graph convolution are completely removed, resulting in a standard transformer-based sequential recommender; (D) w/ Random Graph, which replaces the semantic graph with a randomly connected graph having the same number of edges; and (E) w/o Position Embedding, which removes positional encoding to assess the importance of temporal order awareness in sequential modeling.
The results in Table 5 reveal several clear trends. First, the full model (A) consistently achieves the best performance across all datasets, confirming that each module contributes synergistically to the final recommendation quality. Removing edge learning (B) leads to a moderate performance drop, indicating that dynamically learning edge weights enables the model to adapt the semantic graph to evolving item representations rather than relying solely on static similarities. When the graph module is removed (C), performance decreases significantly across all metrics, highlighting that semantic-aware graph convolution effectively enriches item representations through neighborhood aggregation. Using a random graph (D) produces the worst results, showing that the structural prior derived from SBERT-based semantic embeddings provides meaningful relational information that cannot be replaced by random connectivity. Finally, removing the positional embedding (E) also causes consistent degradation, confirming that temporal order remains an essential factor in capturing user behavioral dynamics even with the addition of global semantic context.
Table 5. Ablation study on Beauty, Toys, and ML-1M datasets (Top-20 evaluation). Each variant removes or modifies one component of SAGERec.
Overall, these ablation results demonstrate that both the semantic graph and the adaptive edge-learning mechanism are critical to achieving superior performance. The joint modeling of global semantic relationships and sequential dependencies allows the proposed framework to generate richer and more robust item representations across different recommendation domains.
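As an illustration of how the random-graph control (variant D) might be constructed, the sketch below rewires a given graph while keeping the total number of edges. It additionally preserves each item's out-degree, which is an assumption of this sketch; the text only states that the number of edges is the same. The function name is hypothetical.

```python
import random


def random_graph_like(graph, num_items, seed=0):
    """Replace every item's neighbors with uniformly sampled distinct items.

    graph: dict mapping item index -> list of neighbor indices.
    Self-loops are excluded and each item keeps its original neighbor count.
    """
    rng = random.Random(seed)
    randomized = {}
    for item, neighbors in graph.items():
        candidates = [j for j in range(num_items) if j != item]
        randomized[item] = rng.sample(candidates, len(neighbors))
    return randomized


# Toy semantic graph with two neighbors per item.
semantic = {0: [1, 2], 1: [0, 3], 2: [3, 0], 3: [2, 1]}
control = random_graph_like(semantic, num_items=4)
```

Matching the edge budget in this way separates the effect of semantic structure from the effect of simply adding neighborhood aggregation.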

5.6. Effect of Encoder Choice

To further examine the generality of our proposed semantic graph enhancement, we conduct additional experiments by integrating it with three representative sequential encoders of different architectural paradigms: GRU4Rec (RNN-based), BERT4Rec (Transformer-based), and CL4SRec (contrastive learning-based). For each encoder, we train two variants: the original implementation and the graph-enhanced version incorporating our semantic-aware global graph module. The evaluation results on the Amazon Beauty dataset are shown in Figure 2. The figure reports four widely used ranking metrics (Recall@10, NDCG@10, Recall@20, and NDCG@20), allowing a direct comparison of top-K recommendation quality across different encoder architectures. Higher values indicate better retrieval performance.
Figure 2. Performance comparison of three encoder architectures (GRU4Rec, BERT4Rec, and CL4SRec) on the Amazon Beauty dataset before and after integrating the proposed semantic graph module. Each bar represents a ranking metric: Recall@10, NDCG@10, Recall@20, and NDCG@20. Higher values indicate better top-K recommendation quality.
As shown in Figure 2, our semantic graph consistently enhances the performance of all three encoders, demonstrating the broad applicability of the proposed strategy. The improvement is particularly pronounced for GRU4Rec, where Recall@20 and NDCG@10 increase by approximately 10–20%. This substantial gain suggests that the incorporation of global semantic relations can effectively compensate for the limited receptive field of RNN-based models, enabling them to capture higher-order dependencies and contextual relevance between items beyond local sequence transitions. For BERT4Rec, we observe notable improvements, especially in Recall@20 and NDCG@20, indicating that the semantic graph provides complementary global information to the bidirectional self-attention mechanism. This additional context enhances the quality of learned item representations and helps the model achieve a more accurate ranking of relevant items. In the case of CL4SRec, the improvement is relatively smaller yet consistent. This is expected, as CL4SRec already employs a contrastive self-supervised learning objective that implicitly encourages semantic consistency among similar items, leaving less room for additional gains.
Overall, these results confirm that the proposed semantic-aware global graph can serve as a universal enhancement module that effectively improves sequential encoders of various architectures. It provides complementary semantic structure information that reinforces both short- and long-range dependencies, resulting in improved ranking accuracy and more robust sequential representations across different modeling paradigms.

5.7. Complexity and Efficiency Analysis

To clarify the efficiency of our model, we compare SAGERec with SINE, a representative graph-enhanced sequential recommender that also relies on item–item structural information. Both models operate on the same set of 12,101 items in the Amazon Beauty dataset, but they differ substantially in how graph structures are constructed and used during training. SINE forms an item–item similarity graph based on co-occurrence statistics and aggregates neighbors at every training iteration. Its per-epoch complexity consists of (1) sparse neighbor aggregation with cost $O(|V| \cdot k \cdot d)$, and (2) a multi-interest fusion module requiring $O(B \cdot T \cdot d)$. Although the graph is sparse, SINE also maintains a large concept-interaction structure whose effective connection size is approximately $|V| \times N_c$, leading to higher memory consumption and computation. In contrast, SAGERec constructs its semantic item–item graph offline using LLM-based embeddings. The resulting graph is extremely sparse, with only $k = 3$ neighbors per item (approximately $3.6 \times 10^{4}$ edges). During training, SAGERec performs a LightGCN-style propagation without transformation layers, yielding a lightweight graph cost of $O(|V| \cdot k \cdot d)$. Combined with a standard Transformer encoder of complexity $O(B \cdot T^{2} \cdot d)$, the overall training procedure avoids SINE’s multi-interest components and retains only minimal sparse propagation overhead.
Table 6 summarizes the node and edge counts, memory usage, and per-epoch runtime. SAGERec requires fewer structural connections, consumes less memory, and trains faster than SINE under identical settings, confirming that the proposed semantic graph module is computationally efficient in practice.
Table 6. Complexity and efficiency comparison on the Amazon Beauty dataset. All experiments run on an NVIDIA A100 GPU. ‘#’ denotes the number of entries.
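The leading-order terms in this analysis can be checked with simple arithmetic. The sketch below plugs in $|V| = 12{,}101$ and $k = 3$ from this section together with $d = 50$, $B = 128$, and $T = 50$ from the Beauty settings in Section 5.3. It ignores constant factors, the number of layers, and heads. The graph term counts one propagation pass and the attention term one batch, so the two figures are not a per-epoch comparison.

```python
def graph_edges(num_items, k):
    """Number of directed edges when every item keeps k neighbors."""
    return num_items * k


def propagation_cost(num_items, k, dim):
    """Multiply-adds for one sparse propagation pass: |V| * k * d."""
    return num_items * k * dim


def attention_cost(batch_size, seq_len, dim):
    """Leading-order self-attention cost for one batch: B * T^2 * d."""
    return batch_size * seq_len ** 2 * dim


edges = graph_edges(12101, 3)
graph_ops = propagation_cost(12101, 3, 50)
attn_ops = attention_cost(128, 50, 50)
```

The edge count evaluates to 36,303, matching the stated figure of roughly $3.6 \times 10^{4}$ edges.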

5.8. Case Analysis on Long-Tail Problem

To further examine whether SAGERec can benefit long-tail items, we conduct a small-scale cold-start evaluation on the Amazon Beauty dataset. We select 100 items that never appear in the training set but appear as ground-truth targets in the test set. Since these items have no behavioral history, they represent extreme long-tail cases where sequence-only models struggle. On this subset, SASRec fails to rank any of the cold items correctly (NDCG@10 = 0), reflecting its dependence on co-occurrence signals. CL4SRec achieves a modest improvement (NDCG@10 = 0.0075), suggesting that contrastive learning helps capture more generalizable representations. SAGERec achieves an NDCG@10 of 0.0148, nearly doubling the performance of CL4SRec. This gain aligns with the design intuition of our model. Even when an item has no interactions, it can still obtain informative representations through the semantic graph and the adaptive edge-learning module, which allow it to absorb structural and semantic signals from its graph neighbors. Beyond metrics, we analyze a representative cold item (ID 352). This item has zero interactions in the training data and would normally lack any behavioral signal. In the semantic graph constructed by SAGERec, item 352 is connected to item 4537, which appears more than ten times in the training set. During graph convolution, item 352 receives informative messages from item 4537 and thereby acquires meaningful semantic structure. This behavior is confirmed in the learned embeddings. Under SASRec, the cosine similarity between item 352 and its semantically related neighbor item 4537 is only 0.07, indicating almost no relational information. After applying our graph-enhanced learning, the cosine similarity increases to 0.35. This jump demonstrates that SAGERec effectively propagates semantic information to cold-start items through the adaptive graph. 
Although SAGERec is not specifically designed for cold-start recommendation, these results indicate that graph-enhanced semantic learning provides a promising direction for addressing long-tail challenges. Further exploration of this capability is an important direction for future work.
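The propagation effect in the case analysis can be reproduced on a toy example. The sketch below applies one propagation step without transformation weights and averages the result with the input embedding. The weighted-mean normalization and the toy vectors are assumptions of this sketch, not the model's exact formulation or its learned embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propagate(embeddings, graph):
    """One weighted-mean neighbor aggregation, averaged with the input embedding.

    graph: dict mapping item index -> list of (neighbor index, edge weight).
    Items without neighbors (or with zero total weight) are left unchanged.
    """
    updated = []
    for i, own in enumerate(embeddings):
        neighbors = graph.get(i, [])
        total = sum(w for _, w in neighbors)
        if total == 0:
            updated.append(list(own))
            continue
        aggregated = [
            sum(w * embeddings[j][c] for j, w in neighbors) / total
            for c in range(len(own))
        ]
        updated.append([(o + a) / 2 for o, a in zip(own, aggregated)])
    return updated


# Item 0 plays the cold item; item 1 is its well-trained semantic neighbor.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
graph = {0: [(1, 1.0)]}
before = cosine(embeddings[0], embeddings[1])
after = cosine(propagate(embeddings, graph)[0], embeddings[1])
```

In this toy setting the cold item's embedding starts orthogonal to its neighbor and moves toward it after one step, which mirrors the direction of the similarity increase reported for item 352.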

6. Discussion

This work is motivated by two key limitations of existing sequential recommendation models. On the one hand, Transformer-based architectures such as SASRec and BERT4Rec mainly focus on local user interaction sequences and have limited awareness of global item–item relations, which constrains their ability to generalize beyond frequently co-occurring items. On the other hand, graph-based or hybrid graph–sequential methods typically construct graphs directly from behavioral co-occurrence (e.g., session transitions or global co-click statistics), so the resulting structure may miss semantically related items that have not yet appeared together in user histories. To bridge these gaps, this paper proposes SAGERec, a semantic-aware global graph-enhanced sequential recommendation framework that leverages LLM-derived item representations to construct a global semantic graph and adaptively refine its edge strengths, and then integrates the resulting semantic-enhanced item embeddings into a lightweight Transformer encoder for more expressive and context-aware sequence modeling. Extensive experiments on three benchmark datasets show that this design consistently improves both Recall@K and NDCG@K over strong baselines, indicating that global semantics and adaptive graph learning provide complementary signals to pure sequence modeling.
Compared with earlier graph-based and hybrid sequential models, SAGERec presents fundamental architectural differences in both how item relationships are constructed and how sequential dependencies are modeled. Traditional GNN-based recommenders such as NGCF and LightGCN rely solely on the user–item interaction graph, meaning that item relationships only arise from behavioral co-occurrence and cannot capture semantic relatedness that does not appear in the logs. Even more advanced variants such as DGCF or SGL emphasize disentangled factors or contrastive augmentation, yet still operate on the same behavior-driven topology and cannot introduce new edges that reflect content-level similarity. Sequential models such as SASRec, BERT4Rec, or CL4SRec, on the other hand, completely discard graph structure and rely purely on order-aware Transformer encoders. While effective for short-term dynamics, they lack any mechanism to provide global contextual signals or higher-order relational structure among items. Hybrid models like GC-SAN, LESSR, and SINE extend sequential recommendation by injecting graph-based context, but their graphs remain behavior-centric: GC-SAN and LESSR build localized transition graphs from sessions, and SINE constructs a concept-level interest graph derived from co-occurrence patterns. These graphs are static, limited to the observed interaction space, and do not exploit semantic information from item metadata or textual attributes. In contrast, SAGERec differs at a foundational level. Instead of relying on behavioral transitions, its item–item graph is constructed from LLM-derived semantic embeddings that encode rich textual metadata. This enables the graph to reveal meaningful global structure (including associations among long-tail or sparsely interacted items) that cannot be recovered from logs alone.
Furthermore, unlike fixed-graph hybrid models, SAGERec introduces an adaptive edge-weight learning mechanism that jointly optimizes semantic connections together with the recommendation objective. Finally, the semantic-enhanced item representations are fed into a lightweight Transformer encoder, enabling the model to integrate global semantic context with local sequential preference signals in a unified architecture. This combination of LLM-derived semantics, adaptive graph refinement, and Transformer-based sequence modeling establishes SAGERec as a fundamentally different paradigm from prior graph-based, sequential, and hybrid approaches.
Despite the effectiveness of the proposed framework, several limitations remain. First, SAGERec relies on textual metadata to construct the semantic graph, and its performance is therefore bounded by the quality and informativeness of item descriptions. In domains where product texts are vague, repetitive, or promotional—such as certain Amazon categories—the semantic embeddings extracted by pretrained language models may introduce noise rather than provide meaningful relational signals. Second, although the adaptive edge-weight module refines semantic connections during training, the underlying graph itself is static because it is derived from fixed textual content. Real-world user preferences, however, are dynamic and may shift in response to temporal factors such as seasonal trends or evolving market interests. A static semantic graph may not fully reflect these changing behavioral patterns. Third, because the semantic embeddings originate from general-purpose pretrained language models, a domain gap may arise between generic semantic similarity and task-specific relational structures (e.g., substitute vs. complementary products). This misalignment may limit the model’s ability to represent fine-grained functional relationships that are crucial for accurate recommendations. Finally, while SAGERec effectively mitigates item-side sparsity by connecting long-tail or rarely interacted items through semantic neighbors, its ability to improve recommendations for users with extremely short interaction histories remains limited, as no user-specific side information is incorporated.

7. Conclusions

7.1. Main Findings

In this study, we proposed SAGERec, a framework that integrates a semantic-based global item graph with sequential recommendation models. The experimental results demonstrate that incorporating global semantic relationships substantially improves next-item prediction compared to strong baseline models. Specifically, SAGERec improves NDCG@10 by 3.2–9.2% and Recall@10 by 1.1–19.5% over the strongest baselines on Beauty, Toys, and ML-1M, with even larger gains at top-20 evaluation, where improvements reach 8.3% (NDCG@20) and 15.2% (Recall@20) on ML-1M. The gains are largest in domains where item descriptions align well with user preferences. By providing additional context regarding item-to-item semantic relatedness, our approach enriches item representations, which proves particularly beneficial in scenarios where user interaction histories are short or sparse. This advantage is most pronounced in datasets such as ML-1M, where item descriptions and genres closely align with user behavioral choices. From a practical perspective, SAGERec offers substantial value for real-world recommendation systems. Since the semantic graph is constructed offline, the framework does not incur additional online serving costs, making it efficient for deployment. Furthermore, the semantic connectivity allows new or long-tail items to enter the recommendation cycle earlier, addressing the cold-start problem by linking them to related items even in the absence of extensive interaction data. Overall, this work highlights the importance of combining complementary behavioral and semantic signals to build more robust and accurate recommendation models.
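For readers less familiar with the reported metrics, the sketch below shows how Recall@K and NDCG@K are commonly computed in next-item prediction. It assumes a leave-one-out protocol with a single held-out target item per user and reads the reported percentages as relative improvements over the baseline; both are assumptions, not details restated in this section. The function names and the numbers in the example are hypothetical.

```python
import math


def recall_at_k(rank, k):
    """1.0 if the held-out item is ranked within the top k (1-indexed), else 0.0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k):
    """With a single relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def relative_gain(model_score, baseline_score):
    """Relative improvement of the model over the baseline, in percent."""
    return 100.0 * (model_score - baseline_score) / baseline_score


# Hypothetical ranks of the held-out item for four users.
ranks = [1, 3, 11, 7]
mean_recall = sum(recall_at_k(r, 10) for r in ranks) / len(ranks)
mean_ndcg = sum(ndcg_at_k(r, 10) for r in ranks) / len(ranks)
```

Per-user scores are averaged over all test users, and `relative_gain` is then applied to the averaged model and baseline scores.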

7.2. Limitations and Future Work

Despite the effectiveness of the proposed framework, several limitations define the scope of its applicability and point toward directions for future research. First, the model's performance depends heavily on the quality and informativeness of item textual metadata. In certain domains, particularly on e-commerce platforms like Amazon, product descriptions may contain vague, repetitive, or promotional content that does not reflect user interests. In such cases, semantic similarity may introduce noise rather than helpful signals, leading to smaller performance gains compared to domains with high-quality metadata. Second, although SAGERec employs learnable edge weights to refine the graph, the underlying semantic structure is constructed from static text information. Real-world user interests are dynamic and evolve due to trends, seasons, or lifestyle shifts. A static semantic graph may fail to capture these time-dependent changes in user preferences. Third, the semantic graph relies on embeddings from pre-trained language models (e.g., SBERT). The general semantic knowledge captured by these models may not fully align with domain-specific relationships (e.g., substitute vs. complementary items in retail), and this domain gap can limit the graph's ability to capture functional item relationships accurately. Finally, while SAGERec can alleviate item-side sparsity to some extent through the global semantic graph, its improvement for users with extremely short interaction histories is less direct and needs further research. Future work will address these limitations by exploring dynamic graph structures that evolve over time, investigating domain-adaptive language models to better align semantics with behavior, exploring user-side data augmentation or user-specific side information, and integrating multi-modal information such as images to enhance robustness against noisy text.

Author Contributions

Conceptualization, W.C. and H.-K.L.; methodology, W.C.; software, W.C.; validation, W.C. and H.-K.L.; formal analysis, W.C.; writing—original draft preparation, W.C.; writing—review and editing, H.-K.L.; visualization, W.C.; supervision, H.-K.L.; project administration, H.-K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the China Scholarship Council (CSC) and King’s College London through the K-CSC Joint Scholarship Programme.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv 2012, arXiv:1205.2618. [Google Scholar] [CrossRef]
  2. Kang, W.C.; McAuley, J. Self-attentive sequential recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 197–206. [Google Scholar]
  3. Li, J.; Wang, Y.; McAuley, J. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM), Houston, TX, USA, 3–7 February 2020; pp. 322–330. [Google Scholar]
  4. Tang, M.; Cui, S.; Jin, Z.; Liang, S.n.; Li, C.; Zou, L. Sequential recommendation by reprogramming pretrained transformer. Inf. Process. Manag. 2025, 62, 103938. [Google Scholar] [CrossRef]
  5. Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; Tikk, D. Session-based recommendations with recurrent neural networks. arXiv 2015, arXiv:1511.06939. [Google Scholar]
  6. Tang, J.; Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA, 5–9 February 2018; pp. 565–573. [Google Scholar]
  7. Sun, F.; Liu, J.; Wu, J.; Pei, C.; Lin, X.; Ou, W.; Jiang, P. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1441–1450. [Google Scholar]
  8. Latifi, S.; Jannach, D.; Ferraro, A. Sequential recommendation: A study on transformers, nearest neighbors and sampled metrics. Inf. Sci. 2022, 609, 660–678. [Google Scholar] [CrossRef]
  9. Wang, S.; Hu, L.; Wang, Y.; Cao, L.; Sheng, Q.Z.; Orgun, M. Sequential recommender systems: Challenges, progress and prospects. arXiv 2019, arXiv:2001.04830. [Google Scholar] [CrossRef]
  10. Gao, C.; Zheng, Y.; Li, N.; Li, Y.; Qin, Y.; Piao, J.; Quan, Y.; Chang, J.; Jin, D.; He, X.; et al. A survey of graph neural networks for recommender systems: Challenges, methods, and directions. ACM Trans. Recomm. Syst. 2023, 1, 1–51. [Google Scholar] [CrossRef]
  11. Li, J.; Wang, M.; Li, J.; Fu, J.; Shen, X.; Shang, J.; McAuley, J. Text is all you need: Learning language representations for sequential recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 1258–1267. [Google Scholar]
  12. Xie, X.; Sun, F.; Liu, Z.; Wu, S.; Gao, J.; Zhang, J.; Ding, B.; Cui, B. Contrastive learning for sequential recommendation. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 1259–1273. [Google Scholar]
  13. Duan, W.; Liang, D. User disambiguation learning for precise shared-account marketing: A hierarchical self-attentive sequential recommendation method. Knowl.-Based Syst. 2025, 315, 113328. [Google Scholar] [CrossRef]
  14. Rendle, S.; Freudenthaler, C.; Schmidt-Thieme, L. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 811–820. [Google Scholar]
  15. Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; Ma, J. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 1419–1428. [Google Scholar]
  16. Berg, R.v.d.; Kipf, T.N.; Welling, M. Graph convolutional matrix completion. arXiv 2017, arXiv:1706.02263. [Google Scholar] [CrossRef]
  17. Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W.L.; Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 974–983. [Google Scholar]
  18. Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174. [Google Scholar]
  19. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 639–648. [Google Scholar]
  20. Chen, L.; Wu, L.; Hong, R.; Zhang, K.; Wang, M. Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 27–34. [Google Scholar]
  21. Wu, J.; Wang, X.; Feng, F.; He, X.; Chen, L.; Lian, J.; Xie, X. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021; pp. 726–735. [Google Scholar]
  22. Xia, L.; Huang, C.; Xu, Y.; Zhao, J.; Yin, D.; Huang, J. Hypergraph contrastive collaborative filtering. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 70–79. [Google Scholar]
  23. Xu, C.; Zhao, P.; Liu, Y.; Sheng, V.S.; Xu, J.; Zhuang, F.; Fang, J.; Zhou, X. Graph contextualized self-attention network for session-based recommendation. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; Volume 19, pp. 3940–3946. [Google Scholar]
  24. Chen, T.; Wong, R.C.W. Handling information loss of graph neural networks for session-based recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July 2020; pp. 1172–1180. [Google Scholar]
  25. Wang, Z.; Wei, W.; Cong, G.; Li, X.L.; Mao, X.L.; Qiu, M. Global context enhanced graph neural networks for session-based recommendation. In Proceedings of the 43rd international ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 169–178. [Google Scholar]
  26. Hu, H.; Guo, W.; Liu, Y.; Kan, M.Y. Adaptive multi-modalities fusion in sequential recommendation systems. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 843–853. [Google Scholar]
  27. Zou, D.; Wei, W.; Wang, Z.; Mao, X.L.; Zhu, F.; Fang, R.; Chen, D. Improving knowledge-aware recommendation with multi-level interactive contrastive learning. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 2817–2826. [Google Scholar]
  28. Wang, H.; Zhang, F.; Xie, X.; Guo, M. DKN: Deep knowledge-aware network for news recommendation. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 1835–1844. [Google Scholar]
  29. Lin, X.; Wang, W.; Li, Y.; Feng, F.; Ng, S.K.; Chua, T.S. Bridging items and language: A transition paradigm for large language model-based recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 1816–1826. [Google Scholar]
  30. Zhou, K.; Wang, H.; Zhao, W.X.; Zhu, Y.; Wang, S.; Zhang, F.; Wang, Z.; Wen, J.R. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, Ireland, 19–23 October 2020; pp. 1893–1902. [Google Scholar]
  31. Qiu, R.; Huang, Z.; Yin, H.; Wang, Z. Contrastive learning for representation degeneration problem in sequential recommendation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event, AZ, USA, 21–25 February 2022; pp. 813–823. [Google Scholar]
  32. Wang, P.; Cui, W. SALARec: Dual-Alignment Contrastive Learning with Preference-Aware Adversarial Augmentation for Sequential Recommendation. Expert Syst. Appl. 2025, 299, 130070. [Google Scholar] [CrossRef]
  33. Wang, J.; Ding, K.; Hong, L.; Liu, H.; Caverlee, J. Next-item recommendation with sequential hypergraphs. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 1101–1110. [Google Scholar]
  34. Wang, Q.; Li, J.; Wang, S.; Xing, Q.; Niu, R.; Kong, H.; Li, R.; Long, G.; Chang, Y.; Zhang, C. Towards next-generation llm-based recommender systems: A survey and beyond. arXiv 2024, arXiv:2410.19744. [Google Scholar]
  35. Wang, Y.; Shi, X.; Zhao, X. Mllm4rec: Multimodal information enhancing llm for sequential recommendation. J. Intell. Inf. Syst. 2025, 63, 745–761. [Google Scholar] [CrossRef]
  36. Lyu, H.; Jiang, S.; Zeng, H.; Xia, Y.; Wang, Q.; Zhang, S.; Chen, R.; Leung, C.; Tang, J.; Luo, J. Llm-rec: Personalized recommendation via prompting large language models. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 583–612. [Google Scholar]
  37. Qu, H.; Fan, W.; Zhao, Z.; Li, Q. Tokenrec: Learning to tokenize id for llm-based generative recommendations. IEEE Trans. Knowl. Data Eng. 2025, 37, 6216–6231. [Google Scholar] [CrossRef]
  38. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2025, 43, 1–55. [Google Scholar] [CrossRef]
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  40. Ma, J.; Zhou, C.; Yang, H.; Cui, P.; Wang, X.; Zhu, W. Disentangled self-supervision in sequential recommenders. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July 2020; pp. 483–491. [Google Scholar]
  41. Tan, Q.; Zhang, J.; Yao, J.; Liu, N.; Zhou, J.; Yang, H.; Hu, X. Sparse-interest network for sequential recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual Event, Israel, 8–12 March 2021; pp. 598–606. [Google Scholar]
  42. Reimers, N.; Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
