End-to-End Personalization via Unifying LLM Agents and Graph Attention Networks for Entertainment Recommendation

Ebrat, Danial; Ahmadian, Sepideh; Rueda, Luis

doi:10.3390/info17040344


End-to-End Personalization via Unifying LLM Agents and Graph Attention Networks for Entertainment Recommendation^†

by Danial Ebrat, Sepideh Ahmadian and Luis Rueda ^*

School of Computer Science, University of Windsor, 401 Sunset Ave, Windsor, ON N9B 3P4, Canada

^* Author to whom correspondence should be addressed.

^† This article is an extended version of a paper entitled “Vectorized Context-Aware Embeddings for GAT-Based Collaborative Filtering”, presented at The 38th Canadian Conference on Artificial Intelligence, 2025, Calgary, Canada 26–29 May 2025, and a workshop paper entitled “End-to-End Personalization: Unifying Recommender Systems with Large Language Models”, presented at the KDD Workshop on Second Workshop on Generative AI for Recommender Systems and Personalization, held in conjunction with the ACM Conference on Knowledge Discovery and Data Mining (KDD), Toronto, Canada, 4 August 2025.

Information 2026, 17(4), 344; https://doi.org/10.3390/info17040344

Submission received: 1 February 2026 / Revised: 20 March 2026 / Accepted: 31 March 2026 / Published: 2 April 2026

(This article belongs to the Special Issue 2nd Edition of Modern Recommender Systems: Approaches, Challenges and Applications)


Abstract

Recommender systems are central to helping users navigate the rapidly expanding entertainment ecosystem, yet achieving strong personalization with limited feedback while maintaining interpretability remains difficult, particularly under cold-start conditions and heterogeneous item metadata. This work presents an end-to-end hybrid recommendation framework that unifies a Large Language Model (LLM) with Graph Attention Network (GAT)-based collaborative filtering to improve both ranking accuracy and explanation quality across movies, books, and music. LLM-based agents first transform raw metadata such as titles, genres, descriptions, and auxiliary attributes into semantically grounded user and item profiles, which are embedded and used as initial node features in a user–item bipartite graph processed by a GAT-based recommender. Model optimization relies on a hybrid objective combining Bayesian Personalized Ranking, cosine-similarity regularization, and robust negative sampling to better align semantic and collaborative signals. Finally, in the post-processing stage, an LLM-based agent re-ranks the GAT outputs using a proposed Hybrid Confidence-Weighted Binary Search Tree, and another LLM-based agent produces natural-language justifications tailored to each user. Experiments on diverse benchmark datasets and extensive ablations demonstrate that the proposed methodology increases precision, recall, NDCG, and MAP across various values of K. In addition, the post-processing step is especially effective in cold-start scenarios, consistently strengthening recommendation metrics and enhancing transparency at smaller values of K. Overall, integrating LLM-enriched representations with attention-based graph modeling enables more accurate and explainable entertainment recommendations.

Keywords:

recommender systems; large language models; graph attention networks; entertainment recommendation; hybrid recommendation models; cold-start recommendation; explainable recommendation systems

1. Introduction

The growing demand for personalized, context-aware, and interpretable recommendation systems has led to a surge of interest in integrating LLMs into the recommendation pipeline. Conventional recommender systems suffer from data sparsity, shallow contextual understanding, and the need for extensive manual feature engineering. LLMs represent a paradigm shift, offering richer feature representations, adaptive reasoning, and the ability to enhance transparency throughout the recommendation process.

Recent research has explored various avenues through which LLMs contribute to recommendation systems, including data augmentation, feature generation, re-ranking, and explanation generation. Several comprehensive surveys have analyzed their transformative influence across the recommendation pipeline. For instance, Zhao et al. [1] provide a comprehensive taxonomy of LLM-augmented recommenders, categorizing LLMs both as recommendation models and as auxiliary tools employed across diverse recommendation tasks. Complementing this perspective, studies by Wu et al. [2] and Lin et al. [3] emphasize the integration of LLMs in machine learning workflows beyond direct recommendation, underscoring their utility in data collection, preprocessing, knowledge alignment, ranking optimization, and pipeline enhancement.

We focus on entertainment recommendations and limit our study to music, movies, and books. This choice is motivated by fundamental psychological differences between entertainment preferences and utilitarian consumption behavior. In contrast to e-commerce settings, where purchasing decisions are often driven by immediate needs and situational factors, entertainment preferences tend to remain stable over time and are closely linked to personality traits. Prior work has shown that the Five-Factor Model reliably predicts consistent patterns in music preferences, movie genre choices, and reading habits, with personality traits explaining more variance than contextual influences [4]. This stability supports long-term user preference modeling and personality-aware recommendation strategies, setting entertainment recommendation systems apart from transactional domains that rely on session-based or context-dependent approaches.

This article represents a substantially extended and thoroughly revised version of our prior work presented at the KDD Workshop [5] and the 38th Canadian Conference on Artificial Intelligence [6]. These earlier studies established the feasibility of integrating Large Language Models with Graph Attention Networks for collaborative filtering within a primarily movie-centric setting. Building upon this foundation, the present work advances the framework both methodologically and empirically by introducing a more scalable and generalizable architecture tailored to multi-domain entertainment recommendation. Specifically, we redesign the preprocessing pipeline through a modular agent-oriented workflow, develop an enhanced multi-step profile generation, and propose an intelligent post-processing framework featuring confidence-aware reranking alongside a natural-language explanation agent. Furthermore, we expand the empirical evaluation across multiple datasets to strengthen the validity and external applicability of the approach. Collectively, these advancements substantially broaden the scope, rigor, and practical relevance of our previous research.

Contributions

This article presents an expanded and methodologically advanced framework for end-to-end personalization through the tight integration of LLM agents with attention-based graph learning. Extending our earlier work on hybrid semantic–graph recommendation, we develop a more scalable, generalizable, and interpretable architecture designed for complex entertainment ecosystems. The proposed framework strengthens representation learning, optimization robustness, and recommendation transparency while supporting reliable personalization under sparse and cold-start conditions. Unlike our prior studies, which focused primarily on feasibility, this work emphasizes architectural maturity, scalability, and empirical generalization. The principal contributions are summarized as follows:

A Generalizable End-to-End Personalization Architecture: We propose a unified recommendation paradigm that extends hybrid graph–semantic modeling from a single-domain setting to a multi-domain entertainment environment. This generalization enables the system to capture richer behavioral signals while improving robustness across heterogeneous item spaces and enabling cross-domain preference modeling.
Agent-Oriented Semantic Infrastructure for Representation Learning: We introduce a modular agentic design that restructures the preprocessing stage into an intelligent semantic infrastructure capable of transforming raw, heterogeneous metadata into structured representations. This design enhances scalability, promotes architectural flexibility, and supports reproducible feature construction for graph-based learning.
Refined Feature Initialization for Attention-Based Collaborative Filtering: We develop an improved feature construction methodology that strengthens the alignment between semantic representations and relational graph structure, thereby improving the capacity of the GAT to learn meaningful preference patterns under sparsity and cold-start conditions.
Post-Processing Intelligence through Confidence-Aware Agentic Reranking: We design a structured post-processing framework in which LLM-driven agents perform confidence-aware reranking over candidate recommendations. This mechanism introduces an additional reasoning layer that consistently improves recommendation quality while mitigating early-stage ranking noise.
Explainable Recommendation through Natural-Language Justification Agents: To address the growing demand for transparent AI systems, we integrate an explanation agent that generates context-sensitive natural-language rationales, thereby enhancing interpretability without sacrificing predictive performance.
Comprehensive Empirical Validation Across Domains and Datasets: We conduct extensive experiments spanning multiple datasets and entertainment modalities, covering general, cold-start, and warm cases, and demonstrate consistent gains in ranking metrics with particularly strong improvements in cold-start scenarios. These findings highlight the framework’s capacity for reliable personalization in data-constrained environments.

2. Literature Review

This section reviews the relevant literature across three key dimensions: (1) the use of LLMs for feature engineering and representation learning, (2) LLMs for ranking refinement and interpretability, and (3) the use of Graph Attention Networks in recommender systems.

2.1. LLMs for Feature Engineering

Feature engineering plays a critical role in recommender system performance, although it has traditionally required extensive manual effort and domain expertise. Recent advances in LLMs have shifted this paradigm by enabling automated, context-aware transformation of raw textual data into semantically enriched representations. These capabilities are especially valuable under sparse or noisy data conditions, where conventional techniques struggle to capture nuanced user–item relationships. By acting as dynamic knowledge sources, LLMs generate auxiliary semantic features that enhance user preference modeling and item understanding, fundamentally reshaping data preprocessing and representation learning in recommendation pipelines.

Prior work on LLM-based representation enhancement can be broadly categorized into user- and item-level feature augmentation and instance-level sample generation, following the taxonomy of Lin et al. [3]. While instance-level methods focus on synthetic data generation, this work concentrates on entity-level semantic enrichment, using LLMs to construct structured user and item profiles. Prompt-based feature extraction has been shown to effectively capture latent semantic patterns from raw data, supporting downstream recommendation tasks across diverse settings.

Several representative methods demonstrate the promise and limitations of LLM-driven feature engineering. KAR extracts user preference knowledge and item factual knowledge as plug-in features for conventional models. However, it relies only on static feature construction, which limits adaptability [7]. SAGCN employs chain-based prompting to extract aspect-aware semantic interactions, though its performance is highly sensitive to prompt design [8]. CUP addresses input length constraints through compact user profile summarization, improving efficiency at the cost of fine-grained preference detail [9]. L3AE distills LLM-derived item semantics into a linear autoencoder framework, achieving computational efficiency but limiting expressiveness and user-side semantic modeling, particularly in sparse or noisy text settings [10].

Graph-based approaches have further integrated LLM representations into collaborative filtering. LLM-Augmented Graph Neural Recommenders combine review-derived embeddings with GNNs using a hybrid objective but depend heavily on review availability and incur high computational costs [11]. Knowledge-aware methods augment graphs using LLM-generated features and preference nodes, yet face challenges related to factual correctness, domain brittleness, and maintenance overhead [12]. Domain-specific systems such as LLaMA-E [13] and EcomGPT [14] show strong performance in e-commerce through fine-tuning, but their generalizability beyond targeted domains remains limited [1,2]. Similar concerns arise in educational recommendation, where naïve fusion of LLM-generated semantic features can introduce redundancy and noise [15].

Beyond recommendation, LLMs have been applied to knowledge graph completion, text refinement, attribute generation, and user interest modeling, demonstrating their effectiveness in mitigating sparsity and cold-start issues [16,17,18,19]. In this context, MI4Rec introduces meta-item embeddings for cold-start recommendation, though interpretability and content dependency remain open challenges [20].

Despite these advances, LLM-based feature engineering risks introducing semantic noise or bias if outputs are not carefully structured and validated. The key challenge lies in ensuring semantic coherence and alignment with downstream models. In this paper, we address this issue through an end-to-end personalization framework that integrates LLMs and GATs in a principled, schema-aligned manner. Our approach leverages LLMs for structured semantic preprocessing that directly informs embedding initialization and graph-based learning, improving personalization robustness and generalizability. Building on our prior work, which demonstrated the effectiveness of LLMs for dynamic and explainable user modeling, we further refine their role in representation learning for scalable recommendation systems [5].

2.2. LLMs for Ranking and Interpretability

Ranking plays a critical role in shaping user experience, influencing not only which items are recommended but also the order in which they are prioritized. Conventional models, including matrix factorization [21], sequence-based predictors [22], and graph neural networks [23], offer strong performance but often lack transparency and interpretability, qualities increasingly demanded in sensitive domains such as health, finance, and education. Traditional ranking models have relied on neural embeddings generated by feature encoders and various machine learning techniques, including collaborative filtering [24], yet struggle to incorporate the nuanced contextual understanding that LLMs provide.

With the emergence of LLMs, a new paradigm has been introduced, leveraging their advanced reasoning abilities and contextual understanding to enhance ranking effectiveness. In modern recommendation systems, scoring and ranking functions serve the fundamental purpose of generating a ranked list of items that best match a user’s preferences. Traditionally, this involved models estimating utility scores based on latent user–item interactions. LLMs, however, have expanded these capabilities by incorporating textual data, neural embeddings, or a combination of both to perform ranking in more flexible and context-aware ways. This evolution has led to the development of LLM-based ranking approaches that improve user experience by refining item selection and ordering mechanisms.

Existing LLM-based ranking systems fall into three primary categories. First, item scoring approaches treat LLMs as pointwise estimators, computing preference scores for user–item pairs that dictate the ranking order of candidate items. A common challenge in using LLMs for this task is their inherent design for token generation rather than numerical scoring. Researchers have developed three major solutions to address this limitation. Single-tower models, such as E4SRec and ClickPrompt, replace the language modeling decoder head with a multi-layer perceptron to generate continuous preference scores, offering efficiency but limiting the expressive capabilities of the underlying model [25]. Two-tower frameworks like CoWPiRec and TASTE maintain separate towers for user and item representations and use distance metrics for preference scores, providing scalability but modeling only shallow interactions and relying on fixed similarity metrics [26]. Classification approaches like TALLRec reformulate ranking as prediction tasks yet often struggle with score calibration in multi-item settings [27].

Second, item generation approaches involve LLMs directly producing a ranked list of items based on user inputs, relying on the model’s reasoning abilities to infer preferences. This process can be categorized into open-set item generation, where LLMs generate recommendations without predefined candidate sets, and closed-set item generation, where a lightweight retrieval model pre-filters candidate items for LLM-based ranking. Open-set generation has been explored in models such as LANCER and Di Palma et al., offering greater flexibility but suffering from generative hallucinations [28]. Closed-set generation, utilized in works like LlamaRec and DRDT, is more stable but constrained by pre-filtered candidate sets [29,30]. Studies reveal that the order of candidate items in prompts can influence LLM ranking decisions, introducing potential biases. Thus, selecting between these approaches depends on application-specific constraints and performance trade-offs [31].

Third, hybrid approaches that integrate both scoring and generation mechanisms offer promising directions but remain underexplored. To address these limitations, our approach combines graph-based representation learning with a lightweight LLM reranker. By decoupling ranking from generation and leveraging semantically structured profiles, we achieve higher interpretability, lower computational cost, and stronger alignment with user intent. This design choice allows us to harness the relational strength of graph models while maintaining the contextual awareness that LLMs provide.

Lastly, LLMs have significantly transformed result ranking in recommendation systems by enabling more nuanced and context-aware ranking mechanisms. While their integration has improved personalization and interpretability, challenges remain in optimizing computational efficiency, reducing biases, and mitigating generative hallucinations. Future research should explore hybrid models that combine scoring and generation tasks while refining prompt engineering techniques for more reliable ranking outputs. Addressing these challenges will be key to unlocking the full potential of LLMs in recommendation systems.

2.3. Graph Attention Networks in Recommendation Systems

Graph-based collaborative filtering methods have become foundational in modern recommender systems due to their ability to model high-order interactions in user–item bipartite graphs. Traditional techniques like matrix factorization (e.g., SVD, ALS) struggle to incorporate contextual signals and often underperform in cold-start scenarios [32]. Early propagation-based models, including ItemRank and BiRank, improved upon this by diffusing preferences across the graph structure, yet lacked trainable parameters, reducing their expressiveness and limiting their ability to adapt to complex interaction patterns [33,34].

The introduction of Graph Neural Networks (GNNs) transformed this landscape by enabling message-passing architectures to capture both local and global interaction patterns. GC-MC and PinSage demonstrate how incorporating node features and graph topology can enhance recommendation accuracy [35,36]. NGCF introduced explicit multi-hop connectivity through stacked convolutions, enabling the model to capture collaborative signals across multiple hops in the user–item graph [21], though its complexity raised concerns about overfitting and computational scalability. LightGCN addressed this by simplifying the architecture, removing activation functions, and emphasizing pure neighborhood aggregation, demonstrating that cleaner propagation mechanisms could achieve competitive performance with reduced complexity [37]. Alternative approaches like TextGCN apply parameter-free graph convolution layers to LLM-derived item embeddings, subsequently training a two-tower MLP for in-domain performance enhancement [38]. While computationally efficient, the parameter-free graph propagation steps may lack adaptability to domain-specific interaction patterns, particularly when the underlying interaction graph is sparse or exhibits systematic biases, and the approach remains predominantly focused on item embedding propagation rather than comprehensive user profile generation.

GATs further extend this paradigm by assigning adaptive weights to neighbors during message passing, allowing for more selective and context-sensitive representation learning. Attention-based recommendation models like IGAT and TKGAT improved flexibility by dynamically weighting neighborhood contribution, though at the cost of scalability and interoperability in large-scale deployments [39,40]. The attention mechanism enables the model to discriminate between more and less relevant neighbors, capturing heterogeneous relationship strengths that uniform aggregation schemes cannot express.
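To make the idea of adaptive neighbor weighting concrete, the following minimal sketch computes single-head attention coefficients in the style of the standard GAT formulation: a LeakyReLU-activated score for each neighbor, normalized with a softmax. It assumes the node features have already been linearly projected, and the function names, the split of the attention vector into `a_left` and `a_right`, and the toy values are illustrative assumptions rather than details of IGAT, TKGAT, or the model proposed here.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU with the negative slope commonly used in GAT (0.2)."""
    return x if x > 0 else slope * x


def attention_weights(h_target, h_neighbors, a_left, a_right):
    """Softmax-normalized attention of one node over its neighbors.

    Scores follow e_ij = LeakyReLU(a_left . h_i + a_right . h_j), which is
    equivalent to applying one attention vector to the concatenation [h_i || h_j].
    """
    left = sum(a * x for a, x in zip(a_left, h_target))
    scores = [
        leaky_relu(left + sum(a * x for a, x in zip(a_right, h_j)))
        for h_j in h_neighbors
    ]
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(h_neighbors, weights):
    """Attention-weighted sum of neighbor features."""
    dim = len(h_neighbors[0])
    return [sum(w * h[d] for w, h in zip(weights, h_neighbors)) for d in range(dim)]
```

In contrast to uniform aggregation, where every neighbor would receive weight 1/|N(i)|, the weights returned here depend on the features of both endpoints, which is what allows heterogeneous relationship strengths to be expressed.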

RLMRec proposes a model-agnostic framework that bridges LLM-generated semantic representations with graph-based collaborative filtering through a cross-view alignment objective. The framework uses LLMs to construct user and item profiles, encodes them into dense semantic vectors, and aligns these frozen representations with the collaborative embeddings learned by the CF backbone via contrastive or generative alignment losses throughout training. While RLMRec demonstrates consistent improvements over purely ID-based CF models, its alignment-during-training strategy introduces a competing objective that can constrain the backbone’s capacity to exploit structural signals, particularly in attention-based architectures where representational flexibility is essential [41].

In this work, we adopt a lightweight GAT architecture that integrates LLM-derived semantic profiles as node features, combining the relational strength of GNNs with the contextual richness of LLMs. This fusion improves performance in sparse regimes and supports personalization through adaptive, meaningful attention. By initializing node embeddings with semantically enriched representations, our approach enhances the expressiveness of the attention mechanism and enables the model to leverage both structural and semantic information for superior recommendation quality. Our framework builds upon related efforts in unifying LLM-generated semantic profiles with graph-based collaborative filtering, addressing inherent limitations through enhanced multi-turn user preference modeling and integrated reranking mechanisms. While such multi-stage pipelines (LLM preprocessing → GAT training → LLM reranking) can introduce diagnostic complexity and potential latency issues, our schema-aligned approach ensures semantic consistency and provides interpretable justifications, mitigating concerns about metadata completeness dependency and end-to-end trainability.

3. Methodology

This section presents the proposed methodology, which is organized into three distinct yet interrelated phases: (i) pre-processing, LLM-based profile generation, and feature engineering, (ii) collaborative filtering model development and training, and (iii) LLM-based post-processing and explainability. Each phase is designed to ensure a coherent integration of LLMs within the recommender system pipeline, with explicit justification for their role at each stage. Figure 1 provides an overview of the high-level methodological framework, illustrating the sequential flow of these phases and their interactions within the overall system architecture.

3.1. Item Profile Preprocessing

We propose an LLM-driven item profile preprocessing pipeline that transforms heterogeneous and sparsely structured metadata into standardized, semantically rich representations. This stage is critical, as the quality and consistency of item profiles directly affect collaborative learning and downstream LLM-based reasoning.

Unlike traditional static feature engineering, our approach employs domain-specialized LLM agents for movies (MovieLens datasets [42]), books (Goodbook dataset [43]), and music (music4all [44]). Each agent processes raw or weakly structured metadata—including textual descriptions when available—to generate structured item profiles that capture latent attributes, themes, stylistic characteristics, and contrastive signals absent from the original data. All profiles conform to a shared schema, ensuring interoperability across domains and strict alignment with user profile representations.

3.1.1. Item Profile Generation

Item profile generation follows five design principles: (i) semantic expressiveness to expose high-level concepts beyond raw metadata; (ii) domain-aware generation via expert role prompting; (iii) structural consistency through fixed schemas; (iv) operational robustness using validation, retries, and incremental persistence; and (v) scalability through item-wise, resumable processing for large catalogs.

3.1.2. Item Profile: Across Content Domains

Item profile generation is implemented as a unified agent-driven pipeline that adapts to domain-specific characteristics while preserving a consistent output format. Rather than employing isolated preprocessing workflows for movies, books, and music, the system uses domain-specialized agents operating under a shared operational framework with domain-specific parsing strategies, enrichment logic, and semantic emphasis.

Across all domains, processing begins with raw or weakly structured item features such as titles, genres, release or publication information, and available descriptions. These inputs are transformed by the corresponding agent into structured, semantically enriched item profiles serialized in markdown format, ensuring compatibility with embedding models, similarity search, and agent-based reasoning modules.

Domain-specific adaptations are introduced at the semantic enrichment stage. For movies, external metadata sources are incorporated to augment sparse descriptions, enabling richer narrative and thematic modeling. Movie profiles emphasize narrative structure, genre and thematic attributes, and explicitly include contrastive Dislikes components that capture absent or opposing themes. These negative semantic signals improve boundary definition in embedding spaces and support contrastive representation learning.

For books, the generation process emphasizes literary characterization and bibliographic completeness. Given the relatively limited and structured nature of book metadata, the agent adopts a librarian and literary analyst role to infer fine-grained genre distinctions, stylistic elements, and narrative scope. Book profiles are generated via a two-stage process: an initial LLM output constrained by a strict JSON schema, followed by validation and conversion to standardized markdown. This intermediate validation step ensures structural correctness and reduces downstream parsing errors.
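The two-stage book workflow can be sketched as follows. This is a minimal illustration, assuming a hypothetical four-field schema (`title`, `summary`, `genres`, `themes`) and hypothetical helper names; the actual schema and markdown layout used by the agents are not specified in this section.

```python
import json

# Hypothetical schema (an assumption for illustration): field name -> expected type.
REQUIRED_FIELDS = {"title": str, "summary": str, "genres": list, "themes": list}


def validate_profile(raw_json):
    """Stage 1 check: parse the LLM output and verify it against the assumed schema."""
    data = json.loads(raw_json)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("profile must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"wrong type for field: {field}")
    return data


def to_markdown(profile):
    """Stage 2: convert a validated profile into a standardized markdown document."""
    lines = [f"# {profile['title']}", "", "## Summary", profile["summary"], ""]
    for section in ("genres", "themes"):
        lines.append(f"## {section.capitalize()}")
        lines.extend(f"- {value}" for value in profile[section])
        lines.append("")
    return "\n".join(lines).strip() + "\n"
```

Because validation fails loudly with a `ValueError`, a caller can treat a malformed response as a retryable error instead of passing a structurally broken profile to the embedding stage.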

Music items follow a more selective generative strategy due to the availability of lyrical content and the adequacy of existing structured metadata. Generative modeling is applied primarily to truncated lyrical excerpts to balance semantic coverage and computational efficiency. The agent produces concise semantic summaries—typically one to two sentences—focusing on emotional tone and thematic content rather than technical musical attributes. Artist, album, and genre metadata are incorporated directly without generative transformation, yielding compact and embedding-efficient profiles. Figure 2 provides an overview of the methodological framework for both item preprocessing and user preprocessing discussed in the following section, illustrating the sequential flow.

Across all domains, the pipeline operates at the item level and incorporates robust safeguards, including retrying mechanisms, incremental saving, safe overwriting with backups, and failure isolation. These features ensure resilience to rate limits, API failures, and interruptions, enabling scalable preprocessing over large catalogs. Collectively, this adaptive yet unified item profile preprocessing strategy establishes a semantically rich and structurally consistent foundation for the hybrid recommender system developed in subsequent sections.
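The safeguards described above (retries, incremental saving, safe overwriting with backups, and failure isolation) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `generate_profile` stands for any callable that produces a profile for one item, item identifiers are assumed to be strings, profiles are assumed to be JSON-serializable, and the exponential backoff schedule and file naming (`.bak`, `.tmp`) are hypothetical choices.

```python
import json
import os
import shutil
import time


def call_with_retries(fn, item, max_attempts=3, base_delay=1.0):
    """Call fn(item), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(item)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def save_with_backup(profiles, path):
    """Back up the previous file, write to a temp file, then replace atomically."""
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(profiles, handle)
    os.replace(tmp_path, path)


def process_catalog(items, generate_profile, path, base_delay=1.0):
    """Item-wise, resumable processing: finished items are skipped on restart."""
    profiles = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            profiles = json.load(handle)
    failed = []
    for item_id, metadata in items.items():
        if item_id in profiles:
            continue  # already processed in an earlier run
        try:
            profiles[item_id] = call_with_retries(
                generate_profile, metadata, base_delay=base_delay
            )
        except Exception:
            failed.append(item_id)  # failure isolation: one bad item does not stop the run
            continue
        save_with_backup(profiles, path)  # incremental persistence after each success
    return profiles, failed
```

Saving after every successful item trades some I/O overhead for the guarantee that an interruption loses at most the item in flight; batching the saves would be a reasonable alternative for very large catalogs.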

3.2. User Profile Preprocessing

The preprocessing methodology is dataset-agnostic and is applied uniformly across all evaluated domains (movies, books, and music). While the underlying content differs by domain, the interaction schema, profile structure, and agentic workflow remain identical, ensuring methodological consistency and comparability across datasets. The primary objective of this stage is to generate structured textual user representations that are semantically aligned with item profiles from the previous stage, thereby enabling meaningful user–item similarity computation and downstream LLM-based reasoning. Figure 3 shows sample user and item profiles generated by this pipeline.

The proposed preprocessing pipeline is guided by four primary design objectives. First, semantic alignment is prioritized to ensure that user and item representations share an identical structural and conceptual format, enabling direct comparison. Second, preference expressiveness is enhanced by modeling user tastes in natural language rather than sparse numerical features, allowing for multi-faceted and nuanced preference capture. Third, interpretability is emphasized through the generation of human-readable profiles that support explainable recommendation outcomes. Finally, scalability and robustness are addressed through a design that supports large-scale processing while remaining resilient to sparse data and system interruptions.

3.2.1. Input and Output Data Representation

The preprocessing pipeline consumes two complementary inputs. The first consists of explicit user–item interaction records, each containing a user identifier, an item identifier, and an explicit feedback signal expressed on a fixed ordinal 1–5 rating scale. The second input comprises structured textual item profiles generated in the previous stage for all items in the catalog; these profiles serve as the semantic basis for user profile construction.

The output of this stage is a single structured textual profile per user, strictly conforming to the same schema used for item profiles. This exact structural mirroring ensures compatibility with downstream embedding models and reasoning components. Generated profiles are stored in a machine-readable format suitable for large-scale retrieval, similarity computation, and integration into hybrid recommender system architectures.

A critical design constraint in the preprocessing pipeline is the strict avoidance of data leakage between profile construction and model evaluation. User profiles are generated exclusively from interactions present in the training split. In datasets with temporal information, this means that only interactions occurring before the defined temporal boundary are used during profile synthesis. Test set interactions are not visible to the LLM agents at any stage of profile construction. This constraint is enforced programmatically: the preprocessing pipeline receives only the training interaction records as input, and test set items are excluded from the selection of positively and negatively rated items used for preference synthesis and dislike identification. As a result, the semantic user representations reflect only historical preference signals, accurately simulating the information state available at deployment time.
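A minimal sketch of how such a leakage guard might be enforced is given below. The `Interaction` record, the function names, and the use of a single integer timestamp boundary are assumptions made for illustration; the section states only that profile construction receives training interactions exclusively.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    user_id: str
    item_id: str
    rating: int
    timestamp: int


def temporal_split(interactions, boundary):
    """Interactions strictly before the boundary form the training split."""
    train = [x for x in interactions if x.timestamp < boundary]
    test = [x for x in interactions if x.timestamp >= boundary]
    return train, test


def profile_inputs(train):
    """Group training interactions per user; this is all the LLM agents receive."""
    by_user = {}
    for x in train:
        by_user.setdefault(x.user_id, []).append(x)
    return by_user


def assert_no_leakage(by_user, test):
    """Raise if any held-out (user, item) pair appears in the profile inputs."""
    held_out = {(x.user_id, x.item_id) for x in test}
    for user_id, rows in by_user.items():
        for x in rows:
            if (user_id, x.item_id) in held_out:
                raise ValueError(f"test interaction leaked: {user_id}, {x.item_id}")
```

Passing only the output of `profile_inputs(train)` to the profile agents makes the constraint structural, and the explicit check serves as a guard against accidental reuse of the full interaction log.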

3.2.2. User Profile Generation Strategy

User interaction histories are first filtered using predefined rating thresholds to distinguish between positive and negative preference signals. Ratings above a high threshold are treated as indicators of positive preference (4 and 5), while ratings below a low threshold are interpreted as negative preference signals (1 and 2). Interactions with mid-scale ratings are excluded due to their inherent ambiguity, following prior findings that extreme ratings convey higher informational value in preference elicitation.

For each user, up to ten positively rated items and ten negatively rated items are selected. Items are ranked primarily by rating strength (rated 5 comes before 4) and secondarily by interaction recency when temporal metadata is available. This selection strategy balances representational diversity with noise reduction while remaining compatible with LLM context length constraints.

The pipeline incorporates explicit mechanisms to handle sparse or incomplete interaction histories. Users with limited positive feedback are processed using all available positive signals. If no negative feedback is present, the dislikes component is omitted without introducing artificial assumptions. These design choices preserve semantic validity while ensuring robustness to data sparsity.
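The filtering and selection rules above can be illustrated with a minimal sketch. The `Interaction` type and the `select_profile_items` function are hypothetical names introduced only for this example, and ordering disliked items with rating 1 before rating 2 is an assumption that mirrors the "5 before 4" rule stated for liked items.

```python
from typing import NamedTuple, Optional


class Interaction(NamedTuple):
    item_id: str
    rating: int                # explicit rating on the 1-5 scale
    timestamp: Optional[int]   # None when the dataset has no temporal metadata


def select_profile_items(history, max_items=10):
    """Split one user's training interactions into liked and disliked items."""

    def recency(x):
        # Interactions without a timestamp are treated as the oldest ones.
        return x.timestamp if x.timestamp is not None else float("-inf")

    # Ratings 4-5 are positive signals, 1-2 are negative, 3 is dropped as ambiguous.
    likes = [x for x in history if x.rating >= 4]
    dislikes = [x for x in history if x.rating <= 2]

    # Strongest rating first, then most recent first.
    likes.sort(key=lambda x: (-x.rating, -recency(x)))
    # Assumption: the strongest dislike (rating 1) comes before rating 2.
    dislikes.sort(key=lambda x: (x.rating, -recency(x)))

    # Sparse users simply yield shorter (possibly empty) lists.
    return likes[:max_items], dislikes[:max_items]
```

An empty `dislikes` list corresponds to the case in which the dislikes component of the profile is omitted.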

User profile generation is implemented as a sequential two-agent LLM-based workflow. The first agent synthesizes user preferences from positively rated items, while the second agent identifies aversions from negatively rated items conditioned on the generated preferences. This decomposition reduces cognitive load per agent, minimizes logical inconsistencies between likes and dislikes, and improves overall output coherence by enabling context-aware reasoning across stages.

3.2.3. Preference and Dislike Modeling

The preference synthesis agent processes structured profiles of positively rated items to generate the core components of the user profile, including a high-level summary, consolidated preferred attributes, and a descriptive narrative. The agent is constrained to use only vocabulary present in the source item profiles, merge semantically equivalent or closely related attributes, and conform strictly to a predefined output schema. Rather than enumerating item-specific features, the agent focuses on identifying recurring preference patterns across the user’s positively rated items. This vocabulary constraint applies specifically to the user profile synthesis stage and is intentionally scoped to prevent the introduction of semantic content that is inconsistent with the item catalog. It does not contradict the semantic enrichment performed during item profile generation in Section 3.1. At the item preprocessing stage, LLM agents augment sparse or weakly structured raw metadata—including external sources for movies and lyrical content for music—with latent thematic attributes, narrative descriptors, and contrastive signals that are absent from the original dataset. This enriched vocabulary is subsequently encoded into the item profiles. The user profile synthesis agent then operates over these already-enriched profiles, constructing user representations using the semantically expanded vocabulary established during item preprocessing. Therefore, the constraint ensures that user representations remain grounded in the item space, supporting coherent similarity computation, without limiting the expressive range of that space. In other words, semantic enrichment occurs at the item level; the vocabulary constraint at the user level guarantees alignment between the two representation types.

The dislike identification agent operates on negatively rated items while being explicitly conditioned on the preferences generated by the first agent. Its role is to identify consistent aversions that refine the user’s preference space without contradicting or duplicating established preferences. Dislikes are therefore modeled as contrastive boundaries rather than absolute negations, yielding more precise and semantically coherent user representations.

3.2.4. Implementation and Validation

A central design decision is the exact structural alignment between user and item profiles, which enables both to be embedded into a shared vector space and compared directly using standard similarity measures. The preprocessing pipeline is parallelized across five worker processes, each handling an independent subset of users. This design achieves near-linear scalability, supports incremental execution, and provides fault tolerance through checkpoint-based resumption, preventing unnecessary recomputation. Generated profiles undergo multi-level validation, including schema compliance checks, semantic consistency verification between preferences and dislikes, and manual sampling-based quality assessment. These measures ensure structural correctness, logical coherence, and overall output fidelity prior to downstream use.

Furthermore, GPT-5-nano was selected as the LLM backbone for all generative steps in this pipeline for several complementary reasons. First, the role of the LLM in this system is instructional rather than creative: each agent is given a precisely defined schema and constrained output format, and the model is expected to populate that structure faithfully rather than exercise open-ended generation. Under these conditions, a lightweight model operating on well-specified prompts consistently achieves output fidelity comparable to larger reasoning models, without incurring their latency or cost. Second, GPT-5-nano offers low inference latency per call, which is critical given that profile generation and reranking are applied at scale across thousands of users and items. Third, its API-accessible deployment is compatible with parallel execution across worker processes, allowing for scaling without requiring local GPU infrastructure. Finally, to support reproducibility and further research, the pipeline is implemented in a model-agnostic manner and is fully compatible with locally hosted open-source models available through Ollama, enabling ablation studies that substitute alternative LLMs without modifying the broader system architecture.

3.3. Model Training

Following profile preprocessing, the proposed pipeline transitions from semantic representation construction to collaborative refinement. In this phase, the recommender is formulated as a graph-based collaborative filtering problem in which user–item interactions define a bipartite structure, and LLM-derived semantic embeddings serve as informed initial conditions rather than handcrafted features. The central objective is to learn task-adapted user and item representations that preserve the interpretability and cold-start utility of semantic profiles while improving ranking quality through interaction-driven optimization. Figure 4 illustrates the high-level procedure for model training.

3.3.1. Semantic Embedding Initialization

Each user profile and item profile generated in Section 3.1 and Section 3.2 is converted into a fixed-dimensional dense vector using the all-MiniLM-L6-v2 embedding model. This embedding step maps both entity types into a shared semantic space, which is critical to maintain comparability between user and item representations and to support subsequent graph message passing over heterogeneous nodes. The resulting vectors are used to initialize a unified node feature matrix, where user nodes occupy the first block of rows and item nodes occupy the second block. Unlike conventional collaborative filtering pipelines that initialize latent factors randomly, the proposed approach uses these profile-derived vectors as trainable parameters. Consequently, training does not discard the semantic signal introduced by the LLM; instead, it refines it in a data-driven manner according to the observed interaction structure.

This design is motivated by the observation that LLM-based profiles encode high-level attributes and latent preferences that are difficult to infer from sparse interaction matrices alone, especially under cold-start or long-tail conditions. Allowing these embeddings to remain trainable ensures that the model can correct profile noise, attenuate irrelevant semantic dimensions, and adapt representations to the recommendation objective.

It is important to distinguish the functional roles of the two model components used in this pipeline. GPT-5-nano is a generative language model employed exclusively for structured text generation tasks: constructing semantically enriched item profiles and synthesizing user preference profiles from interaction histories. It produces natural language output conforming to a predefined schema and does not generate vector representations. By contrast, all-MiniLM-L6-v2 is a dedicated sentence embedding model that maps textual descriptions into fixed-dimensional dense vectors (384 dimensions) within a continuous semantic space. This model is optimized for semantic similarity tasks and produces compact, embedding-efficient representations suitable for direct use in vector operations, graph initialization, and nearest-neighbor retrieval. The two models therefore serve complementary and non-overlapping functions: GPT-5-nano handles language understanding and structured generation, while all-MiniLM-L6-v2 handles semantic vectorization.

3.3.2. Interaction Graph Construction

The recommendation task is represented as a bipartite graph G(V, E), where the node set V consists of users U and items I, and edges E correspond to observed interactions. Each interaction between user u ∈ U and item i ∈ I is encoded as an edge, and for message passing symmetry, the graph is treated as bidirectional (i.e., both u → i and i → u are included). When explicit ratings are available, the rating value is incorporated as an edge attribute, enabling the model to distinguish strong preference signals from weaker interactions during neighborhood aggregation. This formulation supports collaborative signal propagation across the graph while preserving the heterogeneous nature of users and items.
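As a minimal sketch of this construction, the function below builds a bidirectional edge list with ratings as edge attributes, placing item nodes after the user block as described in Section 3.3.1. The function name and the (user index, item index, rating) input format are assumptions made for illustration; the two-row edge layout follows the common convention of graph learning libraries rather than any detail confirmed by the text.

```python
def build_bipartite_edges(interactions, num_users):
    """Build a bidirectional edge list for a user-item bipartite graph.

    interactions: iterable of (user_index, item_index, rating) tuples with
    zero-based indices (hypothetical input format).
    Returns ([sources, targets], edge_attributes).
    """
    sources, targets, attrs = [], [], []
    for user, item, rating in interactions:
        item_node = num_users + item  # items occupy the second block of node ids
        # u -> i and i -> u are both added so messages flow in both directions.
        sources += [user, item_node]
        targets += [item_node, user]
        # The rating is attached to both directions of the edge.
        attrs += [float(rating), float(rating)]
    return [sources, targets], attrs
```

With two users and one item, the interactions `[(0, 0, 5), (1, 0, 3)]` yield four directed edges, and the item is addressed as node 2.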

The graph representation provides an inductive bias aligned with classic collaborative filtering: users become similar through shared item neighborhoods, and items become similar through shared user neighborhoods. However, in the proposed pipeline, the graph does not operate over anonymous ID embeddings; it operates over semantically grounded vectors that already encode interpretable preference and content descriptors, thereby enabling collaborative learning to act as a refinement mechanism rather than a purely reconstructive one.

3.3.3. Graph Attention Architecture

To model user–item dependencies, we adopt a multi-layer GAT that updates node representations through attention-weighted message passing. Each GAT layer computes an updated representation by aggregating transformed neighbor embeddings, where the contribution of each neighbor is controlled by learned attention coefficients. In contrast to uniform aggregation (e.g., mean pooling in simplified graph convolution), attention allows the model to learn which interactions are most informative for refining a user’s preference representation or an item’s collaborative context. This is particularly relevant in recommendation graphs where interaction histories vary widely in quality and informativeness.

Edge attributes (e.g., ratings or frequency of occurrence) are integrated into the attention mechanism so that interaction strength can directly influence message weighting. The architecture employs multi-head attention in early layers to capture multiple relational “views” of the neighborhood structure and to improve representational capacity. To stabilize optimization and preserve the semantic information present in the LLM-derived initialization, residual connections are added between layer inputs and outputs, and normalization is applied after attention-based updates. The depth is selected to balance expressiveness with the known risk of over-smoothing in graph recommender models, where excessive propagation can collapse node representations into indistinguishable vectors.

The output of the network is a set of refined user embeddings and item embeddings in a lower-dimensional bottleneck space designed for efficient ranking. These refined embeddings are used for scoring candidate items for each user through a standard similarity operation (dot product), enabling scalable retrieval and evaluation.
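The scoring step can be sketched as follows, assuming the refined embeddings are available as plain lists of floats. The function name, the dictionary layout, and the tie-breaking by item identifier are hypothetical choices for this example; excluding previously seen items follows the evaluation protocol described in Section 4.1.3.

```python
def top_k_items(user_vec, item_vecs, seen, k):
    """Rank unseen items for one user by dot-product score.

    user_vec: refined user embedding (list of floats).
    item_vecs: {item_id: refined item embedding}.
    seen: set of item ids the user already interacted with in training.
    """
    scores = {
        item: sum(a * b for a, b in zip(user_vec, vec))
        for item, vec in item_vecs.items()
        if item not in seen
    }
    # Highest score first; ties are broken by item id to keep output deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]
```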

3.3.4. Training Objective and Hybrid Optimization

Training optimizes a composite objective that couples a collaborative ranking criterion with a semantic alignment regularizer. The collaborative component adopts Bayesian Personalized Ranking (BPR) [44], which is widely used for learning implicit ranking functions and is effective in sparse recommendation settings. Given a user u, a positive item i⁺, and a sampled negative item i⁻, BPR encourages the score of i⁺ to exceed the score of i⁻. The scoring function is defined as

s(u, i) = z_{u}^{\top} z_{i}

and the BPR loss over sampled triplets is given by

L_{BPR} = - \sum_{(u, i^{+}, i^{-})} \log \sigma (s(u, i^{+}) - s(u, i^{-}))

where z_u and z_i are the refined user and item embeddings produced by the GAT, and σ denotes the sigmoid function.

While BPR captures collaborative structure, it does not explicitly preserve the semantic organization introduced by LLM-based profiles. Without additional constraints, the model may drift toward representations that maximize ranking performance but lose interpretability and degrade cold-start behavior. To mitigate this, a cosine similarity regularization term is introduced to encourage alignment between user representations and their positively interacted items in the learned space:

L_{cos} = - \frac{1}{|P|} \sum_{(u, i) \in P} \cos(z_{u}, z_{i})

where P denotes the set of positive user–item pairs. The total loss is defined as a weighted sum:

L_{total} = \lambda_{BPR} L_{BPR} + \lambda_{cos} L_{cos}

This hybrid formulation operationalizes the intended role of LLMs in the pipeline: semantic profiles provide a meaningful inductive bias, whereas interaction data determines the final task-optimal configuration. In practice, the collaborative term is prioritized to ensure competitive recommendation accuracy, while the cosine term acts as a regularizer that reduces catastrophic semantic drift and helps maintain coherent geometry in the embedding space.
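To make the composite objective concrete, the following sketch evaluates it on plain Python lists. It is a reading of the formulas above under stated assumptions, not the training code: the function names and the dictionary-based embedding layout are hypothetical, the positive set P is taken to be the distinct (user, positive item) pairs among the sampled triplets, and gradients are not computed.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def neg_log_sigmoid(x):
    # Numerically stable -log(sigmoid(x)), i.e. softplus(-x).
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))


def hybrid_loss(z_user, z_item, triplets, lam_bpr, lam_cos):
    """Weighted sum of the BPR loss and the cosine alignment term.

    z_user, z_item: {id: embedding}; triplets: list of (u, i_pos, i_neg).
    """
    # BPR: push s(u, i+) above s(u, i-) for every sampled triplet.
    bpr = sum(
        neg_log_sigmoid(dot(z_user[u], z_item[p]) - dot(z_user[u], z_item[n]))
        for u, p, n in triplets
    )
    # Assumption: P is the set of distinct positive pairs seen in the triplets.
    positives = {(u, p) for u, p, _ in triplets}
    cos = -sum(cosine(z_user[u], z_item[p]) for u, p in positives) / len(positives)
    return lam_bpr * bpr + lam_cos * cos
```

Swapping the positive and negative items in a triplet increases the value returned by `hybrid_loss`, which is the behavior both terms are meant to encode.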

Positive interactions are derived from explicit feedback using a high-rating threshold, ensuring that training signals reflect clear user preference rather than ambiguous mid-scale responses. Negative samples are drawn from items that the user explicitly rated low, which provides a stronger signal than the common assumption that unobserved interactions are more likely to be irrelevant than relevant. Model parameters are optimized using a modern adaptive optimizer with weight decay to improve generalization. Learning rate scheduling is applied to reduce the step size when optimization plateaus, and early stopping is employed to prevent overfitting and unnecessary computation. Checkpointing is incorporated so that training can resume after interruptions, and the best-performing model state can be retained for evaluation.

3.3.5. Hyperparameter Configuration

The GAT architecture consists of four successive graph attention layers with output channel dimensions of [64, 64, 64, 64]. The first three layers employ four attention heads each, with multi-head outputs concatenated to preserve representational diversity. The final layer uses a single attention head to produce a fixed-dimensional bottleneck embedding suitable for ranking. All intermediate representations share the same dimensionality of 64 units, ensuring consistent feature propagation across layers.

Model parameters are optimized using AdamW with an initial learning rate of 0.001 and a weight decay coefficient of 1 × 10⁻⁵ to improve generalization. A ReduceLROnPlateau scheduler reduces the learning rate by a factor of 0.4 whenever validation loss fails to improve for five consecutive epochs. Early stopping is applied with a patience of 15 epochs, retaining the checkpoint with the best validation performance.
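The interaction between the learning rate schedule and early stopping can be traced with a small simulation. This is a simplified stand-in written for illustration: the function name is hypothetical, the counting rule ("reduce after five consecutive non-improving epochs, stop after fifteen") is taken literally from the description above, and the exact epoch at which a library scheduler fires may differ.

```python
def run_schedule(val_losses, lr=1e-3, factor=0.4, lr_patience=5, stop_patience=15):
    """Replay a validation-loss curve and return (best_epoch, lr_per_epoch)."""
    best, best_epoch = float("inf"), -1
    bad_for_lr = bad_for_stop = 0
    lr_history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember the checkpoint epoch and reset both counters.
            best, best_epoch = loss, epoch
            bad_for_lr = bad_for_stop = 0
        else:
            bad_for_lr += 1
            bad_for_stop += 1
            if bad_for_lr == lr_patience:
                lr *= factor      # plateau detected: shrink the step size
                bad_for_lr = 0
        lr_history.append(lr)
        if bad_for_stop == stop_patience:
            break                 # early stopping; best_epoch holds the kept checkpoint
    return best_epoch, lr_history
```

Under these assumptions, a run that never improves after its best epoch receives three learning rate reductions before early stopping ends training.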

The hybrid loss is parameterized as

L_{total} = 5 L_{BPR} + 2 L_{cos}

The weighting ratio of 5:2 was determined through a grid search over the (λ_BPR, λ_cos) combinations {(1, 1), (5, 1), (5, 2), (10, 1), (5, 5)}, evaluated on a small held-out validation split. This ratio reflects the design intent of the loss: the BPR term is the primary driver of ranking accuracy and is therefore assigned the dominant weight, while the cosine alignment term acts as a regularizer that prevents catastrophic semantic drift without suppressing collaborative learning. A cosine weight that is too large was empirically found to over-constrain the embedding space, limiting the model’s ability to adapt representations to the interaction structure. The selected ratio provides the best balance between ranking performance and geometric coherence of the learned embeddings.

3.4. Post-Processing Procedure

The post-processing module is designed to enhance the quality, robustness, and interpretability of the recommendations generated by the base recommender system. It operates as a modular pipeline that refines initial recommendation lists through LLM-driven reasoning, confidence-aware reranking, and systematic rank fusion strategies. Figure 5 shows the high-level procedure for the post-processing steps, which are discussed in the following subsections.

The procedure is applied to both cold-start and warm users to evaluate its effectiveness across different data sparsity regimes. The number of interactions needed to include a user in the cold-start or warm-start scenario depends on the dataset and the distribution of the interactions among users. By combining algorithmic collaborative filtering outputs with agentic LLM workflows, the post-processing stage aims to improve ranking accuracy while maintaining scalability and analytical transparency.

For each identified user, the pipeline retrieves the top 15 recommended items generated by the main model in the previous step. These initial recommendations serve as the input to the post-processing module and include the original ranking. The resulting datasets are persisted separately for cold-start and warm users to ensure reproducibility and facilitate downstream ablation studies.

3.4.1. Reranker Agent

The core component of the post-processing module is an LLM-powered pairwise reranking agent that refines the initial recommendation list using a Hybrid Confidence-Weighted Binary Search Tree (BST) strategy. Instead of relying on pointwise relevance scores, the reranker reformulates the ranking task as a sequence of pairwise preference comparisons between items, conditioned on the user’s profile. For each comparison, the LLM determines which of two candidate items is more likely to be preferred by the user and simultaneously outputs a confidence score reflecting the reliability of this judgment.

To balance ranking quality and computational efficiency, the reranking process employs a hybrid decision rule. When the confidence score associated with a pairwise comparison exceeds a predefined threshold (set to 0.6 in our implementation), the LLM’s preference decision is adopted. When confidence falls below this threshold, the system defers to the original ranking provided by the base recommender, thereby preventing low-confidence LLM judgments from degrading recommendation quality. This mechanism allows the reranker to selectively apply LLM reasoning where it is most reliable, while preserving the strengths of the underlying collaborative filtering model.

The confidence threshold of 0.6 is motivated by the probabilistic interpretation of binary preference decisions. When the LLM assigns a confidence score of 0.5 to a pairwise comparison, it indicates complete uncertainty: the model cannot distinguish which of the two items is preferable. A threshold slightly above 0.5 would be insufficient to provide reliable directional signals, as minor perturbations in prompt phrasing or temperature sampling could reverse the outcome. Setting the threshold at 0.6 requires the LLM to express at least 60% directional confidence before its judgment overrides the base collaborative filtering ranking. Below this value, the difference in predicted preference between the two items is considered too marginal to justify overriding an algorithmically derived ranking, and the original order is preserved. This conservative decision rule ensures that the LLM’s involvement is limited to comparisons where its contextual reasoning provides substantive signal, thereby preventing degradation of ranking quality in ambiguous cases. The threshold of 0.6 represents a principled lower bound on decision reliability rather than an arbitrarily selected constant.

The Binary Search Tree structure is used to efficiently organize pairwise comparisons and exploit the transitive property of preferences. Items are inserted into the tree following a strategic order that begins with mid-ranked items, promoting balanced tree construction and reducing depth. As a result, the total number of required LLM comparisons is reduced from quadratic complexity to approximately O(n log n), where n is the number of recommended items per user. Each node in the tree retains confidence metadata, enabling post hoc analysis of ranking stability. Once the tree is constructed, the final reranked list is obtained via reverse in-order traversal, yielding a complete LLM-informed ranking for each user.
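A minimal sketch of the confidence-gated BST reranking is given below. The `judge` callable is a placeholder for the LLM comparison and is assumed to return the preferred item together with a confidence in [0, 1]; the median-first insertion order, the `>=` comparison against the threshold, and all names are assumptions made for illustration rather than details confirmed by the description.

```python
class Node:
    def __init__(self, item):
        self.item = item
        self.left = None        # subtree of less preferred items
        self.right = None       # subtree of more preferred items
        self.confidences = []   # confidence of every comparison made at this node


def median_first(items):
    """Assumed insertion order: the middle item first, then the middles of each half."""
    order, queue = [], [list(items)]
    while queue:
        chunk = queue.pop(0)
        if chunk:
            mid = len(chunk) // 2
            order.append(chunk[mid])
            queue += [chunk[:mid], chunk[mid + 1:]]
    return order


def rerank(base_ranking, judge, threshold=0.6):
    """base_ranking: items in base-model order (best first)."""
    base_pos = {item: rank for rank, item in enumerate(base_ranking)}
    root = None
    for item in median_first(base_ranking):
        if root is None:
            root = Node(item)
            continue
        node = root
        while True:
            preferred, conf = judge(item, node.item)
            node.confidences.append(conf)
            if conf >= threshold:
                wins = preferred == item                       # trust the LLM decision
            else:
                wins = base_pos[item] < base_pos[node.item]    # defer to the base order
            side = "right" if wins else "left"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(item))
                break
            node = child
    # Reverse in-order traversal: most preferred item first.
    ranked, stack, node = [], [], root
    while stack or node:
        while node:
            stack.append(node)
            node = node.right
        node = stack.pop()
        ranked.append(node.item)
        node = node.left
    return ranked
```

When every judgment falls below the threshold, the function returns the base ranking unchanged, which is the intended fallback behavior of the hybrid rule.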

Given the computational cost associated with LLM inference, the reranking process is executed in parallel across users using a configurable number of worker processes. This design ensures scalability to larger user cohorts while maintaining reasonable processing times. The module incorporates comprehensive error handling and retrying mechanisms to address transient API failures or timeouts, thereby improving robustness and reproducibility in practical deployment scenarios.

3.4.2. Rank Fusion and Explanation Agent

While the pure LLM-based reranked list provides valuable insights into preference-aware ordering, the post-processing pipeline further incorporates a rank fusion stage to systematically combine the original and LLM-derived rankings. This step is motivated by the complementary strengths of collaborative filtering models and LLM-based reasoning. We apply Reciprocal Rank Fusion [45] as an established rank aggregation strategy, using balanced weighting. For the final stage, the post-processing module supports the generation of natural language explanations for top-ranked recommendations. For each user, a predefined number of the highest-ranked items are selected, and an LLM-based explainer agent produces concise, personalized explanations grounded in the user’s profile and item characteristics. These explanations are appended to the recommendation output and are intended to enhance transparency and user trust rather than directly influence ranking metrics. This component demonstrates the extensibility of the post-processing pipeline toward explainable and user-centric recommender systems. Figure 6 shows an example of the explanations that the system can provide.
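The fusion step can be sketched with the usual Reciprocal Rank Fusion score, in which each list contributes w / (k + rank) to an item. The smoothing constant k = 60 is the value commonly used with this method and is an assumption here, since the text does not report it; the function name and the tie-breaking by item identifier are likewise illustrative.

```python
def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists (best first) into a single ranking.

    weights=None applies equal (balanced) weighting to every list.
    k is the RRF smoothing constant (assumed value, not taken from the paper).
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

For example, calling `reciprocal_rank_fusion([gat_ranking, llm_ranking])` with two orderings of the same 15 candidates returns one fused ordering of those candidates.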

3.5. Computational Considerations and Scalability

The proposed pipeline is designed for practical scalability through a combination of lightweight model selection, API-based inference, and process-level parallelism. All LLM-based steps rely on the GPT-5-nano API in our experiments, which eliminates the need for local GPU infrastructure for generative tasks and offloads inference to a managed endpoint with predictably low per-call latency.

Profile generation incurs two LLM API calls per user: one for preference synthesis from positively rated items and one for dislike identification from negatively rated items. Item profile generation requires one API call per item. For the reranking stage, the number of LLM calls per cold-start user is bounded by O(n log n) due to the BST-based comparison structure, where n is the number of candidate items (n = 15 in our implementation), resulting in a maximum of approximately 56 comparisons per user under balanced tree conditions.

Parallelism is achieved by distributing user-level processing across multiple independent worker processes. In our experimental setup, five worker processes were employed, yielding an approximate 5× throughput increase relative to sequential execution. Workers operate independently without shared state, and task queues are managed to respect API rate limits. Graph-based model training is implemented in PyTorch (version 2.7.1 with CUDA 11.8 support), which supports GPU acceleration for matrix operations and attention computation, though the models used in this work are sufficiently compact to train efficiently on CPU hardware as well. All experiments were run on a 13th Gen Intel Core i9 CPU with 32 GB of RAM and a GPU with 8 GB of memory.

To further support reproducibility, the pipeline is implemented in a model-agnostic manner. Any model served through the Ollama framework can be substituted for GPT-5-nano by modifying a single configuration parameter, enabling the community to replicate and extend this work using locally hosted open-source alternatives.

4. Experimental Results

This section presents the experimental evaluation of the proposed recommender system. We first describe the data preprocessing and splitting strategy used to ensure fair and reproducible comparisons, followed by the baseline models, quantitative results across multiple datasets, and an analysis of both the core methodology and the LLM-based post-processing framework.

4.1. Data Preprocessing

To ensure a fair, realistic, and reproducible evaluation of the proposed recommender system pipeline, we applied a unified data preprocessing and splitting strategy across all datasets. The preprocessing pipeline was designed with three primary objectives: (i) preserving the statistical characteristics of the original data, (ii) enabling meaningful evaluation under cold-start and sparse-user scenarios, and (iii) preventing any form of data leakage between training and testing sets.

4.1.1. Interaction Normalization and Aggregation

All datasets were transformed into a consistent user–item interaction format. For datasets that already provided explicit ratings, interactions were used directly without modification. For the music4all dataset, containing implicit feedback with repeated events per user–item pair, interactions were first aggregated so that each user–item pair appeared exactly once. In such cases, the interaction strength was represented by the total number of occurrences (e.g., play counts), which served as a proxy for user preference intensity. When temporal information was available, the most recent timestamp per user–item pair was retained to support time-aware evaluation.
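The aggregation of repeated implicit events can be sketched as follows. The function name and the (user, item, timestamp) event format are assumptions introduced for this example.

```python
from collections import defaultdict


def aggregate_events(events):
    """Collapse repeated events into one row per user-item pair.

    events: iterable of (user_id, item_id, timestamp) tuples (hypothetical format).
    Returns a list of (user_id, item_id, play_count, latest_timestamp).
    """
    counts = defaultdict(int)
    latest = {}
    for user, item, ts in events:
        key = (user, item)
        counts[key] += 1                      # interaction strength = number of events
        latest[key] = max(latest.get(key, ts), ts)  # keep the most recent timestamp
    return [(u, i, counts[(u, i)], latest[(u, i)]) for (u, i) in counts]
```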

4.1.2. Sampling Strategy

For the music and book datasets, which contain many users, we applied stratified user sampling to obtain a fixed-size yet representative subset for experimentation. The sampling procedure followed a two-level stratification strategy that jointly accounts for both user activity and preference behavior.

First, users were grouped into quantile-based bins according to their total number of interactions, capturing the full spectrum of engagement levels from sparse (cold-start) users to highly active users. Second, users were further stratified based on their average interaction strength (e.g., mean rating or average listen count), reflecting differences in user preference intensity. The Cartesian combination of these two dimensions defines a set of user strata that jointly encode behavioral diversity.

Sampling was then performed proportionally from each stratum to preserve the original distribution of user activity and preference patterns. This approach ensures that no single user type is over- or under-represented in the experimental subset, yielding statistically faithful samples with negligible deviation from the original distributions. Movie datasets with a sufficiently balanced and manageable user population did not require subsampling and were retained in full.
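The two-level stratification can be illustrated with a short sketch. The number of bins, the sampling fraction, the seed, the rule that every non-empty stratum contributes at least one user, and all function names are assumptions chosen for the example and are not reported in the text.

```python
import bisect
import random
from collections import defaultdict


def quantile_bin(value, sorted_values, n_bins):
    """Index (0..n_bins-1) of the quantile bin that value falls into."""
    rank = bisect.bisect_left(sorted_values, value)
    return min(n_bins - 1, rank * n_bins // len(sorted_values))


def stratified_user_sample(user_stats, fraction, n_bins=4, seed=42):
    """user_stats: {user_id: (interaction_count, mean_strength)}; 0 < fraction <= 1."""
    counts = sorted(c for c, _ in user_stats.values())
    means = sorted(m for _, m in user_stats.values())
    # A stratum is the pair (activity bin, preference-intensity bin).
    strata = defaultdict(list)
    for user in sorted(user_stats):
        c, m = user_stats[user]
        key = (quantile_bin(c, counts, n_bins), quantile_bin(m, means, n_bins))
        strata[key].append(user)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for key in sorted(strata):
        users = strata[key]
        # Proportional allocation; assumption: keep at least one user per stratum.
        take = max(1, round(fraction * len(users)))
        sample += rng.sample(users, take)
    return sample
```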

4.1.3. Train–Test Splitting Strategy

After preprocessing and sampling, user interactions were split into training and testing sets using an 80/20 ratio. Splitting was performed so that every user and every item appears in both the training and the test set, while no individual user–item pair appears in both, thereby enabling personalized recommendation evaluation for all users. For evaluation, we only consider items with which the user has had no prior interaction.

When temporal information was available, interactions were sorted chronologically per user, with earlier interactions assigned to the training set and later interactions to the test set. This temporal split simulates real-world recommendation scenarios, where models must predict future user behavior from historical data. For datasets lacking reliable timestamps, interactions were randomly partitioned using a fixed random seed to ensure reproducibility.

To strictly prevent data leakage, an explicit constraint was enforced such that no user–item pair appears in both training and testing sets. This condition is naturally satisfied by the preprocessing pipeline, as each user–item pair is unique after aggregation and is assigned exclusively to either the training or test split.
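A minimal sketch of the per-user 80/20 split is shown below. It covers only the user-level part of the procedure (chronological when timestamps exist, seeded random otherwise) and does not enforce the item-level coverage condition; the function name, the row format, and the seed value are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_per_user(interactions, train_ratio=0.8, seed=42):
    """interactions: list of (user_id, item_id, rating, timestamp_or_None),
    with each (user_id, item_id) pair appearing exactly once."""
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)
    rng = random.Random(seed)  # fixed seed for the non-temporal case
    train, test = [], []
    for user in sorted(by_user):
        rows = list(by_user[user])
        if all(r[3] is not None for r in rows):
            rows.sort(key=lambda r: r[3])   # earlier interactions go to training
        else:
            rng.shuffle(rows)               # no reliable timestamps: random split
        cut = max(1, int(len(rows) * train_ratio))  # keep at least one training row
        train += rows[:cut]
        test += rows[cut:]
    return train, test
```

Because every aggregated row is assigned to exactly one side of the cut, the zero-overlap condition on user–item pairs holds by construction.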

The resulting splits were systematically validated to ensure (i) complete user coverage in both training and testing sets, (ii) zero overlap of user–item pairs across splits, (iii) adherence to the target split ratio, and (iv) preservation of key distributional statistics such as interaction counts and average preference strength. Across all datasets, these criteria were satisfied with minimal deviation from the original data distributions.

All random operations, including stratified sampling and random splitting, were performed using fixed seeds to guarantee deterministic behavior. As a result, the preprocessing pipeline produces identical outputs across repeated runs given the same input data and software environment.

Overall, this preprocessing strategy provides a robust and principled foundation for evaluating both classical collaborative filtering models and the proposed LLM-enhanced, agentic recommender system pipeline under realistic and controlled experimental conditions.

4.2. Baseline Recommendation Algorithms

To rigorously evaluate the effectiveness of the proposed LLM-enhanced, agentic recommender system pipeline, we compare it against a diverse set of well-established baseline recommendation algorithms. To ensure a fair comparison, all graph-based models used three hidden layers with 64 hidden units and the same final node embedding size. We evaluated our approach using various metrics and compared it with Neural Graph Collaborative Filtering (NGCF), Neural Collaborative Filtering (NCF) [21], LightGCN, SVD, RLMRec as the closest LLM-based baseline, and an ablated GAT that lacks the LLM-based context-aware embeddings and the cosine similarity term in the loss function.

These baselines represent a broad spectrum of collaborative filtering paradigms, including matrix factorization, graph-based learning, representation learning on interaction graphs, and LLM-based embeddings. All baseline methods, except for the ablation GAT and the RLMRec, are implemented using the LibRecommender (libreco) framework (https://github.com/massquantity/LibRecommender, accessed on 1 December 2025), which provides standardized and reproducible implementations of state-of-the-art recommender models. Employing a unified implementation framework ensures that performance differences can be attributed to modeling choices rather than inconsistencies in experimental setup.

To ensure a fair comparison, all graph-based baseline models (NGCF, LightGCN, and the ablated GAT) were configured with an identical architectural specification: three hidden layers with output dimensions [64, 64, 64], yielding a final node embedding size of 64 consistent with the proposed model. All models were trained using the same learning rate of 0.001, the same optimizer (AdamW), and were subject to the same early stopping criterion based on validation loss. The number of training epochs and the data splits were identical across all methods. For non-graph baselines (NCF, SVD), the embedding dimensionality was set to 64 to match the graph-based methods. This uniform configuration ensures that observed performance differences can be attributed to architectural and representational choices rather than discrepancies in hyperparameter tuning.

RLMRec-Con is implemented as the primary LLM-augmented collaborative filtering baseline, selected for its architectural proximity to the proposed method and its model-agnostic design. Both methods use identical LLM-generated user and item profiles encoded with the same sentence transformer (all-MiniLM-L6-v2, 384-dimensional), ensuring that any observed performance differences are attributable to architectural and training choices rather than differences in semantic input. To further control for graph architecture, RLMRec-Con is applied on top of the same GATv2Conv encoder used in the proposed model—three layers with output dimensions [64, 64, 64], four attention heads per layer, and residual connections with LayerNorm—with the sole distinction that node embeddings are randomly initialized rather than seeded with LLM profile embeddings. A two-layer semantic projector (Linear (384 → 192) → LayerNorm → ReLU → Linear (192 → 64) → LayerNorm) is trained jointly with the GAT to translate frozen LLM semantic vectors into the 64-dimensional collaborative space. An InfoNCE contrastive alignment loss is then applied at each training step, using in-batch negatives sampled from 512 users and 512 items per epoch, a temperature of τ = 0.2, and an alignment weight of λ = 0.1, added to the standard BPR objective. This controlled setup isolates the central architectural question motivating the comparison: whether LLM profiles are more effectively leveraged as a direct initialization of graph node embeddings—as in the proposed method—or as a contrastive alignment target applied throughout training, as in RLMRec.
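As a reading aid, the RLMRec-Con training objective described above can be written schematically as follows. The notation is introduced here for illustration only and is not taken from the released code: e_u is the GAT output embedding of user u, s_u is the frozen 384-dimensional profile vector, g(·) is the semantic projector, and B_U and B_I are the sampled sets of 512 users and 512 items. The use of cosine similarity inside the contrastive term and the unweighted sum of the user and item terms are assumptions.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{BPR}}
  + \lambda \left( \mathcal{L}_{\mathrm{NCE}}^{U} + \mathcal{L}_{\mathrm{NCE}}^{I} \right),
  \qquad \lambda = 0.1

\mathcal{L}_{\mathrm{BPR}} = - \sum_{(u,i,j)} \log \sigma\left( \hat{y}_{ui} - \hat{y}_{uj} \right)

\mathcal{L}_{\mathrm{NCE}}^{U} = - \frac{1}{|B_U|} \sum_{u \in B_U}
  \log \frac{\exp\left( \cos\left(e_u, g(s_u)\right) / \tau \right)}
            {\sum_{u' \in B_U} \exp\left( \cos\left(e_u, g(s_{u'})\right) / \tau \right)},
  \qquad \tau = 0.2, \quad |B_U| = 512
```

The item-side term is defined analogously over B_I. In the BPR term, (u, i, j) ranges over sampled triples of a user, an observed item, and a negative item.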

After training, each baseline model generates ranked recommendation lists for all users in the test set. Candidate items are first scored by the model, previously consumed items are filtered out, and the top-ranked unseen items are retained. This process results in a fixed-length recommendation list per user, which serves as the basis for downstream evaluation and comparison with the proposed method.
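The list-generation step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `top_k_unseen`, the dictionary-of-scores input, and the tie-breaking rule are all assumptions made for the example.

```python
from typing import Dict, List, Set


def top_k_unseen(scores: Dict[str, float], seen: Set[str], k: int) -> List[str]:
    """Return the k highest-scoring items the user has not already consumed."""
    candidates = [(item, score) for item, score in scores.items() if item not in seen]
    # Sort by descending score; ties are broken by item id (an assumption) for determinism.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in candidates[:k]]


# Hypothetical model scores for one user; item "a" was already consumed in training.
example_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
recommendations = top_k_unseen(example_scores, seen={"a"}, k=2)
print(recommendations)
```

Filtering before truncation matters: truncating to K first and then removing seen items would return lists shorter than K for active users.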

4.3. Results and Discussion

We conducted experimental evaluations using all datasets. Due to dataset sparsity, evaluations focused strictly on explicit user–item interactions to accurately measure model effectiveness. All code and resources are publicly accessible via our GitHub repository (https://github.com/danialebrat/when-LLM-meet-RecSys, accessed on 1 January 2026). The comparison of results is illustrated in Figure 7, which shows the results excluding the post-processing step. To enable principled comparisons across methods, we adopt a rigorous evaluation protocol. All reported scores represent the mean ± half-width of 95% bootstrap confidence intervals (2000 resamples), computed over per-user metric values. Statistical significance is assessed using paired permutation tests with Holm–Bonferroni correction across all method–metric–K comparisons within each dataset (α = 0.05). All performance improvements reported for the proposed method are statistically significant under this correction unless explicitly noted. We evaluate at two cutoff values, K ∈ {1, 5}, across four standard ranking metrics: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), Precision, and Recall.
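The four metrics and the bootstrap interval described above can be sketched with the standard library alone. The code assumes binary relevance, normalizes AP@K by min(|relevant|, K) (one common convention; the paper does not state which it uses), and takes a percentile bootstrap. All function names are hypothetical, and the example inputs are invented.

```python
import math
import random
from typing import Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at each rank where a hit occurs
    denom = min(len(relevant), k)  # normalization convention is an assumption
    return total / denom if denom else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def bootstrap_mean_ci(values: Sequence[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Return (mean, half-width) of a percentile bootstrap interval over per-user values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(values) / n, (upper - lower) / 2


# Invented example: one ranked list and a small set of per-user scores.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "z"}
per_user_ndcg = [ndcg_at_k(ranked, relevant, 5), 0.0, 1.0, 0.5]
mean_ndcg, half_width = bootstrap_mean_ci(per_user_ndcg)
```

MAP@K is the mean of `average_precision_at_k` over users; the same per-user values feed the bootstrap, so the interval reflects variation across users rather than across training runs.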

Beyond standard full-dataset evaluation, we conduct two additional scenario analyses to assess method robustness across the user activity spectrum. Cold-start users are defined as the bottom 5th percentile by training-set interaction count, and warm-start users as the top 5th percentile. This percentile-based definition is applied uniformly across all four datasets, yielding reproducible segments without reliance on domain-specific thresholds. Table 1 summarises the resulting user segments. The segmentation reveals meaningful heterogeneity across datasets that is critical context for interpreting downstream results. In GoodBooks, even the least active users retain at least 20 training interactions—reflecting a globally dense dataset—and the warm segment has a comparatively narrow interaction range. In contrast, ML-100K cold-start users have as few as four interactions, representing genuinely sparse profiles for which collaborative signals are nearly absent. The warm segments are equally revealing: ML-1M warm users average over 600 interactions with a maximum above 1800, an order of magnitude greater than cold-start users in the same dataset. These structural differences directly condition which method components are most valuable in each setting.
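The percentile-based segmentation above can be sketched as a simple rank-and-slice over training interaction counts. The function name `segment_users`, the rounding of the segment size, and the tie-breaking by user id are assumptions for this example; the paper does not specify how ties at the percentile boundary are handled.

```python
from typing import Dict, List, Tuple


def segment_users(train_counts: Dict[str, int],
                  fraction: float = 0.05) -> Tuple[List[str], List[str]]:
    """Return (cold, warm): the least and most active `fraction` of users."""
    # Order users by training interaction count; ties broken by user id (an assumption).
    ordered = sorted(train_counts, key=lambda user: (train_counts[user], user))
    n = max(1, round(fraction * len(ordered)))
    return ordered[:n], ordered[-n:]


# Invented example: 100 users with 1..100 training interactions.
counts = {f"u{i:03d}": i + 1 for i in range(100)}
cold_users, warm_users = segment_users(counts)
print(cold_users, warm_users)
```

Because the cut is relative, the absolute interaction floor of the cold segment differs by dataset, which is why the per-dataset ranges in Table 1 matter when reading the results.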

4.3.1. Results of the Main Proposed Methodology

Our proposed methodology consistently outperformed established baselines across Precision, Recall, NDCG, and MAP metrics, as shown in Figure 6. In addition, Table 2 presents MAP@5 and NDCG@5, the two most discriminative metrics at the top of the ranked list, across all four datasets. The proposed method achieves the highest performance on every dataset across all metrics and both K values, and all improvements over all baselines are statistically significant after Holm–Bonferroni correction.
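For readers unfamiliar with the significance procedure referenced here, the following sketch shows a sign-flip paired permutation test followed by the Holm–Bonferroni step-down rule. It is an illustration under assumptions: the two-sided test statistic (absolute sum of paired differences), the number of permutations, and the function names are choices made for this example, not details confirmed by the paper.

```python
import random
from typing import List, Sequence


def paired_permutation_pvalue(a: Sequence[float], b: Sequence[float],
                              n_permutations: int = 2000, seed: int = 0) -> float:
    """Two-sided p-value for paired per-user scores, using random sign flips."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_permutations):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a reject/keep decision per hypothesis, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one hypothesis is kept, all larger p-values are kept too
    return reject


# Invented p-values for four method-metric-K comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
print(decisions)
```

In this example the smallest p-value is compared against 0.05/4, the next against 0.05/3, and the procedure stops at the third (0.03 > 0.05/2), so only two of the four comparisons are declared significant.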

Key performance improvements resulted primarily from two factors: LLM-generated profiles providing semantically rich embeddings and the hybrid loss function incorporating the cosine similarity term that enhanced latent space alignment.

The consistency of these improvements across heterogeneous domains—movies, books, and music—demonstrates that the proposed pipeline is not over-specialized to a particular content type or interaction pattern. Instead, it reflects a generalizable advantage derived from the integration of LLM-based semantic modeling, agentic preprocessing, and collaborative graph learning.

A central factor underlying the observed performance improvements is the use of LLM-generated user and item profiles as semantically meaningful initialization for collaborative filtering. In contrast to baseline models that rely on random or shallow feature initialization, the proposed approach begins with dense representations that already encode high-level thematic, stylistic, and preference-related information. This initialization provides a strong inductive bias that is particularly beneficial in sparse regions of the interaction graph, where collaborative signals alone are insufficient to reliably infer user tastes or item characteristics.

Crucially, user and item profiles are constructed using an identical schema, enabling exact structural alignment. This alignment allows for similarity-based reasoning, embedding comparison, and graph message passing to operate over representations that are directly comparable at both semantic and geometric levels. As a result, collaborative learning is more effective, because the model aggregates information across nodes that already share a coherent representational language. This structural consistency helps explain the strong and stable improvements in Recall@K and MAP@K, which depend on capturing a broader set of relevant items without sacrificing ranking quality.

Because node embeddings are semantically grounded from the outset, attention mechanisms can learn to exploit meaningful relational patterns, such as strong preference signals or consistent co-consumption behaviors, more effectively than in models that start from random embeddings. Residual connections and controlled depth ensure that semantic information is preserved across layers, preventing over-smoothing and representation collapse. This refinement process contributes to the observed improvements in NDCG@K across all K values, indicating better global ranking order rather than isolated top-item gains.

The results of RLMRec-Con provide additional insight into why the proposed initialization strategy is effective. Although RLMRec-Con incorporates the same LLM-generated profiles and the same GAT backbone, it injects semantic information as a contrastive alignment signal during training rather than as a direct initialization of node embeddings. The proposed method, by contrast, ensures that semantic grounding is present from the very first forward pass, giving the attention mechanism a meaningful representational starting point before any collaborative signal is observed. This distinction proves especially consequential in sparse interaction regimes, where the contrastive alignment objective has limited positive and negative pairs to learn from, whereas semantically initialized embeddings already carry sufficient preference structure to guide early attention learning. The consistent advantage of the proposed method over RLMRec-Con across all datasets and all K values suggests that the point of injection—initialization versus training-time alignment—is a critical design choice, and that grounding node embeddings semantically prior to collaborative learning yields stronger and more stable recommendation performance than recovering semantic structure through an auxiliary loss after training has begun.

Furthermore, a notable pattern emerges when comparing performance margins across datasets of varying interaction density. The proposed method’s advantage over all baselines is most pronounced on MovieLens 100K and GoodBooks-10K, both of which are characterized by relatively sparse interaction data, and narrows on MovieLens 1M, which shares a largely overlapping user base with MovieLens 100K but contains substantially more users, items, and interactions. This pattern suggests an inverse relationship between interaction density and the relative benefit of semantic initialization: as the volume of collaborative signal grows, purely interaction-driven models have more data to learn from and naturally close the gap. Conversely, in sparser settings where collaborative signals are insufficient to reliably infer user preferences, beginning with semantically grounded embeddings provides a stronger inductive bias that compensates for data scarcity. This observation reinforces the practical value of the proposed approach in real-world cold-start and low-data scenarios, where the combination of LLM-derived semantic structure and graph-based collaborative learning offers the most meaningful gains over conventional methods. Importantly, this pattern does not imply that the proposed method loses its advantage at scale—it remains the best-performing method even on ML-1M—but rather that the marginal contribution of semantic initialization decreases as interaction data grows. This has a practical corollary: the method’s value proposition is strongest precisely in the deployment scenarios that are most challenging for standard collaborative filtering, namely early-stage platforms, niche domains, and new users.

4.3.2. Cold-Start Users: Robustness Under Sparse Signals

Table 3 presents MAP@1 for the bottom 5% and top 5% of users by training interaction count, enabling a direct assessment of how each method behaves at the extremes of the activity spectrum. The cold-start results expose a fundamental vulnerability in graph-based baselines that is not visible in aggregate evaluation. Both GAT and RLMRec collapse to near-zero performance on Music4All cold-start users, and RLMRec similarly fails on ML-1M cold-start users. The mechanism is the same in both cases: graph neighborhood propagation requires a minimum density of interaction edges to propagate meaningful signals, and contrastive alignment requires a sufficient pool of informative positive–negative pairs to learn from. When a user has at most 13 interactions (the maximum for ML-100K cold users), neither condition is met, and these methods effectively fall back to uninformative representations. The proposed method avoids this failure mode because its node embeddings are semantically grounded before any collaborative signal is processed. A user with four interactions still receives an initial embedding that encodes meaningful preference information extracted from item metadata, and the graph attention mechanism can operate on these representations even when the interaction graph is locally sparse. This is not a trivial advantage: it means the method can produce useful recommendations for new users from their very first interactions, rather than requiring a warm-up period during which recommendations are unreliable.

The one setting where the cold-start advantage narrows is ML-1M, where cold users have a higher absolute interaction floor (11–18 interactions vs. 4–13 for ML-100K). As the definition of ‘cold’ shifts upward, the advantage of semantic initialization diminishes because competing methods have more collaborative signal to work with. This confirms that the cold-start benefit is primarily driven by the depth of data scarcity rather than by any dataset-specific characteristic.

The wide confidence intervals on cold-start results—particularly for graph-based baselines where performance oscillates near zero—reflect genuine uncertainty from small evaluation sets (47–302 users) rather than model instability.

4.3.3. Warm-Start Users: Where Collaborative Filtering Recovers

All methods improve substantially for warm users compared to the full-dataset average, confirming that dense interaction histories benefit every approach. However, the relative rankings shift in ways that offer important nuance beyond what the aggregate results suggest.

On MovieLens datasets, the proposed method achieves its highest absolute scores and maintains its lead over all baselines, demonstrating that LLM-derived semantic ranking continues to provide meaningful signals even when collaborative filtering operates at its strongest. The margins over GAT and LightGCN on ML-100K warm users are substantial, indicating that for highly active users in dense interaction environments, there are many plausible candidates, and the LLM’s capacity to distinguish subtle quality differences among CF candidates—rather than merely selecting obvious ones—is particularly valuable.

GoodBooks presents a notable exception: SVD outperforms the proposed method for warm users, reversing the global ranking. A parallel pattern holds in Music4All, where performance differences among top methods converge to near zero. This finding reveals an important boundary condition. For users with well-populated interaction histories in content-rich domains where item semantics are well-captured by interaction data—books and music, where titles, genres, and collaborative patterns are tightly correlated—standard matrix factorization is already highly competitive, and the additional signal from LLM profiles offers diminishing returns. Put differently, warm users in these domains have effectively communicated their preferences through interactions, leaving limited headroom for content-based augmentation.

A particularly interesting anomaly is the behaviour of GAT and RLMRec on Music4All warm users, both of which return MAP@1 = 0.000 despite showing non-zero performance for cold users in the same dataset. This suggests overfitting to the global interaction distribution: these models learn representations calibrated for the average user but fail to specialize effectively for either extreme of the activity spectrum. The proposed method does not exhibit this failure because its LLM-initialized embeddings provide a stable representational prior that is not overwritten by collaborative training but refined by it.

4.3.4. Results of the Proposed Post-Processing Methodology

Table 4 isolates the contribution of the post-processing pipeline by comparing three variants on the same cold-start and warm-start user subsets: (i) Base CF, the raw GAT output without post-processing; (ii) LLM Reranked, the CF candidate list reordered solely by LLM pairwise preference scores; and (iii) RRF Balanced, Reciprocal Rank Fusion combining CF and LLM rankings with equal weights.

The most consistent finding in Table 4 is that pure LLM reranking degrades performance relative to Base CF across all tested datasets and user segments. This result is counterintuitive if one expects LLM semantic reasoning to add value monotonically, but it makes sense upon closer inspection: the LLM’s pairwise preference judgements reflect general semantic coherence—whether one item is thematically more consistent with a user’s profile than another—rather than personalized relevance grounded in observed behavior. When the LLM reranker displaces the top CF candidates, it substitutes semantically plausible but behaviorally unvalidated items, which reduces precision at top ranks where CF already makes high-confidence recommendations.

The magnitude of the degradation is informative. On ML-100K cold users, LLM reranking drops MAP@1 from Base CF to less than half its value. On Music4All warm users, the LLM reranker produces MAP@1 = 0.000, completely eliminating the signal that Base CF had established. These patterns indicate that the LLM, operating in isolation, is not calibrated to the specific interaction patterns of individual users—it lacks the personalization signal that collaborative filtering is specifically designed to extract.

RRF Balanced consistently matches or outperforms Base CF, demonstrating that rank fusion is an effective and principled integration strategy. The mechanism is intuitive: RRF promotes items that rank highly in both the CF and LLM ranked lists, while demoting items that appear highly in only one. This has the effect of preserving CF’s personalization signal—high-precision CF candidates that the LLM also endorses are promoted further—while selectively incorporating the LLM’s semantic reordering in cases where CF uncertainty is higher.
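The fusion mechanism described above can be made concrete with a short sketch of Reciprocal Rank Fusion. Each item receives a score of w / (k + rank) from every list it appears in, and the scores are summed. The smoothing constant k = 60 is a widely used default and is an assumption here, as are the function name and the toy rankings; the paper specifies only that the CF and LLM lists are weighted equally.

```python
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           weights: Optional[Sequence[float]] = None,
                           k: int = 60) -> List[str]:
    """Fuse several ranked lists into one by summing weighted reciprocal ranks."""
    if weights is None:
        weights = [1.0] * len(rankings)  # equal weights, as in "RRF Balanced"
    fused: Dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, item in enumerate(ranking, start=1):
            fused[item] = fused.get(item, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by item id (an assumption).
    return sorted(fused, key=lambda item: (-fused[item], item))


# Invented rankings of the same four candidates.
cf_ranking = ["a", "b", "c", "d"]
llm_ranking = ["c", "a", "d", "b"]
fused_ranking = reciprocal_rank_fusion([cf_ranking, llm_ranking])
print(fused_ranking)
```

In this toy case item "a" stays first because it is ranked first by CF and second by the LLM, while "c" moves from third to second because the LLM ranks it first. Item "b", which only the CF list favours, drops to third.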

A nuanced pattern emerges between cold and warm users. For cold users, RRF fusion provides gains over Base CF on ML-1M but shows no improvement on GoodBooks—likely because the GoodBooks cold segment is comparatively interaction-rich (minimum 20 training interactions), meaning Base CF is already reasonably well-calibrated, and the LLM’s additional signal cannot reliably improve rankings that are already strong. For warm users, RRF Balanced outperforms Base CF on GoodBooks, ML-100K, and ML-1M, with the largest gains in the most interaction-rich settings. This counterintuitive result—that warm users benefit more from semantic augmentation via fusion than cold users do—can be explained by candidate density: warm users have many plausible relevant items in the top-50 candidate pool, and the LLM’s ability to discriminate among high-quality CF candidates provides real uplift. Cold users, by contrast, have sparse candidate pools where even the CF ranking is uncertain, limiting the LLM’s ability to improve it reliably.

These ablation results carry two actionable conclusions for deployment. First, LLM reranking should never be applied in isolation: the consistent degradation across all segments confirms that semantic reasoning alone cannot substitute for personalized collaborative filtering. Second, RRF-based rank fusion is a low-risk integration strategy: it never performs meaningfully worse than Base CF and reliably improves it whenever the candidate pool is sufficiently dense. Given that LLM inference carries non-trivial computational cost, the results also suggest a selective deployment strategy: post-processing is most cost-effective for warm users in large-scale datasets (where gains are largest and the candidate pool is richest) and adds comparatively little value for cold-start users, where Base CF is already weak and the LLM cannot compensate for absent collaborative signals.

4.4. Evaluation of Recommendation Explanations

Beyond ranking accuracy, the proposed pipeline generates a natural language justification for each recommendation. We evaluate the quality of these explanations along two complementary axes: LLM-as-Judge, which assesses semantic quality through multi-model scoring, and the Semantic Alignment Score (SAS), an embedding-based measure of how closely the explanation is anchored to both the user profile and the item profile. The two protocols are deliberately independent—one relies on discrete judgements from language models, the other on continuous cosine similarity in a shared vector space—so that their agreement or divergence is itself informative.

4.4.1. Evaluation Setup for Explanations

Explanations are generated for the top-3 ranked recommendations of 100 randomly sampled users per dataset (300 explanations per dataset, 1200 in total) using a structured chain-of-thought prompt. The prompt requires the model to first identify the user’s most salient stated preferences, then match them to item attributes, and finally compose a 2–3 sentence personalized justification—with generic phrasing explicitly prohibited. For LLM-as-Judge scoring, three open-source models serve as evaluators: DeepSeek-R1-1.5B, Gemma3-4B, and LLaMA3.2-3B. These are intentionally distinct from the generator model to prevent self-evaluation bias, and they differ in size and training methodology to capture a range of judging tendencies. Each explanation is scored on four dimensions—Relevance, Faithfulness, Personalization, and Coherence—on a 1–5 Likert scale, and the three scores are averaged per explanation. For SAS, all explanation, user, and item texts are encoded with all-MiniLM-L6-v2, and cosine similarity is computed separately against the user profile (SAS_user), the item profile (SAS_item), and their average (SAS_combined).
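The two scoring protocols above reduce to simple arithmetic once the embeddings and judge ratings are available. The sketch below uses invented two-dimensional vectors in place of the 384-dimensional all-MiniLM-L6-v2 embeddings and hypothetical judge names in place of the three evaluator models; the function names are likewise assumptions.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_alignment(explanation_vec: Sequence[float],
                       user_vec: Sequence[float],
                       item_vec: Sequence[float]) -> Dict[str, float]:
    """Cosine similarity of the explanation to the user and item profiles, plus their mean."""
    sas_user = cosine(explanation_vec, user_vec)
    sas_item = cosine(explanation_vec, item_vec)
    return {"SAS_user": sas_user, "SAS_item": sas_item,
            "SAS_combined": (sas_user + sas_item) / 2}


def mean_judge_scores(judge_scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension's 1-5 rating across judges for one explanation."""
    dimensions = next(iter(judge_scores.values())).keys()
    return {dim: sum(scores[dim] for scores in judge_scores.values()) / len(judge_scores)
            for dim in dimensions}


# Toy vectors: the explanation points in the same direction as the item profile.
sas = semantic_alignment([1.0, 0.0], user_vec=[1.0, 1.0], item_vec=[1.0, 0.0])

# Hypothetical ratings from three judges on two of the four dimensions.
ratings = {
    "judge_a": {"Relevance": 3, "Coherence": 4},
    "judge_b": {"Relevance": 5, "Coherence": 5},
    "judge_c": {"Relevance": 4, "Coherence": 4.5},
}
averaged = mean_judge_scores(ratings)
```

In the toy case SAS_item is 1.0 and SAS_user is about 0.71, mirroring in miniature the situation where an explanation describes the item more closely than it reflects the user.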

4.4.2. Results of Explanations Analysis

Table 5 and Table 6 report the LLM-as-Judge and SAS results, respectively. The explanations achieve a strong overall LLM-judge average of 4.19/5.00. Coherence is the highest-scoring dimension (4.44), followed by Relevance (4.28), Faithfulness (4.10), and Personalization (3.94). This ordering is consistent across all four datasets, indicating a systematic rather than dataset-specific pattern. The structured prompt design, which sequences feature matching before prose composition, plausibly accounts for the high and stable Coherence scores: by anchoring generation to a logical narrative skeleton, the output is almost universally fluent and internally consistent.

Personalization is the primary bottleneck, and both evaluation protocols converge on this finding independently. The LLM judge assigns the lowest scores to this dimension, and the SAS results reveal a consistent asymmetry: SAS_item (0.720) is systematically higher than SAS_user (0.579) across all datasets, with a mean gap of approximately 0.14. This means that in the embedding space, the generated text gravitates more toward describing the item than toward reflecting the user’s specific preferences. The reason is structural: item profiles offer concrete, discrete cues (genre, mood, title, attributes), whereas user profiles express more abstract, aggregated tastes that are harder to mirror lexically. Future iterations should consider mechanisms that enforce stronger lexical overlap with user profile content.

Individual judge models differ substantially in their scoring tendencies: DeepSeek-R1-1.5B assigns the most conservative scores (global mean ∼3.1), Gemma3-4B the most generous (∼4.9), and LLaMA3.2-3B falls in between (∼4.3). This spread confirms that any single-model evaluation would misrepresent explanation quality in a predictable direction. The inter-model disagreement is largest for Personalization, which has the highest standard deviation across judges (∼1.1–1.3), and smallest for Coherence (∼0.7–1.0)—indicating that structural text quality is more objectively assessed than the subjective question of whether an explanation is sufficiently tailored to an individual. Averaging across the three judges neutralizes these individual tendencies and yields a more balanced quality signal.

5. Conclusions

This paper proposed a unified recommender system pipeline that integrates LLMs as semantic priors within a GAT-based collaborative filtering model rather than as standalone ranking mechanisms. Through an agentic preprocessing framework, semantically rich and structurally aligned user and item profiles were generated across multiple domains, enabling collaborative learning to refine meaningful representations instead of inferring latent structure from sparse interactions alone.

The experimental results support three principal conclusions. First, the proposed method achieves consistent and statistically significant superiority across all four datasets, all metrics, and both cutoff values—demonstrating generalizability beyond any single domain or interaction structure. Second, the method sustains meaningful recommendation quality for the least active 5% of users in every dataset, including scenarios where graph-based baselines collapse to near-zero, establishing cold-start robustness as a genuine architectural property rather than an artefact of favorable evaluation conditions. Third, the ablation study confirms that the performance gains stem from the fusion of CF and LLM signals: neither component alone is sufficient, but their combination through RRF consistently extends beyond the collaborative baseline, particularly for warm users and large-scale datasets where the candidate pool supports meaningful semantic discrimination.

Across the analyses, two boundary conditions constrain the proposed method’s advantage. In content-rich, interaction-dense settings—specifically warm users on GoodBooks and Music4All—standard matrix factorization recovers competitiveness, suggesting that future work should explore adaptive weighting of collaborative and semantic components as a function of user activity and domain characteristics. In very small cold-start segments, confidence intervals remain wide due to limited evaluation samples, warranting caution in interpreting precise magnitude estimates. Within these boundaries, the results provide strong support for the general principle that initializing graph node embeddings with LLM-derived semantic representations—rather than injecting LLM information as an auxiliary alignment target—is a more effective and robust strategy for bridging semantic and collaborative signals in recommendation.

Finally, future work could focus on more fine-grained ablations that isolate each sub-component and the model architecture, and on extending the approach to cross-domain recommendation.

Author Contributions

Conceptualization, D.E., S.A. and L.R.; methodology, D.E., S.A. and L.R.; software, D.E. and S.A.; validation, D.E., S.A. and L.R.; formal analysis, D.E., S.A. and L.R.; investigation, D.E., S.A. and L.R.; resources, D.E., S.A. and L.R.; data curation, D.E. and S.A.; writing—original draft preparation, D.E. and S.A.; writing—review and editing, D.E., S.A. and L.R.; visualization, D.E., S.A. and L.R.; supervision, L.R.; project administration, D.E., S.A. and L.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research work has been partially supported by the Natural Sciences and Engineering Research Council of Canada, NSERC.

Data Availability Statement

All code, original datasets, preprocessed datasets, and models are available in our GitHub repository at https://github.com/danialebrat/when-LLM-meet-RecSys, accessed on 1 January 2026.

Acknowledgments

During the preparation of this study, the authors used the OpenAI GPT-5-Nano API and Claude Code as a core part of the study (LLMs in recommender systems), including implementation steps and documentation. The authors have reviewed and edited the output and take full responsibility for the content of this publication. All authors have consented to the acknowledgement.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhao, Z.; Fan, W.; Li, J.; Liu, Y.; Mei, X.; Wang, Y.; Wen, Z.; Wang, F.; Zhao, X.; Tang, J.; et al. Recommender Systems in the Era of Large Language Models (LLMs). IEEE Trans. Knowl. Data Eng. 2024, 36, 6889–6907.
Wu, L.; Zheng, Z.; Qiu, Z.; Wang, H.; Gu, H.; Shen, T.; Qin, C.; Zhu, C.; Zhu, H.; Liu, Q.; et al. A survey on large language models for recommendation. World Wide Web 2024, 27, 60.
Lin, J.; Dai, X.; Xi, Y.; Liu, W.; Chen, B.; Zhang, H.; Liu, Y.; Wu, C.; Li, X.; Zhu, C.; et al. How Can Recommender Systems Benefit from Large Language Models: A Survey. ACM Trans. Inf. Syst. 2023, 43, 28.
Rentfrow, P.J.; Gosling, S.D. The do re mi’s of everyday life: The structure and personality correlates of music preferences. J. Pers. Soc. Psychol. 2003, 84, 1236–1256.
Ebrat, D.; Aminian, T.; Ahmadian, S.; Rueda, L. End-to-End Personalization: Unifying Recommender Systems with Large Language Models. arXiv 2025, arXiv:2508.01514.
Ebrat, D.; Ahmadian, S.; Rueda, L. Vectorized Context-Aware Embeddings for GAT-Based Collaborative Filtering. arXiv 2025, arXiv:2510.26461.
Xi, Y.; Liu, W.; Lin, J.; Cai, X.; Zhu, H.; Zhu, J.; Chen, B.; Tang, R.; Zhang, W.; Yu, Y. Towards Open-World Recommendation with Knowledge Augmentation from Large Language Models. In Proceedings of the 18th ACM Conference on Recommender Systems; ACM: New York, NY, USA, 2024; pp. 12–22.
Liu, F.; Liu, Y.; Chen, H.; Cheng, Z.; Nie, L.; Kankanhalli, M. Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models. ACM Trans. Inf. Syst. 2023, 43, 44.
Torbati, G.H.; Tigunova, A.; Yates, A.; Weikum, G. Recommendations by Concise User Profiles from Review Text. arXiv 2023, arXiv:2311.01314.
Moon, J.; Park, S.; Lee, J. LLM-Enhanced Linear Autoencoders for Recommendation; ACM: New York, NY, USA, 2025.
Suzumura, T.; Ikari, H.; Kanezashi, H.; Rahman, M.M.; Hirate, Y. SymCERE: Symmetric Contrastive Learning for Robust Review-Enhanced Recommendation. arXiv 2025, arXiv:2504.02195.
Spillo, G.; Musto, C.; Mannavola, M.; de Gemmis, M.; Lops, P.; Semeraro, G. GAL-KARS: Exploiting LLMs for Graph Augmentation in Knowledge-Aware Recommender Systems. In Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization; ACM: New York, NY, USA, 2025; pp. 73–82.
Shi, K.; Sun, X.; Wang, D.; Fu, Y.; Xu, G.; Li, Q. LLaMA-E: Empowering E-Commerce Authoring with Object-Interleaved Instruction Following; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023.
Li, Y.; Ma, S.; Wang, X.; Huang, S.; Jiang, C.; Zheng, H.-T.; Xie, P.; Huang, F.; Jiang, Y. EcomGPT: Instruction-Tuning Large Language Models with Chain-of-Task Tasks for E-commerce. Proc. AAAI Conf. Artif. Intell. 2024, 38, 18582–18590.
Yang, T.; Ren, B.; Gu, C.; Xu, F.; Ma, B.; Konomi, S. Enhancing Course Recommendation with LLM-Generated Concepts: A Unified Framework for Side Information Integration. Big Data Cogn. Comput. 2025, 9, 311.
Chen, J.; Ma, L.; Li, X.; Thakurdesai, N.; Xu, J.; Cho, J.H.; Nag, K.; Korpeoglu, E.; Kumar, S.; Achan, K. Knowledge Graph Completion Models are Few-shot Learners: An Empirical Study of Relation Labeling in E-commerce with LLMs. arXiv 2023, arXiv:2305.09858.
Chu, Z.; Wang, Y.; Cui, Q.; Li, L.; Chen, W.; Qin, Z.; Ren, K. LLM-Guided Multi-View Hypergraph Learning for Human-Centric Explainable Recommendation. arXiv 2024, arXiv:2401.08217.
Du, Y.; Luo, D.; Yan, R.; Wang, X.; Liu, H.; Zhu, H.; Song, Y.; Zhang, J. Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks. Proc. AAAI Conf. Artif. Intell. 2024, 38, 8363–8371.
Christakopoulou, K.; Lalama, A.; Adams, C.; Qu, I.; Amir, Y.; Chucri, S.; Vollucci, P.; Soldo, F.; Bseiso, D.; Scodel, S.; et al. Large Language Models for User Interest Journeys. arXiv 2023, arXiv:2305.15498.
Zheng, Z.; Zhu, Y.; Liu, H.; Ju, M.; Zhao, T.; Shah, N.; Li, J. MI4Rec: Pretrained Language Model based Cold-Start Recommendation with Meta-Item Embeddings. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management; ACM: New York, NY, USA, 2025; pp. 4455–4465.
He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.-S. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW 2017); Barrett, R., Cummings, R., Agichtein, E., Gabrilovich, E., Eds.; International World Wide Web Conferences Steering Committee: Perth, Australia, 2017; pp. 173–182.
Cheng, M.; Liu, Q.; Zhang, W.; Liu, Z.; Zhao, H.; Chen, E. A general tail item representation enhancement framework for sequential recommendation. Front. Comput. Sci. 2024, 18, 186333.
Wang, H.; Zhang, F.; Wang, J.; Zhao, M.; Li, W.; Xie, X.; Guo, M. RippleNet. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management; ACM: New York, NY, USA, 2018; pp. 417–426.
Su, X.; Khoshgoftaar, T.M. A Survey of Collaborative Filtering Techniques. Adv. Artif. Intell. 2009, 2009, 421425.
Lin, J.; Chen, B.; Wang, H.; Xi, Y.; Qu, Y.; Dai, X.; Zhang, K.; Tang, R.; Yu, Y.; Zhang, W. ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction. In Proceedings of the ACM Web Conference 2024; ACM: New York, NY, USA, 2024; pp. 3319–3330.
Yang, S.; Wang, C.; Liu, Y.; Xu, K.; Ma, W.; Liu, Y.; Zhang, M.; Zeng, H.; Feng, J.; Deng, C. Collaborative Word-based Pre-trained Item Representation for Transferable Recommendation. In Proceedings of the 2023 IEEE International Conference on Data Mining (ICDM); IEEE: New York, NY, USA, 2023; pp. 728–737. [Google Scholar] [CrossRef]
Bao, K.; Zhang, J.; Zhang, Y.; Wang, W.; Feng, F.; He, X. TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems; ACM: New York, NY, USA, 2023; pp. 1007–1014. [Google Scholar] [CrossRef]
Di Palma, D.; Biancofiore, G.M.; Anelli, V.W.; Narducci, F.; Di Noia, T.; Di Sciascio, E. Evaluating ChatGPT as a Recommender System: A Rigorous Approach. arXiv 2024, arXiv:2309.03613. [Google Scholar]
Yue, Z.; Rabhi, S.; de Wang, D.S.G.; Oldridge, E. LlamaRec: Two-Stage Recommendation using Large Language Models for Ranking. arXiv 2023, arXiv:2311.02089. [Google Scholar]
Wang, Y.; Liu, Z.; Zhang, J.; Yao, W.; Heinecke, S.; Yu, P.S. DRDT: Dynamic Reflection with Divergent Thinking for LLM-based Sequential Recommendation. arXiv 2023, arXiv:2312.11336. [Google Scholar] [CrossRef]
Hou, Y.; Zhang, J.; Lin, Z.; Lu, H.; Xie, R.; McAuley, J.; Zhao, W.X. Large Language Models are Zero-Shot Rankers for Recommender Systems. In Proceedings of the European Conference on Information Retrieval; Springer Nature: Cham, Switzerland, 2024. [Google Scholar]
Golub, G.H. Least squares, singular values and matrix approximations. Appl. Math. 1968, 13, 44–51. [Google Scholar] [CrossRef]
Gori, M.; Pucci, A. ItemRank: A random-walk based scoring algorithm for recommender engines. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07); Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2007; pp. 2766–2771. [Google Scholar]
He, X.; Gao, M.; Kan, M.-Y.; Wang, D. BiRank: Towards Ranking on Bipartite Graphs. IEEE Trans. Knowl. Data Eng. 2017, 29, 57–71. [Google Scholar] [CrossRef]
Yang, Q.; Zhang, S.; Dong, S.; Xu, L.; Dong, W.; Li, X.; Sun, P.; Jiang, F.; Zhang, X.; Luo, G. Graph Convolutional Network with Neural Inductive Matrix Completion for Predicting Disease-Related LncRNA Genes. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE: New York, NY, USA, 2023; pp. 3595–3601. [Google Scholar] [CrossRef]
Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W.L.; Leskovec, J. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; ACM: New York, NY, USA, 2018; pp. 974–983. [Google Scholar] [CrossRef]
He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. LightGCN. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM: New York, NY, USA, 2020; pp. 639–648. [Google Scholar] [CrossRef]
Chernov, A.; Wahab, H.; Novitskij, O. Leveraging Language Semantics for Collaborative Filtering with TextGCN and TextGCN-MLP: Zero-Shot vs In-Domain Performance. arXiv 2025, arXiv:2510.12461. [Google Scholar]
Elahi, E.; Anwar, S.; Al-Kfairy, M.; Rodrigues, J.J.; Ngueilbaye, A.; Halim, Z.; Waqas, M. Graph attention-based neural collaborative filtering for item-specific recommendation system using knowledge graph. Expert Syst. Appl. 2025, 266, 126133. [Google Scholar] [CrossRef]
Zhang, S.; Li, Z.; Wang, X.; Chen, Z.; Guo, W. TKGAT: Temporal Knowledge Graph Representation Learning Using Attention Network; Springer Nature: Cham, Switzerland, 2023; pp. 46–61. [Google Scholar] [CrossRef]
Ren, X.; Wei, W.; Xia, L.; Su, L.; Cheng, S.; Wang, J.; Yin, D.; Huang, C. Representation Learning with Large Language Models for Recommendation. In Proceedings of the ACM Web Conference 2024; ACM: New York, NY, USA, 2024; pp. 3464–3475. [Google Scholar] [CrossRef]
Harper, F.M.; Konstan, J.A. The MovieLens Datasets: History and Context. 2015. Available online: http://grouplens.org/datasets/movielens (accessed on 1 January 2026).
Zajac, Z. Goodbooks-10k: A New Dataset for Book Recommendations; FastML: London, UK, 2017. [Google Scholar]
Santana, I.A.P.; Pinhelli, F.; Donini, J.; Catharin, L.; Mangolin, R.B.; Feltrim, V.D.; Domingues, M.A. Music4All: A New Music Database and Its Applications. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 1–3 July 2020; pp. 399–404. [Google Scholar]
Cormack, G.V.; Clarke, C.L.A.; Buettcher, S. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’09); Association for Computing Machinery: New York, NY, USA, 2009; pp. 758–759. [Google Scholar] [CrossRef]

Figure 1. High-level methodological framework. First, we generate profiles, convert them to vector embeddings, and build the graph from the interactions. The embeddings initialize the node features, which remain learnable parameters and are trained with a customized loss function. Finally, LLM agents use the generated profiles and the initial recommendations to rerank the output and provide explanations.

Figure 2. User and item profile generation procedure. Item metadata is enriched using additional sources (e.g., TMDB API) and LLM agents, and item profiles are created from the enriched metadata. Then, using these profiles and each user’s top-rated items, we extract the user’s preferences. The full user profile is generated by considering the user’s 10 lowest-rated items with respect to these established preferences. In the figure, blue blocks are the steps implemented via LLM agents, and green blocks are the final states of user profiles and item profiles.

Figure 3. Sample user and item profile structure following the profile alignment strategy.

Figure 4. Model training methodology: after the bipartite graph is created, the profile-based embeddings are used as the initial node embeddings, kept as learnable parameters, and optimized with a customized loss function.
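The abstract describes the customized loss as a hybrid of Bayesian Personalized Ranking (BPR) and cosine-similarity regularization. The following is a minimal sketch of one such combination, not the authors' exact formulation. It assumes dot-product scoring and a regularizer that penalizes a learned user embedding for drifting away from its profile-based initial embedding. The names `hybrid_loss`, `user_init`, and the weight `lam` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def bpr_loss(user, pos_item, neg_item):
    """Standard BPR term: -log(sigmoid(score_pos - score_neg))."""
    diff = dot(user, pos_item) - dot(user, neg_item)
    # Numerically stable softplus(-diff), which equals -log(sigmoid(diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def hybrid_loss(user, pos_item, neg_item, user_init, lam=0.1):
    """BPR plus an assumed cosine regularizer (illustrative form only)."""
    reg = 1.0 - cosine(user, user_init)  # 0 when the direction is unchanged
    return bpr_loss(user, pos_item, neg_item) + lam * reg

u = [0.6, 0.8]
loss = hybrid_loss(u, pos_item=[1.0, 1.0], neg_item=[0.0, 0.5], user_init=u)
```

When the learned embedding still points in the same direction as its initial profile embedding, the regularizer vanishes and only the ranking term remains.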

Figure 5. Post-processing methodology: (i) The reranker agent builds a Hybrid Confidence-Weighted Binary Search Tree by comparing pairs of items from each user’s top 15 recommendations with respect to the user’s profile. (ii) The reranked recommendations are merged with the original ranking. (iii) The explainer agent provides a justification for each recommended item upon request. Blue blocks are the steps implemented using LLM agents.
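To illustrate step (i), the sketch below reranks candidates by inserting them into a plain binary search tree driven by pairwise comparisons. It is a simplified stand-in: the confidence weighting of the proposed hybrid tree is not reproduced, and the `prefers` callback is a hypothetical placeholder for the LLM agent's pairwise judgment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    item: str
    left: Optional["Node"] = None   # items preferred over this one
    right: Optional["Node"] = None  # items ranked below this one

def insert(root, item, prefers):
    """Walk down the tree using one pairwise comparison per level."""
    if root is None:
        return Node(item)
    if prefers(item, root.item):
        root.left = insert(root.left, item, prefers)
    else:
        root.right = insert(root.right, item, prefers)
    return root

def in_order(root) -> List[str]:
    if root is None:
        return []
    return in_order(root.left) + [root.item] + in_order(root.right)

def rerank(candidates: List[str], prefers: Callable[[str, str], bool]) -> List[str]:
    root = None
    for item in candidates:
        root = insert(root, item, prefers)
    return in_order(root)

# Toy comparator (assumption): prefer the item with the higher toy score.
toy_scores = {"A": 0.2, "B": 0.9, "C": 0.5}
ranking = rerank(["A", "B", "C"], lambda x, y: toy_scores[x] > toy_scores[y])
print(ranking)  # ['B', 'C', 'A']
```

The appeal of a tree here is that each new item is compared only against the nodes on its insertion path rather than against every other candidate, which limits the number of pairwise calls.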

Figure 6. Example of an explanation for a recommended item with respect to user and item profiles.

Figure 7. Results comparison between baselines and the proposed method before post-processing.

Table 1. User segment statistics: minimum, maximum, mean, and median number of training-set interactions per user in the cold (bottom 5%) and warm (top 5%) segments.

Dataset	Segment	N Users	Min	Max	Mean	Median	Overall Mean
Goodbook	Cold (5%)	50	20	56	44.24	48.0	89.44
Goodbook	Warm (5%)	50	128	146	134.68	133.5	89.44
Music4all	Cold (5%)	50	6	48	31.24	32.0	147.27
Music4all	Warm (5%)	50	250	338	278.30	270.5	147.27
ML-100k	Cold (5%)	47	4	13	10.72	11.0	84.84
ML-100k	Warm (5%)	47	260	685	335.53	315.0	84.84
ML-1M	Cold (5%)	302	11	18	16.48	17.0	132.48
ML-1M	Warm (5%)	302	449	1867	635.05	590.0	132.48
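The segments in Table 1 are the bottom and top 5% of users by training interaction count. A minimal sketch of such a split follows. The function name `split_segments`, the floor rounding, and the tie-breaking by user id are assumptions, although floor rounding is consistent with the 47 users reported for ML-100K, which has 943 users.

```python
def split_segments(train_counts, percent=5):
    """train_counts maps user id -> number of training interactions."""
    n = len(train_counts) * percent // 100  # floor of the requested share
    # Sort by interaction count; break ties by user id for determinism.
    ordered = sorted(train_counts, key=lambda u: (train_counts[u], u))
    cold = ordered[:n]
    warm = ordered[len(ordered) - n:]
    return cold, warm

# Toy data: 943 users whose interaction count equals their id.
counts = {user: user for user in range(943)}
cold, warm = split_segments(counts)
print(len(cold), len(warm))  # 47 47
```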

Table 2. Overall results: MAP@5 and NDCG@5 across all users and all datasets. The highest performance among all baselines is shown in bold, and the last row shows the performance of our methodology along with the percentage of improvement over the best baseline, indicated in green.

Method		Goodbook MAP@5	Goodbook NDCG@5	Music4All MAP@5	Music4All NDCG@5	ML-100K MAP@5	ML-100K NDCG@5	ML-1M MAP@5	ML-1M NDCG@5
SVD		0.0848 ± 0.0088	0.1575 ± 0.0116	0.0447 ± 0.0065	0.0544 ± 0.0074	0.2240 ± 0.0209	0.3727 ± 0.0239	0.0812 ± 0.0035	0.1629 ± 0.0045
NGCF		0.0754 ± 0.0077	0.1453 ± 0.0110	0.0438 ± 0.0074	0.0494 ± 0.0075	0.2669 ± 0.0249	0.4110 ± 0.0245	0.2076 ± 0.0060	0.3208 ± 0.0069
LightGCN		0.0870 ± 0.0087	0.1619 ± 0.0122	0.1019 ± 0.0135	0.1037 ± 0.0123	0.1850 ± 0.0196	0.3134 ± 0.0219	0.1751 ± 0.0058	0.2683 ± 0.0066
NCF		0.0349 ± 0.0050	0.0736 ± 0.0079	0.0100 ± 0.0028	0.0146 ± 0.0034	0.1972 ± 0.0219	0.3319 ± 0.0226	0.1063 ± 0.0044	0.1962 ± 0.0053
GAT		0.0435 ± 0.0059	0.0878 ± 0.0086	0.0188 ± 0.0042	0.0248 ± 0.0052	0.2165 ± 0.0244	0.3242 ± 0.0259	0.1766 ± 0.0062	0.2605 ± 0.0070
RLMRec		0.0491 ± 0.0068	0.0946 ± 0.0095	0.0179 ± 0.0046	0.0198 ± 0.0043	0.1661 ± 0.0229	0.2585 ± 0.0251	0.1569 ± 0.0059	0.2357 ± 0.0067
Our Method		0.1145 ± 0.0100	0.1972 ± 0.0138	0.1261 ± 0.0171	0.1287 ± 0.0151	0.3448 ± 0.0286	0.4713 ± 0.0261	0.2429 ± 0.0070	0.3410 ± 0.0073
Improvement over best baseline		+31.6%	+21.8%	+23.7%	+24.1%	+29.1%	+14.6%	+17%	+6.2%
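The improvement row can be checked with a short calculation. The sketch below recomputes the relative gain for Goodbook MAP@5 from the rounded entries in Table 2. The helper name `relative_improvement` is illustrative.

```python
def relative_improvement(ours, best_baseline):
    """Percentage gain of our score over the best baseline score."""
    return 100.0 * (ours - best_baseline) / best_baseline

# Goodbook MAP@5 values copied from Table 2.
baselines = {
    "SVD": 0.0848, "NGCF": 0.0754, "LightGCN": 0.0870,
    "NCF": 0.0349, "GAT": 0.0435, "RLMRec": 0.0491,
}
ours = 0.1145

best_name = max(baselines, key=baselines.get)
gain = relative_improvement(ours, baselines[best_name])
print(f"{best_name}: +{gain:.1f}%")  # LightGCN: +31.6%
```

Recomputing from the rounded table entries can differ from the reported percentage in the last digit for some columns (for example, ML-100K MAP@5), presumably because the reported values were derived from unrounded scores.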

Table 3. Cold-start and warm-start MAP@1 and NDCG@1 results. Values represent mean scores for the bottom 5% and top 5% of users by training interaction count. The highest performance is shown in bold.

Scenario	Method		Goodbook MAP@1	Goodbook NDCG@1	Music4All MAP@1	Music4All NDCG@1	ML-100K MAP@1	ML-100K NDCG@1	ML-1M MAP@1	ML-1M NDCG@1
Cold	SVD		0.1400 ± 0.0900	0.1640 ± 0.0900	0.1200 ± 0.0900	0.0932 ± 0.0729	0.1915 ± 0.1170	0.2404 ± 0.1032	0.0861 ± 0.0315	0.0925 ± 0.0298
	NGCF		0.0800 ± 0.0700	0.0800 ± 0.0661	0.1200 ± 0.0900	0.0603 ± 0.0557	0.2553 ± 0.1170	0.3032 ± 0.1208	0.1093 ± 0.0348	0.1183 ± 0.0343
	LightGCN		0.1400 ± 0.1000	0.1600 ± 0.0940	0.0800 ± 0.0700	0.0534 ± 0.0567	0.1702 ± 0.1064	0.2660 ± 0.1032	0.1358 ± 0.0381	0.1442 ± 0.0391
	NCF		0.1200 ± 0.0900	0.1240 ± 0.0800	0.0400 ± 0.0500	0.0151 ± 0.0216	0.1277 ± 0.0957	0.1617 ± 0.0957	0.1060 ± 0.0348	0.1123 ± 0.0349
	GAT		0.1000 ± 0.0800	0.1000 ± 0.0900	0.0000 ± 0.0000	0.0003 ± 0.0004	0.1064 ± 0.0957	0.1223 ± 0.0809	0.0662 ± 0.0281	0.0603 ± 0.0255
	RLMRec		0.0200 ± 0.0300	0.0200 ± 0.0300	0.0200 ± 0.0300	0.0003 ± 0.0004	0.0000 ± 0.0000	0.0170 ± 0.0213	0.0298 ± 0.0182	0.0278 ± 0.0172
	Our Method		0.2600 ± 0.1300	0.2760 ± 0.1140	0.1800 ± 0.1100	0.1145 ± 0.0783	0.2766 ± 0.1277	0.3553 ± 0.1197	0.1457 ± 0.0414	0.1498 ± 0.0382
Warm	SVD		0.2600 ± 0.1200	0.2800 ± 0.1100	0.0200 ± 0.0300	0.0250 ± 0.0325	0.4286 ± 0.2500	0.5286 ± 0.1857	0.2583 ± 0.0480	0.3205 ± 0.0427
	NGCF		0.1800 ± 0.1000	0.2000 ± 0.1000	0.0800 ± 0.0700	0.0565 ± 0.0513	0.6429 ± 0.2500	0.6571 ± 0.1714	0.6093 ± 0.0547	0.6589 ± 0.0384
	LightGCN		0.2200 ± 0.1200	0.2240 ± 0.1060	0.0400 ± 0.0500	0.0500 ± 0.0465	0.7143 ± 0.2500	0.6429 ± 0.2071	0.6788 ± 0.0513	0.7139 ± 0.0371
	NCF		0.0400 ± 0.0500	0.0400 ± 0.0500	0.0000 ± 0.0000	0.0000 ± 0.0000	0.1429 ± 0.1786	0.4000 ± 0.1643	0.3510 ± 0.0530	0.4338 ± 0.0440
	GAT		0.1200 ± 0.0900	0.1480 ± 0.0860	0.0000 ± 0.0000	0.0000 ± 0.0000	0.7857 ± 0.2143	0.7429 ± 0.1571	0.7483 ± 0.0480	0.7583 ± 0.0374
	RLMRec		0.1600 ± 0.1000	0.1600 ± 0.0920	0.0000 ± 0.0000	0.0000 ± 0.0000	0.5000 ± 0.2857	0.5000 ± 0.2286	0.7185 ± 0.0514	0.7364 ± 0.0404
	Our Method		0.2200 ± 0.1100	0.2240 ± 0.1061	0.0400 ± 0.0500	0.0371 ± 0.0450	0.9286 ± 0.1071	0.8571 ± 0.1286	0.8344 ± 0.0414	0.8285 ± 0.0318
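For readers who want to reproduce metrics of this kind, the sketch below gives one common definition of average precision at K and NDCG at K for a single user. MAP@K is then the mean of AP@K over users. The normalization of AP by min(K, number of relevant items) and the linear gain in NDCG are assumptions, since several variants exist and this section does not specify which one was used.

```python
import math

def average_precision_at_k(ranked, relevant, k):
    """AP@K with binary relevance; `relevant` is a set of item ids."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # precision at each relevant position
    denom = min(k, len(relevant))
    return score / denom if denom else 0.0

def ndcg_at_k(ranked, gains, k):
    """NDCG@K with graded gains; `gains` maps item id -> relevance."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    actual = dcg([gains.get(item, 0.0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

ap = average_precision_at_k(["a", "b", "c"], {"a", "c"}, k=3)
ndcg = ndcg_at_k(["c", "a"], {"a": 5.0, "c": 3.0}, k=1)
```

With graded gains, the two metrics need not coincide at K = 1: in the second call the top item is relevant, so AP@1 would be 1.0, while NDCG@1 is 0.6 because a higher-rated item exists.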

Table 4. Post-processing ablation: MAP@1 and NDCG@1 for cold-start and warm-start user segments across three variants. The highest performance is shown in bold.

Scenario	Variant		Goodbook MAP@1	Goodbook NDCG@1	Music4All MAP@1	Music4All NDCG@1	ML-100K MAP@1	ML-100K NDCG@1	ML-1M MAP@1	ML-1M NDCG@1
Cold	Base CF		0.2200 ± 0.1100	0.2280 ± 0.1060	0.4000 ± 0.4000	0.2186 ± 0.3093	0.3684 ± 0.1579	0.4079 ± 0.1375	0.1800 ± 0.1100	0.1920 ± 0.1020
	LLM Reranked		0.1600 ± 0.1100	0.1840 ± 0.0981	0.0000 ± 0.0000	0.0000 ± 0.0000	0.1316 ± 0.1184	0.1671 ± 0.0993	0.0800 ± 0.0700	0.1040 ± 0.0781
	RRF Balanced		0.2200 ± 0.1000	0.2520 ± 0.1080	0.4000 ± 0.4000	0.0699 ± 0.0862	0.3158 ± 0.1319	0.3461 ± 0.1316	0.2200 ± 0.1100	0.2000 ± 0.1041
Warm	Base CF		0.2600 ± 0.1200	0.2560 ± 0.1180	0.2000 ± 0.3000	0.0400 ± 0.0600	0.8182 ± 0.2273	0.8182 ± 0.1727	0.6200 ± 0.1300	0.6700 ± 0.1080
	LLM Reranked		0.2400 ± 0.1100	0.2360 ± 0.1180	0.0000 ± 0.0000	0.0000 ± 0.0000	0.8182 ± 0.2273	0.8727 ± 0.0909	0.4800 ± 0.1400	0.4640 ± 0.1240
	RRF Balanced		0.2800 ± 0.1300	0.2800 ± 0.1160	0.2000 ± 0.3000	0.0400 ± 0.0600	0.9091 ± 0.1364	0.9091 ± 0.0727	0.5600 ± 0.1400	0.6140 ± 0.1200
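The "RRF Balanced" variant merges the base collaborative-filtering ranking with the LLM-reranked list. A minimal sketch of reciprocal rank fusion (Cormack et al.) is shown below, where each item scores the sum of 1/(k + rank) over the input rankings. The equal weights and the constant k = 60 (the value suggested in the original RRF paper) are assumptions about what "balanced" means here, and the item ids are toy data.

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse ranked lists; higher fused score means a better final rank."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for weight, ranking in zip(weights, rankings):
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / (k + rank)
    # Sort by descending score; break ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

base_cf = ["A", "B", "C", "D"]       # toy output of the graph model
llm_reranked = ["C", "A", "D", "B"]  # toy output of the reranker agent
fused = reciprocal_rank_fusion([base_cf, llm_reranked])
print(fused)  # ['A', 'C', 'B', 'D']
```

Because only ranks enter the score, the fusion does not require the two sources to produce comparable score scales.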

Table 5. LLM-as-Judge evaluation of recommendation explanations (1–5 scale, mean ± SD, averaged over three judge models).

Dataset	Relevance	Faithfulness	Personalization	Coherence	Average
Goodbook	4.30 ± 0.52	4.17 ± 0.53	3.91 ± 0.62	4.37 ± 0.51	4.19 ± 0.55
Music4all	4.17 ± 0.54	4.00 ± 0.60	3.86 ± 0.68	4.42 ± 0.50	4.11 ± 0.58
ML-100k	4.31 ± 0.47	4.08 ± 0.58	3.92 ± 0.60	4.46 ± 0.49	4.19 ± 0.54
ML-1M	4.33 ± 0.50	4.15 ± 0.58	4.06 ± 0.65	4.53 ± 0.42	4.27 ± 0.55
Overall	4.28 ± 0.51	4.10 ± 0.58	3.94 ± 0.64	4.44 ± 0.48	4.19 ± 0.56

Table 6. Semantic Alignment Score (SAS) of recommendation explanations (cosine similarity, mean ± SD).

Dataset	SAS_User	SAS_Item	SAS_Combined
Goodbook	0.538 ± 0.068	0.705 ± 0.067	0.621 ± 0.043
Music4all	0.625 ± 0.068	0.749 ± 0.081	0.687 ± 0.047
ML-100k	0.563 ± 0.066	0.707 ± 0.085	0.635 ± 0.045
ML-1M	0.592 ± 0.071	0.720 ± 0.071	0.656 ± 0.040
Overall	0.579 ± 0.075	0.720 ± 0.078	0.650 ± 0.051
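As a sketch of how such scores can be computed, the code below takes an explanation embedding and measures its cosine similarity to the user-profile and item-profile embeddings. The embedding model is not specified in this section, so the vectors here are toy inputs. Taking the combined score as the mean of the two similarities is an assumption, although it is consistent with the reported values (for Goodbook, (0.538 + 0.705) / 2 ≈ 0.621).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_alignment(explanation_vec, user_vec, item_vec):
    """Return SAS_User, SAS_Item, and their mean as SAS_Combined."""
    sas_user = cosine(explanation_vec, user_vec)
    sas_item = cosine(explanation_vec, item_vec)
    return {
        "user": sas_user,
        "item": sas_item,
        "combined": (sas_user + sas_item) / 2.0,
    }

# Toy vectors (assumption): real inputs would be text embeddings.
scores = semantic_alignment([1.0, 0.0], user_vec=[1.0, 0.0], item_vec=[0.0, 1.0])
```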

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ebrat, D.; Ahmadian, S.; Rueda, L. End-to-End Personalization via Unifying LLM Agents and Graph Attention Networks for Entertainment Recommendation. Information 2026, 17, 344. https://doi.org/10.3390/info17040344

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
