LLM4Rec: A Comprehensive Survey on the Integration of Large Language Models in Recommender Systems—Approaches, Applications and Challenges

Shehmir, Sarama; Kashef, Rasha

doi:10.3390/fi17060252


by Sarama Shehmir and Rasha Kashef *

Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada

* Author to whom correspondence should be addressed.

Future Internet 2025, 17(6), 252; https://doi.org/10.3390/fi17060252

Submission received: 4 March 2025 / Revised: 27 May 2025 / Accepted: 30 May 2025 / Published: 4 June 2025

(This article belongs to the Special Issue Deep Learning in Recommender Systems)


Abstract

The synthesis of large language models (LLMs) and recommender systems has been a game-changer in tailored content delivery, with applications ranging from e-commerce, social media, and education to healthcare. This survey covers the usage of LLMs for content recommendations (LLM4Rec). LLM4Rec has opened up a whole set of challenges in terms of scale, real-time processing, and data privacy, all of which we touch upon along with potential future directions for research in areas such as multimodal recommendations and reinforcement learning for long-term engagement. This survey consolidates existing developments and outlines possible future developments, serving as a point of reference for researchers and practitioners developing the next generation of LLM-based recommendation systems.

Keywords:

large language models (LLMs); recommendation systems; LLM4Rec; generative models; discriminative models; Transformer architecture; fine-tuning; prompt tuning; recommender evaluation

1. Introduction

Recommendation systems [1,2] have become integral to user interaction across various domains, such as e-commerce, social media, healthcare, and education. These systems aim to deliver content tailored to user preferences, enhancing engagement and satisfaction. Traditionally, recommendation systems have relied on collaborative filtering, content-based filtering, or hybrid approaches. While effective, these techniques face challenges such as data sparsity, cold-start problems, and limited adaptability to rapidly changing user preferences [1,2]. The emergence of large language models (LLMs) has transformed natural language processing (NLP) by enabling sophisticated understanding, generation, and contextual reasoning. LLMs offer capabilities that extend beyond conventional systems by processing unstructured data like reviews, social media posts, and user feedback [3]. By integrating LLMs with recommendation systems (LLM4Rec), contextual awareness and generative capabilities can significantly enhance personalization and relevance. These advancements allow recommendation systems to implement the following:

Enhanced Tailoring and Context Comprehension: Interpreting subtle user signals and preferences through rich language inputs, improving the contextual relevance of recommendations [4].
Exploitation of Heterogeneous Unstructured Data: Leveraging sources such as reviews, social media content, and user comments to improve recommendation quality [5].
Generative Abilities for Novel Suggestions: Generating fresh and exploratory recommendations, particularly in media and entertainment, by suggesting items beyond the categories requested by the users.
Mitigation of Cold-Start and Data Scarcity Problems: Employing descriptive text data and contextual information to generate recommendations for new users or items with little historical data [6].
Real-Time Adaptation to User Preferences: In fast-paced domains like news, fashion, and social media, user preferences can shift rapidly. MixRec addresses this by using a dynamic mixture-of-experts framework that continuously adjusts to users’ evolving behaviors, enabling the system to deliver timely and contextually relevant recommendations [7].

LLM4Rec represents a paradigm shift in recommendation systems, enabling more sophisticated, context-aware, and flexible recommendations. This survey provides a comprehensive overview of LLM4Rec’s current state, discussing its architecture, applications, and challenges while offering insights into future advancements.

The primary objectives of this survey are as follows:

To explore how LLM architectures have been adapted for recommender systems;
To compare their application across domains such as e-commerce, healthcare, and education;
To identify challenges and gaps in current LLM4Rec research;
To provide a structured taxonomy and benchmarking framework for future studies.

Paper Organization

This survey is structured to comprehensively analyze the integration of large language models (LLMs) in recommendation systems, covering foundational concepts, technical methodologies, applications, challenges, and future directions. Section 2 describes the materials and research methods, detailing the systematic literature review methodology used to collect and evaluate over 150 LLM4Rec papers from 2018 to 2024. It explains the selection criteria, search strategy, and classification procedures and is supported by a structured workflow diagram. Section 3 highlights the contributions and scope of the survey, the major areas examined, comparable works, and the challenges the research addresses. Section 4 presents an overview of traditional recommender systems, discussing collaborative filtering, content-based models, hybrid approaches, and their limitations, followed by an introduction to large language models (LLMs) detailing their architecture, key Transformer-based models, and advancements over traditional techniques. We compare the use of LLMs in natural language processing (NLP) with their use in recommendation systems, highlighting differences in data types, model architectures, evaluation metrics, and challenges. In Section 5, we describe the full LLM4Rec framework, covering LLM-based architectures, generative and discriminative reasoning techniques, prompt tuning and other tuning methods, multitask learning, and knowledge distillation. This section also covers benchmark protocols, reference datasets, benchmarking metrics, and key performance indicators. It further describes the technical issues facing LLM4Rec systems, such as scalability, latency, bias, and cold-start problems, and presents advanced methods developed to overcome them, such as multimodal fusion, real-time adaptation, federated learning, and lightweight edge deployment. Section 6 explores applications of LLM4Rec across diverse domains, including e-commerce, media, education, healthcare, and social media. Section 7 synthesizes trends observed across the surveyed works. It provides a comparative analysis of prominent models, reflects on the limitations of current LLM4Rec pipelines, and identifies unresolved issues and research gaps. Finally, Section 8 concludes with practical implications for recommender system design and future research directions.

2. Materials and Methods

This survey adopts a systematic literature review (SLR) methodology to comprehensively evaluate the intersection of large language models (LLMs) and recommender systems (LLM4Rec). A total of 150+ peer-reviewed publications and influential preprints published between 2018 and early 2024 were collected from databases such as IEEE Xplore, ACM Digital Library, arXiv, SpringerLink, and Google Scholar.

Research Strategy

The SLR process involved a multi-stage pipeline, as illustrated in Figure 1, to ensure transparency and reproducibility. The workflow included the following key steps:

Paper Identification: Keyword-based searches (e.g., “LLM4Rec”, “BERT recommender”, “multimodal recommendation”) were conducted to retrieve the relevant literature from scholarly databases.
Relevance Filtering and Inclusion: Only papers focusing on LLM-based recommendation tasks, benchmark evaluations, novel architectures, or technical challenges were retained. Duplicates and irrelevant works were excluded.
Taxonomical Classification: Selected works were categorized by architecture type (generative/discriminative), Transformer backbone (e.g., BERT, GPT, T5), modality (text, multimodal), and domain (e.g., healthcare, e-commerce).
Comparative Analysis: Each model was evaluated using standard metrics such as NDCG, Recall@K, AUC, BLEU, and latency/accuracy trade-offs. Fine-tuning strategies like prompt tuning, LoRA, and multitask learning were also compared.
Trend and Gap Identification: We analyzed recurring limitations (e.g., cold-start problems, bias, latency) and emerging solutions (e.g., retrieval-augmented generation, federated learning, multimodal fusion) to inform future directions.

This process enabled us to construct a structured, domain-agnostic synthesis of the LLM4Rec landscape, identifying both established practices and underexplored areas that warrant further research.
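To make the ranking metrics used in the comparative analysis concrete, the following is a minimal sketch of Recall@K and NDCG@K under a binary-relevance assumption. The function names and the toy ranking are illustrative choices, not taken from any surveyed implementation, and individual papers may use graded relevance or different normalization conventions.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # A hit at 0-based rank r contributes 1 / log2(r + 2).
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k])
              if item in relevant_items)
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: items "a" and "d" are relevant and retrieved, "z" is missed.
ranking = ["a", "b", "c", "d", "e"]
relevant = {"a", "d", "z"}
recall = recall_at_k(ranking, relevant, 3)  # only "a" is in the top 3
ndcg = ndcg_at_k(ranking, relevant, 3)
```

Recall@K ignores the position of a hit inside the top K, whereas NDCG@K discounts hits that appear lower in the list, which is why the two metrics are usually reported together.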

3. Contributions and Scope of the Survey

The development of LLM-powered recommendation systems has garnered significant interest, yet comprehensive reviews addressing the state of the art remain limited. This survey aims to fill this gap by presenting a thorough overview of LLM4Rec, emphasizing its methodologies, applications, and future directions. The primary contributions of this survey include

Applications Across Diverse Domains: We examine using LLM4Rec across multiple domains, including e-commerce, media, education, and healthcare [6,8]. This illustrates the versatility of LLMs in addressing domain-specific challenges and adapting to varied user behaviours.
Evaluation Framework and Benchmarking: We outline standard datasets, metrics, and practices for evaluating LLM4Rec systems, enabling meaningful comparisons with traditional recommendation approaches [9,10].
Technical Challenges and Research Possibilities: We discuss challenges such as scalability, privacy, and multimodal data integration and explore emerging areas like reinforcement learning for long-term user modelling and improving model efficiency [11,12].

This survey offers what we believe is the first comprehensive cross-domain synthesis of LLM-based recommender systems (LLM4Rec). Unlike prior work that tends to focus on specific application areas or isolated technical components, our approach brings together insights from diverse domains—including e-commerce, media, education, and healthcare—highlighting how LLMs are adapted to distinct recommendation challenges. In addition to this broad coverage, we introduce a unified evaluation framework that facilitates meaningful comparisons across models and contexts. Finally, by identifying underexplored yet promising directions such as reinforcement learning for long-term user modeling and multimodal recommendation strategies, we aim to provide a roadmap for future research. Together, these contributions position our work as a novel and timely resource for advancing the state of LLM4Rec.

Guiding Research Questions

This survey is anchored by a central inquiry into the evolving role of large language models (LLMs) in recommender systems. Specifically, it seeks to understand how these models can be leveraged to enable more personalized, scalable, and context-aware recommendations across a range of application domains and data modalities.

Primary Research Question:

How can large language models (LLMs) be effectively integrated into recommender systems to enhance personalization, ensure scalability, and support context-sensitive decision making across diverse domains and input types?

To support this goal, the following sub-questions are explored:

Sub Q1: In what ways are LLM architectures—such as BERT, GPT, and T5—adapted, fine-tuned, or prompt-engineered to handle core recommendation tasks including sequential prediction, relevance ranking, and content generation?
Sub Q2: What are the major deployment scenarios and use cases for LLM4Rec systems in sectors like e-commerce, healthcare, education, and social media? What domain-specific constraints (e.g., multilingual input, ethical considerations, regulatory frameworks) influence their design and performance?
Sub Q3: Which technical limitations persist in current LLM4Rec pipelines—particularly regarding computational efficiency, fairness, data sparsity, cold-start issues, and user privacy—and how are these being addressed in state-of-the-art research?
Sub Q4: How do LLM-based recommender models compare to traditional or hybrid systems in terms of key performance indicators such as NDCG, Recall@K, diversity, latency, and real-time responsiveness?

These questions guide the structure and analysis throughout the survey, informing both the taxonomy of existing models and the identification of open research challenges.

4. Foundations of Recommendation Systems and LLM Integration

Recommender systems have been implemented in various domains, utilizing users’ historical activities to provide them with customized recommendations. This section summarizes the foundational methods that shaped the development of recommendation engines, together with their strengths and weaknesses. Recommender systems, or recommendation engines, have evolved over time and are now based on collaborative filtering, content-based filtering, and hybrid systems.

4.1. Traditional Recommender Systems

Conventional recommender systems typically employ one of three strategies: collaborative filtering, content-based filtering, or hybrid approaches.

Collaborative filtering is one of the most widely used approaches in conventional recommendation algorithms. It is based on the premise that users who have interacted with or preferred similar items will behave similarly in the future. This technique can be divided into two forms: user-based [13] and item-based.

User-based collaborative filtering [13] identifies users whose profiles closely match those of a target user, on the assumption that users with similar ratings or interactions will have similar preferences. This makes it feasible to estimate the target user’s rating of a product from the ratings given by those similar users. The sparsity of the user–item interaction matrix, however, makes it difficult to find comparable users who can offer trustworthy recommendations. Item-based collaborative filtering, by contrast, concentrates on the relationships between items, assuming that users who have rated one item are likely to have rated other related items. Since there is typically more interaction data per item than per user, this method is less vulnerable to sparsity problems.
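The user-based variant can be sketched in a few lines. The example below is a minimal illustration, assuming cosine similarity over co-rated items and a similarity-weighted average for prediction; the rating data and function names are hypothetical, and production systems typically add mean-centering and neighbourhood selection.

```python
import math

# Toy user-item ratings (hypothetical data); absent keys mean "not rated".
ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 4, "item2": 3, "item3": 5, "item4": 4},
    "carol": {"item1": 1, "item2": 5, "item4": 2},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_user_based(target, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == target or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[target], other_ratings)
        numerator += sim * other_ratings[item]
        denominator += abs(sim)
    # No neighbour has rated the item: the sparsity problem described above.
    return numerator / denominator if denominator else None

prediction = predict_user_based("alice", "item4")
```

Because alice's ratings resemble bob's more than carol's, the prediction for item4 is pulled toward bob's rating. When no neighbour has rated the item, the function returns None, which is the sparsity failure mode in miniature.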

Content-based filtering: While collaborative filtering relies on user–item interaction data, content-based filtering uses the attributes of items and users to make recommendations, matching item characteristics against user preferences. It works well in item cold-start situations, since new items can be recommended from their attributes alone, before any user has interacted with them. It can, however, suffer from overspecialisation and depends on thorough item attribute information, which might not always be accessible. In some situations, content-based filtering offers a more flexible solution overall [14].
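As a rough illustration of content-based filtering, the sketch below builds a user profile by averaging the attribute vectors of liked items and ranks the remaining items by cosine similarity to that profile. The three-dimensional genre vectors and all names are hypothetical; real systems usually derive such vectors from text or metadata.

```python
import math

# Hypothetical item attribute vectors: [action, comedy, documentary].
item_features = {
    "movie_a": [1.0, 0.0, 0.0],
    "movie_b": [0.8, 0.2, 0.0],
    "movie_c": [0.0, 1.0, 0.0],
    "movie_d": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_profile(liked_items):
    """User profile = element-wise mean of the liked items' attribute vectors."""
    vectors = [item_features[item] for item in liked_items]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def recommend(liked_items, top_n=2):
    """Rank unseen items by similarity between their attributes and the profile."""
    profile = build_profile(liked_items)
    candidates = [item for item in item_features if item not in liked_items]
    candidates.sort(key=lambda item: cosine(profile, item_features[item]),
                    reverse=True)
    return candidates[:top_n]

recs = recommend(["movie_a"])
```

No interaction data from other users is needed, so a newly added item can be scored immediately. The same example also hints at overspecialisation: a user who has only liked action titles is steered toward more action titles.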

Hybrid filtering was introduced to mitigate the weaknesses of collaborative and content-based filtering. These systems combine the strengths of both approaches, leading to more accurate and robust recommendations. A typical hybrid approach blends collaborative filtering with content-based methods, leveraging user interaction data and item attributes for prediction. A notable example of a hybrid recommender system is Netflix’s recommendation engine. Netflix uses a combination of collaborative filtering, content-based filtering, and business rules to provide personalized recommendations to its users. This hybrid system allows Netflix to provide accurate recommendations even when user–item interaction data are sparse or incomplete [15]. Hybrid systems are highly versatile, but they also introduce challenges in terms of complexity. Combining multiple models requires significant computational resources, particularly when processing large-scale datasets. Designing a hybrid system that effectively integrates different techniques without redundancy is also complex.
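One simple way to blend the two signals is a weighted hybrid. The sketch below assumes both scores are already normalized to [0, 1] and falls back to the content score when no collaborative signal exists; the weight alpha, the scores, and the function names are illustrative assumptions, not a description of any production system such as Netflix's.

```python
def hybrid_score(cf_score, content_score, alpha=0.7):
    """Weighted blend of collaborative and content scores.

    Falls back to the content score alone when there is no collaborative
    signal (cf_score is None), e.g. for an item with no interactions yet.
    """
    if cf_score is None:
        return content_score
    return alpha * cf_score + (1 - alpha) * content_score

def rank_items(cf_scores, content_scores, alpha=0.7):
    """Rank every item that has a content score, highest blended score first."""
    blended = {item: hybrid_score(cf_scores.get(item), score, alpha)
               for item, score in content_scores.items()}
    return sorted(blended, key=blended.get, reverse=True)

# Hypothetical scores: item3 is new and has no collaborative score.
cf_scores = {"item1": 0.9, "item2": 0.2}
content_scores = {"item1": 0.5, "item2": 0.6, "item3": 0.8}
ranking = rank_items(cf_scores, content_scores)
```

The fallback branch is what lets the hybrid keep producing recommendations when interaction data are missing, at the cost of maintaining two scoring models.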

Traditional recommender system architectures face several critical challenges including sparsity [13], scalability [16], cold-start problems [14], static preferences, and lack of interpretability [16].

4.2. Large Language Models: Capabilities and Evolution

The introduction of LLMs such as BERT, GPT, and T5 revolutionized natural language understanding by capturing deep semantic relationships using self-attention mechanisms. Transformer-based models have become foundational in NLP due to their scalability, contextual reasoning, and transferability. To provide a comprehensive comparative analysis, a series of differentiated tables—namely, Table 1, Table 2, Table 3 and Table 4—accompany this section. These tables systematically evaluate the discussed models in terms of their design objectives, training methodologies, key application domains, core strengths, limitations, and the evaluation metrics used to benchmark their performance.

4.2.1. Bidirectional Encoder Representations from Transformers (BERT)

BERT has significantly advanced NLP by allowing the model to capture word context from both directions in a sentence, unlike earlier models that processed text in only one direction [17]. This bidirectional nature enhances BERT’s understanding of complex language structures. BERT uses masked language modelling during pretraining, where random words are hidden, and the model predicts them based on the surrounding context. This helps it learn deep language patterns that can be applied across various NLP tasks [17]. A key strength of BERT is its transfer learning capability. After pretraining on large datasets, BERT can be fine-tuned for tasks such as sentiment analysis, question answering, and text classification, making it highly versatile [17,18]. It has also been applied to domain-specific tasks, such as analyzing electronic health records [19,20]. Another critical feature is the use of attention mechanisms, which allows it to weigh the importance of words in a sentence, improving precision in tasks such as text classification [21]. Despite its strengths, BERT also introduces challenges, particularly around data privacy and biases, especially in sensitive areas like healthcare [19]. Addressing these issues is essential for the responsible deployment of BERT-based systems.
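The masked language modelling objective can be illustrated with a short, simplified sketch. It hides each token with a fixed probability (15% is the rate used in the original BERT recipe) and records the hidden token as the prediction target. For brevity it always substitutes the mask symbol, whereas BERT sometimes keeps the token or swaps in a random one; the function name and the example sentence are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM training example.

    labels[i] holds the original token where position i was masked and None
    elsewhere, so a loss would be computed only on masked positions.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            labels[i] = token
    return masked, labels

tokens = "the user bought a red jacket last week".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

Because the model sees the unmasked tokens on both sides of each gap, predicting the hidden token forces it to use left and right context together, which is the bidirectional property described above.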

4.2.2. Generative Pre-Trained Transformer (GPT)

GPT (Generative Pre-trained Transformer) is highly effective in NLP due to its ability to generate coherent text and understand context using the Transformer architecture. Its self-attention mechanism captures dependencies between words, enabling GPT to perform complex tasks efficiently [22]. The deep neural network structure enhances GPT’s performance across various NLP applications [22]. Another advantage of GPT is its multilingual capabilities. It outperforms traditional methods in analyzing text across multiple languages, making it valuable for cross-linguistic research and for handling lesser-spoken languages [23]. Its ability to function without additional training data further adds to its accessibility, allowing even non-experts to use it effectively with simple prompts [23]. GPT also excels in advanced text processing, quickly summarizing large volumes of text and identifying key concepts, which aids in tasks like literature reviews and recognizing emerging trends in research [24]. Despite these strengths, challenges such as processing limitations and ethical concerns remain, which need to be addressed for responsible and efficient deployment [25]. GPT-based models have become valuable in advancing recommender systems by leveraging their natural language generation and processing abilities. These models, especially when tailored or integrated with other frameworks, enhance the precision of recommendations and user experience across multiple fields. Various approaches to integrating GPT into recommender systems offer unique improvements and insights.
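The self-attention mechanism in an autoregressive model such as GPT is causal: each position may attend only to itself and earlier positions. The sketch below shows single-head scaled dot-product attention with that constraint, in pure Python. It deliberately omits the learned query, key, and value projections and the multiple heads of a real Transformer layer, so it should be read as an illustration of the masking idea only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal constraint.

    Position t attends only to positions 0..t. Returns the output vectors and
    the attention weights, with zeros filled in for the hidden future positions.
    """
    dim = len(queries[0])
    seq_len = len(queries)
    outputs, all_weights = [], []
    for t, query in enumerate(queries):
        scores = [sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        weights = softmax(scores)
        output = [sum(w * values[s][j] for s, w in enumerate(weights))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights + [0.0] * (seq_len - t - 1))
    return outputs, all_weights

# Toy 3-token sequence with 2-dimensional embeddings used as Q, K and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

The zero entries above the diagonal of the weight matrix are what make left-to-right generation possible: the representation of a token never depends on tokens that have not been generated yet.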

Table 1. Comparison of Transformer-based language models. Source: Author’s own design.

Aspect	BERT	GPT	T5	RoBERTa	XLNet	ALBERT
Training Objective	MLM + NSP	Causal Language Modelling	Text-to-Text Generation	MLM (no NSP)	Permutation Language Modelling	MLM + Sentence Order Prediction
Pre-training Method	Autoencoding	Autoregressive decoding	Seq2Seq Text Generation	Longer MLM with dynamic masking	Permuted language modeling (Autoregressive + Autoencoding)	Factorized embeddings + shared weights
Bidirectional Context	Yes	No	Encoder–decoder (Partial)	Yes	Yes (via permutations)	Yes
Multilingual Support	mBERT available	GPT-3+ multilingual	mT5 variant	XLM-R variant available	Not standard	Requires adaptation
Fine-tuning Requirement	Required	Often not required	Required but efficient	Required for most tasks	Required	Required
Key Applications	QA, NER, classification	Generation, Q&A, dialogue	All NLP tasks unified	Sentiment, news classification	Classification, recsys, long-text analysis	Classification, low-resource NLP
Strengths	Transfer learning, context-rich embeddings	Few-shot capable, coherent generation	Unified architecture, efficient finetuning	Improved pretraining, better generalization	Bidirectional + long-range modeling	Lightweight, faster training, memory-efficient
Limitations	Computation-heavy, privacy and bias concerns	Bias + hallucinations	Large pretraining cost, model size	Still computation-heavy, no generation	Complex training, high computation	Slightly less performance on multilingual tasks
Evaluation Metrics	GLUE (80.5%), SQuAD v1.1 (F1: 93.2), SQuAD v2.0 (F1: 83.1)	GLUE (72.8), RACE (Accuracy: 59.0), Story Cloze Test (Accuracy: 86.5)	GLUE (89.3), SQuAD v1.1 (F1: 92.8), CNN/DailyMail (ROUGE-L: 36.6)	GLUE (88.5), SQuAD v1.1 (F1: 94.6), RACE (Accuracy: 89.8)	GLUE (89.8), SQuAD v1.1 (F1: 94.5), RACE (Accuracy: 89.8)	GLUE (89.4), SQuAD v2.0 (F1: 89.1), RACE (Accuracy: 89.4)

4.2.3. Text-to-Text Transfer Transformer (T5)

The Text-to-Text Transfer Transformer (T5) has emerged as a versatile natural language processing (NLP) model, mainly due to its innovative text-to-text framework. By treating all NLP tasks—from translation to sentiment analysis—as text generation problems, T5 simplifies architecture and offers a unified approach to various applications [26,27]. This framework allows for easy fine-tuning, significantly improving task-specific performance without needing custom models [26]. T5’s effectiveness extends to multilingual and language-specific adaptations. The multilingual version, mT5, has shown impressive results across different languages, while language-specific models, such as the Indonesian version, achieve comparable performance to larger models but with more efficient memory and processing demands [27]. This adaptability makes T5 highly efficient, especially when computational resources are limited [27]. T5 also excels in resource utilization. Its architecture supports fine-tuning with less data, making it computationally efficient compared to models trained from scratch [28]. Despite these advantages, challenges remain in managing the significant computational resources required for pre-training large Transformer models like T5, emphasizing the need for ongoing research into more efficient architectures [29].
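The text-to-text framing amounts to serializing every task as an input string with a task prefix and a target string. The sketch below shows that serialization step only. The translation, summarization, and sentiment prefixes follow the style used in the original T5 setup, while the "recommend next item" prefix is a hypothetical addition to show how a recommendation task could be cast in the same form.

```python
# Task templates; the "recommend" entry is a hypothetical illustration.
TEMPLATES = {
    "translate": "translate English to German: {text}",
    "summarize": "summarize: {text}",
    "sentiment": "sst2 sentence: {text}",
    "recommend": "recommend next item: {history}",
}

def to_text_to_text(task, target, **fields):
    """Serialize one example as an (input_text, target_text) string pair."""
    return TEMPLATES[task].format(**fields), target

examples = [
    to_text_to_text("sentiment", "positive", text="A delightful film."),
    to_text_to_text("recommend", "wireless mouse",
                    history="laptop, laptop sleeve, usb-c hub"),
]
```

Since every example is a pair of strings, one model and one training loss can cover all tasks, which is the source of the simplicity the paragraph describes.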

4.2.4. XLNet

XLNet stands out for its advantages in text classification and recommendation tasks, owing to its hybrid architecture that integrates autoregressive and autoencoding capabilities. XLNet understands complex language patterns and user behaviour by effectively capturing bidirectional context, making it ideal for tasks requiring deep comprehension and precise predictions. In text classification, XLNet’s permutation-based training enables the model to capture bidirectional context more effectively than BERT or RoBERTa. This advantage is crucial for tasks involving intricate semantic relationships, as XLNet demonstrates superior performance on datasets with complex language structures [30,31]. Its ability to handle long-range dependencies enhances its performance in tasks like sentiment analysis and information retrieval, allowing it to outperform traditional models when interpreting extended text sequences [30,31]. Moreover, XLNet’s robustness across various datasets, such as AG News and TREC-QA, highlights its adaptability and high performance in diverse classification challenges [31]. In recommendation tasks, XLNet’s bidirectional architecture enables it to model both long-term and short-term user interests effectively, making it highly suitable for dynamic environments like e-commerce, where user preferences frequently shift [32,33,34]. Furthermore, XLNet efficiently incorporates side information, such as demographic data or user behaviour patterns, improving the relevance and personalization of recommendations, a limitation of many traditional models [34]. Despite its many strengths, XLNet does come with increased computational complexity and resource demands. In scenarios where computational efficiency is paramount, models like BERT and RoBERTa may still be preferred due to their ability to balance performance and resource consumption [35].
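Permutation-based training can be pictured as sampling a factorization order and predicting each token from the tokens that precede it in that order, not in the sentence. The sketch below computes, for one sampled order, which original positions are visible when each position is predicted. It is a conceptual illustration with hypothetical names and omits XLNet's two-stream attention and partial prediction.

```python
import random

def permutation_contexts(num_tokens, seed=0):
    """Sample a factorization order and list each position's visible context.

    Returns (order, contexts), where contexts[pos] is the sorted list of
    original positions that come before `pos` in the sampled order.
    """
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    contexts = {}
    for step, pos in enumerate(order):
        contexts[pos] = sorted(order[:step])
    return order, contexts

tokens = ["users", "who", "bought", "this", "also", "liked"]
order, contexts = permutation_contexts(len(tokens), seed=3)
```

Across many sampled orders, a given position is predicted from contexts that include tokens to its left and to its right, which is how an autoregressive objective ends up capturing bidirectional context.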

4.2.5. RoBERTa (A Robustly Optimized BERT Pretraining Approach)

RoBERTa enhances the performance of Transformer models through optimized pretraining strategies, making it particularly effective for text classification tasks. Its training refinements, which include extended training with larger datasets and batch sizes, distinguish it from models like BERT. Additionally, the omission of the next sentence prediction (NSP) task allows RoBERTa to concentrate on understanding individual sentence structures, which is essential for effective text classification. For pretraining optimization, RoBERTa is trained on a more extensive dataset and for more iterations than BERT. This approach allows it to capture nuanced language patterns more effectively, aided by dynamic masking that varies the masked tokens across epochs, enhancing generalization across different domains [36]. Removing the NSP task also enables RoBERTa to better understand sentence relationships, which is particularly advantageous for tasks requiring intricate semantic analysis [30]. RoBERTa also employs larger batch sizes and a refined learning rate schedule, contributing to its superior performance in various text classification benchmarks, such as news and scientific literature classification, where it outperforms other Transformer models [37,38]. Its robust cross-domain performance, as demonstrated in benchmark tasks such as GLUE and RACE [36], highlights its capability to comprehend diverse text types. This makes it a suitable candidate for downstream applications such as suggestion detection in online reviews.
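The difference between static and dynamic masking is easy to see in code. In the sketch below, static masking samples the masked positions once and reuses them every epoch, while dynamic masking re-samples them each epoch. The function names, sequence length, and epoch count are illustrative assumptions.

```python
import random

def sample_mask_positions(num_tokens, mask_prob, rng):
    """Choose the set of positions to mask for one pass over a sequence."""
    return {i for i in range(num_tokens) if rng.random() < mask_prob}

def static_masks(num_tokens, epochs, mask_prob=0.15, seed=0):
    """Static masking: one mask pattern is sampled and reused in every epoch."""
    rng = random.Random(seed)
    positions = sample_mask_positions(num_tokens, mask_prob, rng)
    return [positions for _ in range(epochs)]

def dynamic_masks(num_tokens, epochs, mask_prob=0.15, seed=0):
    """Dynamic masking: a fresh mask pattern is sampled for each epoch."""
    rng = random.Random(seed)
    return [sample_mask_positions(num_tokens, mask_prob, rng)
            for _ in range(epochs)]

static = static_masks(num_tokens=200, epochs=5)
dynamic = dynamic_masks(num_tokens=200, epochs=5)
```

With dynamic masking the model is asked to predict different tokens of the same sentence in different epochs, so a fixed corpus yields more varied training signal.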

4.2.6. ALBERT

ALBERT (A Lite BERT) is a Transformer-based model engineered to reduce computational costs and memory usage while maintaining performance comparable to BERT. It accomplishes this through several architectural innovations, making it particularly effective for text classification and recommendation tasks. One of ALBERT’s key strategies is its parameter-sharing technique across layers, significantly decreasing the total number of parameters compared to that of BERT. By utilizing the same parameters for feed-forward and attention layers across all Transformer blocks, ALBERT creates a more compact model [39]. Additionally, ALBERT employs factorized embedding parameterization, decoupling the hidden layer size from the vocabulary embedding size. This approach results in a smaller embedding matrix, minimizing the model’s size without compromising performance [39]. The reduced parameter size lessens memory requirements and accelerates training and inference times, enhancing ALBERT’s efficiency for large-scale text classification tasks [39]. Its architecture allows for high accuracy while maintaining scalability, which is especially advantageous for recommendation systems that need to process large datasets efficiently [39]. Despite its lightweight design, ALBERT demonstrates performance on par with larger models such as BERT and RoBERTa across various NLP tasks, including text classification and language identification [35,39]. Its capability to effectively capture contextual relationships makes it suitable for nuanced text analysis, including sentiment analysis and spam detection [30,40,41,42]. While ALBERT provides significant advantages in efficiency and scalability, it may not consistently outperform larger models like XLM-RoBERTa [43] in tasks requiring extensive multilingual capabilities or diverse language identification. 
This reflects the trade-off between model size and performance, which is crucial when selecting the appropriate model for specific NLP applications [39].
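The effect of the two parameter-reduction techniques can be checked with simple arithmetic. The sketch below compares a full V x H embedding matrix with the factorized V x E plus E x H form, and compares per-layer parameters with and without cross-layer sharing. The sizes (V = 30,000, H = 768, E = 128, 12 layers) are illustrative values in the range of a base-sized model, and the per-layer count is a hypothetical round number.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Parameter count of the token embedding.

    Without factorization the matrix is V x H. With factorization it becomes
    a V x E lookup table followed by an E x H projection.
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size

def encoder_params(params_per_layer, num_layers, shared):
    """Cross-layer sharing stores one layer's weights and reuses them."""
    return params_per_layer if shared else params_per_layer * num_layers

V, H, E = 30_000, 768, 128
full = embedding_params(V, H)            # 30,000 * 768
factorized = embedding_params(V, H, E)   # 30,000 * 128 + 128 * 768

PER_LAYER = 7_000_000  # hypothetical per-layer parameter count
unshared = encoder_params(PER_LAYER, 12, shared=False)
shared = encoder_params(PER_LAYER, 12, shared=True)
```

With these sizes the embedding shrinks from 23,040,000 to 3,938,304 parameters, and sharing reduces the stored encoder weights by a factor equal to the number of layers. Sharing reduces memory but not the number of layer computations, since every layer is still executed.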

4.2.7. Electra (Efficiently Learning an Encoder That Classifies Token Replacements Accurately)

Electra introduces significant advancements in NLP through its Replaced Token Detection (RTD) mechanism. Unlike BERT’s masked language model (MLM) approach, which only learns from a subset of tokens, Electra learns from all input tokens, making it considerably more sample-efficient [44]. This efficiency allows Electra to achieve performance comparable to or better than models like RoBERTa and XLNet while requiring significantly fewer computational resources [44]. In zero-shot learning tasks, Electra has demonstrated improvements of 8.4% and 13.7% over BERT and RoBERTa, respectively [45]. Notably, it achieved 90.1% accuracy on SST-2 without using any training data, underscoring its capabilities when training data is unavailable [45]. While GPT excels in text generation and T5 offers versatility through its text-to-text framework, Electra surpasses GPT on the GLUE benchmark, despite using fewer computational resources [44]. However, recent research suggests further improvements are possible by refining replacement sampling techniques through hardness prediction and focal loss to address inefficiencies in the model’s current method [46].
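The data side of Replaced Token Detection can be sketched as follows. Some tokens are swapped for alternatives, and every position receives a binary label saying whether it was replaced. In Electra the replacements come from a small generator network; here a uniform random draw from a toy vocabulary stands in for that generator, which is an assumption made purely to keep the example self-contained.

```python
import random

def corrupt_for_rtd(tokens, vocab, replace_prob=0.15, seed=0):
    """Build one Replaced Token Detection example.

    Returns (corrupted_tokens, labels) where labels[i] is 1 if position i
    holds a token different from the original and 0 otherwise. If the
    sampled replacement equals the original token, it counts as original.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < replace_prob:
            new_token = rng.choice(vocab)  # stand-in for the generator
        else:
            new_token = token
        corrupted.append(new_token)
        labels.append(int(new_token != token))
    return corrupted, labels

vocab = ["user", "item", "bought", "viewed", "liked", "the", "a"]
tokens = ["the", "user", "bought", "a", "item"]
corrupted, labels = corrupt_for_rtd(tokens, vocab, replace_prob=0.5, seed=2)
```

Every position carries a label, so the discriminator receives a training signal from the whole sequence instead of only the masked subset, which is the source of the sample efficiency noted above.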

Table 2. Comparison of Transformer-based models: Electra, BART, DistilBERT, Megatron-LM. Source: Author’s own design.

Aspect	Electra	BART	DistilBERT	Megatron-LM
Training Objective	Replaced Token Detection (RTD)	Denoising Autoencoding	Distilled Masked Language Modelling	Causal Language Modelling (Autoregressive)
Pre-training Method	Generator-discriminator with full token prediction	Noise-corruption and reconstruction	Knowledge distillation from BERT	Parallel training: tensor, pipeline, and data parallelism
Bidirectional Context	Yes (via discriminator)	Yes (encoder), No (decoder)	Yes (via distilled BERT)	No (unidirectional)
Multilingual Support	Not standard	Available via mBART	Available (e.g., distil-mBERT)	Varies by implementation
Fine-tuning Requirement	Yes	Yes	Yes (lightweight)	Often used for pretraining large LLMs
Key Applications	Text classification, QA, zero-shot NLU	Summarization, translation, recommender systems	Edge deployment, real-time NLP, cybersecurity	Scalable LLMs (e.g., GPT-3-like), RecSysLLM, few-shot learning
Strengths	Sample-efficient; outperforms BERT/RoBERTa on GLUE	Combines strengths of BERT + GPT; robust for sequence generation	60% smaller than BERT-base, retains 90% of BERT’s performance; fast	Enables trillion-parameter models; advanced parallelism for training efficiency
Limitations	Not ideal for generation; training method complexity	High computation needs; domain-specific challenges in real-time systems	Slightly less accurate than BERT in some tasks	Massive GPU and memory requirements; sensitive to prompts
Evaluation Metrics	GLUE (Acc, MCC, Corr), SQuAD (F1: 90.1)	GLUE, SQuAD, ROUGE (CNN: ROUGE-L: 44.16)	GLUE (97% of BERT), SQuAD (F1: 89.2, EM: 86.9)	Perplexity (WikiText103: 8.63), Accuracy (LAMBADA: 68.5%)

4.2.8. BART (Bidirectional and Auto-Regressive Transformers)

Combining the strengths of BERT’s bidirectional encoding and GPT’s auto-regressive decoding, BART offers a versatile solution for various NLP tasks. Its architecture enables strong performance in comprehension and generation tasks, such as summarization and translation, through its denoising autoencoder framework [47]. BART excels in text generation, outperforming models like RoBERTa and improving machine translation with a 1.1 BLEU score increase [47]. While BERT excels at understanding context and GPT is known for its strength in generating text, BART brings the best of both worlds by combining these capabilities. This makes it particularly effective for tasks like summarization and dialogue generation [47]. In the realm of recommender systems, BART’s bidirectional encoding allows it to model user interaction patterns effectively, much like BERT4Rec [6]. At the same time, its auto-regressive decoding helps bridge the gap between training and inference, leading to more accurate sequential predictions [47]. By using denoising strategies such as token deletion and infilling, BART is also better equipped to handle sparse data, boosting the reliability of recommendation models [47].
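The two denoising strategies named above, token deletion and infilling, can be sketched on an item-ID sequence as follows. This is an illustrative simplification: the span position and length are passed explicitly rather than sampled, and the helper names and item IDs are hypothetical. In both cases the uncorrupted sequence would serve as the reconstruction target for the decoder.

```python
import random

MASK = "<mask>"

def token_deletion(seq, p=0.2, rng=None):
    """Drop each token independently with probability p."""
    rng = rng or random.Random(0)
    return [x for x in seq if rng.random() >= p]

def text_infilling(seq, span_start, span_len):
    """Replace seq[span_start:span_start + span_len] with a single mask token.

    The model must infer both the content and the length of the missing span.
    """
    return seq[:span_start] + [MASK] + seq[span_start + span_len:]

history = ["item_1", "item_2", "item_3", "item_4", "item_5"]
deleted = token_deletion(history, p=0.3)
infilled = text_infilling(history, span_start=1, span_len=2)
# infilled hides item_2 and item_3 behind one mask token.
```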

4.2.9. DistilBERT

DistilBERT, a distilled and more efficient variant of the original BERT model, is an appealing choice for natural language processing (NLP) tasks, particularly in resource-limited environments. Through knowledge distillation, DistilBERT compresses BERT’s capabilities into a smaller, faster model that retains comparable performance while reducing computational requirements. Its primary benefits over BERT are a reduced model size, increased computational efficiency, comparable performance with faster training, and versatility across applications. The smaller model size allows for effective deployment on devices with limited computational power. SparseDistilBERT, for example, reduces BERT’s model size by 60% while preserving 90% of its performance and requires only 40% of the original training time [48]. This parameter reduction results in lower memory usage and faster inference, which is critical for real-time banking applications such as customer service automation. In this setting, DistilBERT has achieved high accuracy and F1 scores [49]. DistilBERT also performs well in other domains, including cybersecurity, where it accurately detected malicious PowerShell scripts [50], and lightweight sentiment analysis, where it improved accuracy on test data [51,52]. Its lower computational demands make it well suited to real-world deployments such as client-side web applications, edge devices, and embedded systems, and to industries such as finance and cybersecurity.
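To make the knowledge distillation step concrete, the sketch below computes a soft-target loss between teacher and student logits. It is a generic illustration under stated assumptions, not DistilBERT’s full training objective: the temperature value, the logits, and the function names are hypothetical, and only the soft-target term is shown.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # mimics the teacher
poor_student = [0.5, 4.0, 1.0]    # prefers a different class
loss_close = distillation_loss(close_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to incorrect classes.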

4.2.10. Megatron-LM

Megatron-LM is a framework designed to efficiently train large-scale language models using advanced model parallelism techniques, addressing GPU memory constraints and extensive computational requirements. By combining tensor, pipeline, and data parallelism, Megatron-LM scales models to trillions of parameters across thousands of GPUs, thereby enhancing training efficiency and throughput. This framework has significantly contributed to advances in natural language processing (NLP) tasks by enabling the training of very large Transformer models. Tensor parallelism splits individual model layers across GPUs, so that each GPU processes part of a layer’s computation, thus mitigating memory limitations [53]. Pipeline parallelism divides the model into stages that process different micro-batches concurrently, and its interleaved schedule improves throughput by over 10% with minimal memory overhead [53]. Data parallelism distributes data batches across GPUs, complementing tensor and pipeline parallelism to enhance scalability further [53]. Notable applications of Megatron-LM include Megatron-Turing NLG 530B, a 530-billion-parameter model that achieves strong performance in zero-, one-, and few-shot learning tasks, further showcasing the framework’s capability to handle very large language models [54]. Moreover, MEGATRON-CNTRL integrates external knowledge bases, enhancing controllability in text generation and diversifying content quality [55]. Megatron-LM and other large language models (LLMs) offer advanced capabilities for recommender systems by capturing complex semantic relationships and user preferences. LLM-based recommenders such as RecSysLLM incorporate reasoning capabilities to improve recommendation quality by integrating domain-specific knowledge and common-sense reasoning [56].
Despite their capabilities, LLMs like Megatron-LM face challenges, including sensitivity to input prompts and significant computational requirements. Methods such as hierarchical and recurrent summarization manage text-rich data in sequential recommendations, addressing some of these limitations [57].
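The idea behind tensor parallelism can be sketched without any GPU: a weight matrix is split by columns into shards, each shard is multiplied independently, and the partial outputs are concatenated. The code below is a single-process simulation under simplifying assumptions (the column count is divisible by the number of shards, and the helper names are hypothetical); it only demonstrates that the sharded computation reproduces the unsharded result.

```python
def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split matrix w into `parts` equal column blocks (one per simulated GPU)."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def column_parallel_matmul(x, w, parts=2):
    shards = split_columns(w, parts)
    # Each partial product could run on a different device.
    partials = [matmul(x, shard) for shard in shards]
    # Simulated all-gather: concatenate partial outputs along the columns.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
sharded = column_parallel_matmul(x, w, parts=2)
reference = matmul(x, w)
```

Each simulated device stores only its own column block of the weight matrix, which is the source of the memory savings.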

4.2.11. ERNIE (Enhanced Representation Through Knowledge Integration)

By incorporating external knowledge, ERNIE improves natural language processing (NLP) task performance. Based on BERT, ERNIE introduces knowledge masking techniques, namely entity-level and phrase-level masking, to enrich language understanding. Entity-level masking masks entire entities, which are often multi-word constructs. This encourages the model to develop representations that encapsulate the semantics of these entities, thus enhancing its proficiency in tasks like named entity recognition and question answering [58]. Complementing this, phrase-level masking conceals entire phrases that function as conceptual units. This enables ERNIE to better interpret sentence relationships, improving its performance in semantic similarity and sentiment analysis tasks. ERNIE’s efficacy is demonstrated through superior results across various Chinese NLP tasks, such as natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question answering. This underscores the effectiveness of integrating structured knowledge to achieve advanced comprehension and inference. Additionally, ERNIE’s knowledge inference capacity is particularly pronounced in cloze tests, where it predicts missing words by leveraging contextual and external knowledge. Comparative studies with other knowledge-enhanced models reveal diverse approaches to knowledge integration. While ERNIE and K-Adapter incorporate distinct types of knowledge, the degree and method of integration vary, suggesting that model performance can be significantly influenced by the choice of knowledge sources and integration strategies [59].
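Entity-level masking can be illustrated with a short sketch. It assumes that entity spans are already available (for example, from a named entity recognizer) and uses a hypothetical `entity_level_mask` helper. The difference from token-level masking is that a span is either hidden completely or left intact.

```python
import random

MASK = "[MASK]"

def entity_level_mask(tokens, entity_spans, mask_prob=0.5, seed=0):
    """Mask whole entity spans; entity_spans holds (start, end) pairs, end exclusive."""
    rng = random.Random(seed)
    masked = list(tokens)
    for start, end in entity_spans:
        if rng.random() < mask_prob:
            for i in range(start, end):
                masked[i] = MASK   # every token of the entity is hidden together
    return masked

tokens = ["Harry", "Potter", "was", "written", "by", "J.", "K.", "Rowling"]
spans = [(0, 2), (5, 8)]   # assumed to come from an upstream entity tagger
fully_masked = entity_level_mask(tokens, spans, mask_prob=1.0)
sampled = entity_level_mask(tokens, spans, mask_prob=0.5)
```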

4.2.12. FLAN (Fine-Tuned LAnguage Net)

FLAN (Fine-tuned LAnguage Net) is a language model by Google that improves zero-shot and few-shot learning through instruction tuning. Instruction tuning fine-tunes a pretrained model on a collection of tasks phrased as natural language instructions, which helps the model generalize to new instructions and perform tasks it has not seen before. In contrast to conventional models that need specific tuning for each task, FLAN learns from instruction-style task descriptions how to handle tasks it has never faced. This is why it surpasses other models, such as GPT-3, in zero-shot evaluation on most of the tasks considered. FLAN’s main contribution is to demonstrate the potential of instruction tuning, showing a pathway to models that require less and less task-specific guidance from the user [60].

Table 3. Comparison of Transformer-based models: ERNIE, FLAN, DeBERTa, UniLM. Source: Author’s own design.

Aspect	ERNIE	FLAN	DeBERTa	UniLM
Training Objective	Knowledge-Enhanced MLM (entity and phrase masking)	Instruction-tuned language modelling	Decoding-enhanced MLM with disentangled attention	Unified language modelling (multi-mask mode)
Pre-training Method	Entity- and phrase-level masking with external knowledge	Task-specific instruction fine-tuning	Content and position disentangled representations + enhanced decoding	Bidirectional, unidirectional, and seq2seq masking patterns
Bidirectional Context	Yes (BERT-based)	Yes (fine-tuned on prompts)	Yes (with disentangled attention)	Supports multiple attention modes
Multilingual Support	Chinese (ERNIE 1.0); extended in ERNIE 2.0+	Yes (FLAN-T5, multilingual-tuned)	Yes (via XLM-DeBERTa)	Not standard, but adaptable
Fine-tuning Requirement	Yes (for knowledge-specific tasks)	Often not needed (zero/few-shot capable)	Required but sample-efficient	Required per task type
Key Applications	QA, semantic similarity, NER, cloze prediction	Generalization across unseen instructions/tasks	Classification, QA, NLI, SQuAD	Translation, summarization, QA, generation
Strengths	Integrates structured knowledge for enhanced reasoning	Strong zero/few-shot learning, instruction-following	State-of-the-art contextual representation; improves over BERT/RoBERTa	Unified architecture for NLU and NLG
Limitations	Domain-specific models; limited multilinguality in base version	Instruction design impacts performance; lacks generative flexibility	More computation than BERT; not widely adopted in all toolchains	Complexity in managing multiple mask types; not plug-and-play
Evaluation Metrics	Outperforms BERT and XLNet on GLUE benchmark and 16 English tasks	Surpasses zero-shot GPT-3 on 20 of 25 tasks evaluated	Achieves state-of-the-art results on SuperGLUE benchmark	Improves CNN/DailyMail ROUGE-L to 40.51 and Gigaword ROUGE-L to 35.75

4.2.13. DeBERTa (Decoding-Enhanced BERT with Disentangled Attention)

DeBERTa, or Decoding-enhanced BERT with disentangled attention, is a Transformer-based model that extends BERT and RoBERTa with two new mechanisms: (1) disentangled attention, in which each token is represented by separate content and relative position vectors and attention weights are computed from both; and (2) an enhanced mask decoder, in which absolute position embeddings are added to the decoding layer when predicting masked tokens, making pre-training more effective. With these innovations, DeBERTa sets new records on benchmarks such as GLUE and SQuAD while being considerably more accurate and efficient than its predecessors [61].

4.2.14. UniLM (Unified Language Model)

UniLM (Unified Language Model) is a Transformer-based model trained for both natural language understanding and generation. It uses a shared Transformer architecture with different self-attention masks for different purposes: bidirectional masking for classification and other language understanding tasks, unidirectional masking for language generation tasks such as text generation, and sequence-to-sequence masking for translation and summarization. This flexibility allows UniLM to be used in a range of natural language processing tasks, and it performed well in translation, question-answering, and summarization tests [62].
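The three masking patterns can be written down explicitly. In the sketch below, `mask[i][j] == 1` means that position i may attend to position j; the function names are hypothetical, and the matrices are illustrative rather than taken from the UniLM implementation.

```python
def bidirectional_mask(n):
    """Every token sees every other token (understanding tasks)."""
    return [[1] * n for _ in range(n)]

def unidirectional_mask(n):
    """Each token sees itself and earlier tokens only (left-to-right generation)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def seq2seq_mask(src_len, tgt_len):
    """Source tokens see the whole source; target tokens see the source
    plus the target tokens up to and including themselves."""
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                mask[i][j] = 1
            elif i >= src_len and j <= i:
                mask[i][j] = 1
    return mask

s2s = seq2seq_mask(src_len=2, tgt_len=2)
causal = unidirectional_mask(3)
```

Because only the mask changes, the same Transformer parameters can be shared across all three objectives.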

4.2.15. CTRL (Conditional Transformer Language Model)

CTRL enables controllable text generation by conditioning the model on control codes that steer the generated output. Because it was trained on a dataset annotated with control codes, CTRL can create coherent domain-specific texts, which makes it suitable for creative writing applications. This explicit control over the context of the generated output differentiates it from other language models, such as GPT models [63].

Table 4. Comparison of Transformer-based models: CTRL, LaMDA, GLaM, CLIP, DALL·E. Source: Author’s own design.

Aspect	CTRL	LaMDA	GLaM	CLIP	DALL·E
Training Objective	Conditional generation using control codes	Dialogue-focused language modeling	Sparse mixture of expert language modeling	Contrastive language-image pretraining	Text-to-image generation
Pre-training Method	Supervised with control code annotations	Pretraining on dialogue-style corpora with safety filters	Autoregressive with sparse expert activation	Contrastive loss on image–text pairs	Autoregressive generation with discrete VQ-VAE image tokens
Bidirectional Context	Yes (encoder-style)	Yes, via optimized attention	No (autoregressive)	No (separate image and text encoders)	No (autoregressive image decoder)
Multimodal Support	No	No	No	Yes (image + text)	Yes (generates images from text)
Fine-tuning Requirement	Required for new control codes/domains	Fine-tuned for conversational quality	Not typical (few-shot capable)	Rarely fine-tuned (zero-shot)	Prompt-based; fine-tuning uncommon
Key Applications	Domain-specific text control (e.g., news, reviews)	Human-like multi-turn dialogue	Efficient large-scale language modeling	Zero-shot image classification, content understanding	Creative image synthesis from text prompts
Strengths	Control over generation content and domain	Sensible, specific, and engaging dialogue responses	Trillion-scale sparse model with high efficiency	Connects vision and language, zero-shot capable	High-quality, text-driven image generation
Limitations	Restricted to known control codes; limited flexibility	Still prone to hallucination; needs strong moderation	Sparse activation introduces architectural complexity	Limited generative capabilities; sensitivity to text encoding	Requires extensive compute; less robust for abstract prompts
Evaluation Metrics	Perplexity on WikiText-103: 62.3	Human Eval (SSI): Sensibleness: 92.3%, Specificity: 91.7%, Interestingness: 86.5%	SuperGLUE Accuracy: 80.4%, BERTScore: 0.89	ImageNet Zero-shot Accuracy: 76.2%, CLIPScore: 0.78	FID: 8.6, SSIM: 0.92, PSNR: 24.1, CLIPScore: 0.83

4.2.16. LaMDA (Language Model for Dialogue Applications)

Google developed LaMDA, a language model for open-ended dialogue applications. Dialogue models often drift away from the theme of the conversation or repeat the same response. LaMDA instead is trained on dialogue data and tuned to generate responses that are (1) sensible: appropriate within the context; (2) specific: focused on the question or input; and (3) engaging: fulfilling their role in the conversation. The work in [64] showed that this training process enables LaMDA to hold human-like and engaging conversations.

4.2.17. GLaM (Generalist Language Model)

GLaM (Generalist Language Model) is a language model built on a sparse mixture-of-experts architecture. The model activates only a subset of its parameters on each forward pass. This reduces the computational resources required while preserving strong performance in natural language understanding and generation tasks. GLaM scales well, making it applicable to larger language models in practical use [65].
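Sparse activation can be illustrated with a toy mixture-of-experts layer in which a gate selects the top two experts for an input and only those experts are evaluated. This is a minimal sketch under stated assumptions: the experts are scalar functions, the gate logits are given rather than learned, the selected gate weights are renormalized, and all names are hypothetical.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(gate_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_forward(x, experts, gate_logits, k=2):
    weights = top_k_gate(gate_logits, k)
    # Only the selected experts run; the others cost no computation.
    return sum(w * experts[i](x) for i, w in weights.items())

experts = [lambda x: x, lambda x: 2 * x, lambda x: 10 * x, lambda x: -x]
gate_logits = [0.0, 5.0, 5.0, -3.0]   # hypothetical router output for one token
output = moe_forward(1.0, experts, gate_logits, k=2)
```

With four experts and k = 2, half of the expert parameters are untouched for this input, which is how total model capacity can grow without a proportional increase in per-token computation.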

4.2.18. Multimodal Models: CLIP and DALL-E

CLIP (Contrastive Language–Image Pretraining) is a vision–language model trained on paired images and natural language text. It embeds images and text in a shared space and matches an image to candidate text descriptions, which enables zero-shot classification and makes it useful in many areas without task-specific training [66]. DALL-E takes a text description and creates matching high-fidelity images, showcasing the versatility of integrating textual and visual modalities. It supports image creation and other design projects [67].
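Zero-shot classification with a contrastive image–text model reduces to comparing one image embedding against the embeddings of candidate text prompts. The sketch below uses hand-written three-dimensional vectors in place of real encoder outputs, and the `logit_scale` value and function names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, text_embs, logit_scale=100.0):
    """Return {prompt: probability} from scaled cosine similarities."""
    prompts = list(text_embs)
    logits = [logit_scale * cosine(image_emb, text_embs[p]) for p in prompts]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return {p: e / total for p, e in zip(prompts, exps)}

# Toy embeddings; real encoders produce vectors with hundreds of dimensions.
text_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]
probs = zero_shot_classify(image_emb, text_embs)
best = max(probs, key=probs.get)
```

In a recommendation setting, the same comparison could match product images against textual category or attribute descriptions without labeled training data.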

4.3. LLMs in NLP vs. Recommendation

While both natural language processing (NLP) and recommendation systems utilize large language models (LLMs), they differ in primary objectives, data modalities, and technical structures. NLP is concerned with language comprehension and production, addressing tasks such as sentiment analysis, summarization, and translation [3,17]. BERT [3], GPT [68], and T5 [26] perform strongly on these tasks because they capture the structure and semantics of natural language text.

On the other hand, recommender systems focus on predicting users’ interests and creating tailored suggestions for specific items. This requires modeling user behaviors, interactions, and historical data within temporal frameworks. Examples of LLMs adapted for this purpose include BERT4Rec [6], which applies BERT to sequential recommendation, and ChatGPT-based systems [69], which enable users to search for items in an e-commerce context through dialogue.

Data Modalities and Pretraining Objectives: Most LLMs are pretrained on large unstructured corpora such as Wikipedia, where they learn from the structure and semantics of free-form text [26]. In contrast, recommendation systems rely on semi-structured and structured inputs, including user–item interaction logs, ratings, metadata, and reviews [70]. These data types require different pretraining objectives. For example, BERT4Rec [6] employs masked language modeling tailored to sequential behavior, while CoLLM [71] leverages rich textual metadata for generative recommendation.
Architectural Adaptations: Transformers remain the backbone of NLP due to their capacity to model long-range dependencies [22]. T5 reformulates diverse tasks into a unified text-to-text schema [26]. In recommendation systems, architecture design shifts toward hybrid or specialized models. XLNet4Rec [32,34] focuses on autoregressive learning over behavior sequences, while GraphRipple [72] integrates graph-based user–item relationships for deeper contextual modeling.
Evaluation Criteria: Evaluation frameworks differ notably. NLP tasks employ BLEU for translation [73], ROUGE for summarization [74], and perplexity for generative fluency. In contrast, recommender systems use metrics such as NDCG, Recall, Precision, and CTR to assess ranking and personalization quality [14]. BERT4Rec demonstrates improved NDCG over collaborative filtering baselines [6], and ChatGPT has shown promising gains in CTR when adapted for dialogue-based recommendation [69].
Context Representation: NLP systems derive context from linguistic dependencies across tokens and sentences [3]. Advances like DeBERTa enhance contextual encoding through disentangled attention [61]. Recommendation systems, on the other hand, use behavioral context such as clickstreams, timestamps, and user intent history. CoLLM [71] integrates such multimodal context to adapt recommendations to evolving user needs.
Challenges within Domains: NLP faces issues including ambiguity, polysemy, and generalization across domains [75,76]. Recommendation systems contend with data sparsity, cold-start problems, and real-time responsiveness [77]. UPRec [78] introduces user-aware pretraining to combat sparsity, while GPT4Rec [4] employs prompt-based tuning to efficiently personalize recommendations in real time with low overhead.

These foundational insights guide our understanding of how LLMs can be customized for recommendation, which is the focus of the next section.

5. Architecture, Optimization, and Technical Challenges in LLM4Rec

Building upon the foundational architectures introduced in Section 4.2, this section explores how large language models (LLMs) such as BERT, GPT, and T5 are adapted for recommendation tasks. Rather than redefining these models, we emphasize their integration strategies within LLM4Rec systems, ranging from generative conversational agents (e.g., GPT4Rec) to encoder-based sequential recommenders (e.g., BERT4Rec). These architectural adaptations are contextualized within two major application paradigms, namely discriminative and generative, each exploiting different strengths of Transformer-based models. The focus shifts from model mechanics to functional suitability, enabling a targeted analysis of how LLMs meet the domain-specific demands of recommendation systems.

LLM4Rec systems can be broadly categorized into discriminative and generative paradigms based on their functional approach and use cases.

5.1. Paradigms of LLM4Rec

5.1.1. Discriminative Paradigm

Discriminative LLM4Rec systems focus on identifying or ranking items that best match a user’s preferences based on given data. These models excel in tasks where classification, ranking, or prediction is the primary objective. For example, BERT4Rec fine-tunes a BERT-based architecture to predict the next item in a user’s sequence, leveraging the model’s bidirectional encoding to understand user–item interactions effectively. Examples of discriminative LLM4Rec systems include (1) BERT4Rec, which utilizes masked language modeling to predict the next item in sequential recommendations and (2) PALR, which focuses on learning personalization-aware ranking mechanisms for user–item interactions. Key characteristics of the discriminative paradigm include

Task Orientation: Designed for tasks like next-item prediction, click-through rate (CTR) estimation, and personalized ranking.
Fine-tuning: Most discriminative systems require supervised fine-tuning on domain-specific datasets to optimize performance.
Precision: These models are highly accurate for structured and feature-rich recommendation tasks.
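The masked-item formulation used by BERT4Rec-style models can be sketched as a data preparation step. During training, random items in the interaction sequence are hidden and become prediction targets; at inference, a mask token is appended to the history so that the model predicts the next item. The helper names, masking probability, and item IDs below are hypothetical.

```python
import random

MASK = "[MASK]"

def cloze_training_example(item_seq, mask_prob=0.2, seed=0):
    """Hide random items; return (inputs, targets) with targets keyed by position."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for pos, item in enumerate(item_seq):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets[pos] = item
        else:
            inputs.append(item)
    return inputs, targets

def next_item_query(item_seq, max_len=50):
    """Append a mask to the history; the model fills it with the predicted next item."""
    return (item_seq + [MASK])[-max_len:]

history = ["item_12", "item_7", "item_33", "item_5"]
inputs, targets = cloze_training_example(history, mask_prob=0.5)
query = next_item_query(history)
```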

5.1.2. Generative Paradigm

Generative LLM4Rec systems focus on producing outputs, such as generating personalized recommendations, simulating user interactions, or crafting explanatory narratives for recommendations. These models are beneficial in conversational systems and scenarios requiring open-ended responses. Examples of generative LLM4Rec systems include (1) GPT4Rec, which is fine-tuned to generate user-specific recommendations based on their history and preferences, and (2) RecMind, which simulates user–agent interactions to explore hypothetical user behaviors and preferences. Key characteristics of the generative paradigm include

Flexibility: It can handle diverse tasks, including multi-modal recommendations, conversational agents, and open-ended suggestions.
Pre-Trained Knowledge: Generative systems often perform well in no-tuning scenarios, leveraging their pre-trained knowledge for zero-shot or few-shot learning.
Creativity: These models can go beyond traditional recommendations, suggesting novel items or categories.

The discriminative and generative paradigms offer complementary strengths, making them suitable for different recommendation scenarios. A comparison is provided in Table 5. Combining insights from both paradigms allows hybrid approaches to harness the strengths of generative and discriminative models, paving the way for more robust and versatile recommendation systems.

5.2. Architecture and Design of LLM4Rec

The architecture of LLM4Rec systems as seen in Figure 2 integrates the core functionalities of LLMs with specific design considerations for recommendation tasks. Integrating user and item representations is essential for accurate predictions in LLM4Rec.

Latent Representations: User embeddings are derived from interaction histories, while item embeddings incorporate features like product metadata and textual reviews [81].
Attention-Based Integration: Models like XLNet4Rec capture sequential user behaviors and align them with item attributes for enhanced personalization [34].

Traditional embedding techniques focus only on static features, while LLM-based methods dynamically adjust embeddings to capture evolving interactions. To handle the unique requirements of recommendations, LLM vocabularies are extended to

User and Item IDs: By incorporating IDs into the token vocabulary, models can directly associate embeddings with specific users and items [6];
Dynamic Tokenization: Frequent updates to embeddings enable real-time adaptation to new users or products [82].
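Extending the vocabulary with user and item IDs amounts to assigning each ID its own token and a fresh index. The sketch below shows this on a plain dictionary; the token format `<user_...>` and `<item_...>` is a hypothetical convention, and in a real model the embedding matrix would need to grow by the same number of rows.

```python
def extend_vocabulary(vocab, user_ids, item_ids):
    """Return a copy of vocab with one new token per user ID and item ID.

    Assumes existing token indices are contiguous from 0.
    """
    extended = dict(vocab)
    for prefix, ids in (("user", user_ids), ("item", item_ids)):
        for raw_id in ids:
            token = f"<{prefix}_{raw_id}>"
            if token not in extended:
                extended[token] = len(extended)
    return extended

base_vocab = {"[PAD]": 0, "[UNK]": 1}
vocab = extend_vocabulary(base_vocab, user_ids=[7], item_ids=[42, 43])
new_rows = len(vocab) - len(base_vocab)   # embedding rows to add
```

Calling the function again when new users or products arrive adds only the missing tokens, which is one simple way to support the frequent updates mentioned above.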

Attention mechanisms enhance LLM4Rec by capturing dependencies between user interactions and item attributes:

Self-Attention: Captures temporal user behaviors by analyzing sequences of past interactions [6].
Cross-Attention: Combines textual metadata with user interactions, as seen in models integrating product reviews with user activity [72].

The following prompts enable LLMs to adapt to recommendation tasks without retraining:

Soft Prompts: Embedding prompts are injected into the input sequence, improving task-specific adaptability [83].
Hard Prompts: Predefined natural language queries guide LLMs in generating recommendations or explanations [69,84].
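A hard prompt is simply a fixed natural language template filled with user and item information. The sketch below shows one possible template; the wording, the function name, and the example titles are hypothetical and are not drawn from any of the cited systems.

```python
def build_hard_prompt(history, candidates, k=3):
    """Fill a fixed template with the user's history and the candidate items."""
    history_text = ", ".join(history)
    candidate_text = "\n".join(
        f"{i + 1}. {title}" for i, title in enumerate(candidates)
    )
    return (
        f"The user recently interacted with: {history_text}.\n"
        f"Candidate items:\n{candidate_text}\n"
        f"Recommend the top {k} candidates for this user "
        f"and briefly explain each choice."
    )

prompt = build_hard_prompt(
    history=["The Matrix", "Inception"],
    candidates=["Interstellar", "Titanic"],
    k=1,
)
```

The resulting string would be sent to an LLM unchanged, so no model parameters are updated. A soft prompt, by contrast, replaces the fixed wording with trainable embedding vectors.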

5.3. Methodologies in LLM4Rec

LLM4Rec systems employ advanced methodologies to address scalability, personalization, and interoperability. For example, pre-trained LLMs are fine-tuned for recommendation tasks.

Domain-Specific Fine-Tuning: Models like Clinical BERT have been adapted for healthcare recommendations by fine-tuning on clinical datasets [85].
Lightweight Fine-Tuning: Adapters and LoRA (low-rank adaptation) reduce computational overhead while retaining performance [86].
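The saving offered by low-rank adaptation follows from its parameterization: the pretrained weight W stays frozen, and only two small matrices B and A are trained, giving an effective weight W + (alpha / r) · B · A. The sketch below computes this for toy matrices and counts trainable parameters; the dimensions and the values of alpha and r are illustrative assumptions.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A with W: d_out x d_in, B: d_out x r, A: r x d_in."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Compare full fine-tuning with LoRA for one weight matrix."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}

w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # frozen pretrained weight (2 x 3)
a = [[0.1, 0.2, 0.3]]                     # trainable, r x d_in with r = 1
b_init = [[0.0], [0.0]]                   # B starts at zero, so W is unchanged
b_trained = [[1.0], [2.0]]                # hypothetical values after training
start = lora_effective_weight(w, a, b_init, alpha=1, r=1)
adapted = lora_effective_weight(w, a, b_trained, alpha=1, r=1)
counts = trainable_params(d_out=768, d_in=768, r=8)
```

For a 768 × 768 matrix and rank 8, the adapter trains 12,288 parameters instead of 589,824, roughly 2% of the original.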

In addition to fine-tuning, prompt optimization adapts LLMs to recommendation in two ways. With manual prompts, a structured query, such as “What are similar products to X?”, guides the model in generating recommendations [87]. With prompt tuning, gradient-based optimization of soft prompts improves performance without full model fine-tuning [88]. Moreover, mutual regularization techniques enhance the robustness of LLM4Rec through

Cross-Modality Regularization: Aligns text and metadata embeddings to improve multimodal recommendations [89].
Adaptive Aggregation: Dynamically combines the characteristics of the user and the item according to the interaction context.

To address sparsity in LLM4Rec, data augmentation is used through (1) synthetic data generation, where generating reviews for underrepresented items mitigates cold-start problems, and (2) self-supervised pretraining tasks, such as masked language modeling (MLM) or ELECTRA’s replaced token detection, which enable more efficient learning of recommendation patterns [44]. Lastly, multi-task learning and knowledge distillation improve scalability through

Multi-Task Frameworks: Jointly training on ranking, classification, and explanation tasks improves the model’s generalization [90].
Distillation: Compressing large models into smaller, task-specific ones, such as DistilBERT, improves scalability [91].

5.4. Performance Evaluation and Benchmarking of LLM4Rec

Evaluating and benchmarking large language models for recommendation systems (LLM4Rec) is critical for assessing their performance across diverse domains and scenarios. This section comprehensively discusses the datasets, metrics, comparative analyses, and challenges of benchmarking LLM4Rec.

5.4.1. Commonly Used Datasets and Benchmarks

The datasets used for LLM4Rec span various domains and offer structured and unstructured data. They differ in scale, granularity, and contextual richness, making them suitable for testing the adaptability of LLMs in real-world recommendation scenarios.

E-commerce and Retail:
- Amazon Review Data [81]: This dataset includes user reviews, ratings, and metadata for millions of products, supporting text-based and hybrid recommendation tasks.
- Taobao Dataset [92]: This is a large-scale dataset capturing user behaviors, including clicks, purchases, and reviews, often used for sequential and session-based recommendations.
- AliExpress Dataset [93]: This dataset focuses on cross-border e-commerce, combining multilingual reviews with user interaction logs to evaluate cross-language recommendations.
News and Media:
- Microsoft News Dataset (MIND) [94]: This dataset contains news articles, click behaviors, and user session data, making it a benchmark for contextualized and personalized news recommendations.
- Adressa Dataset [95]: This includes user clicks and reading behaviors on Norwegian news websites, testing the multilingual capabilities of LLMs.
- MIND Your Language Dataset [96]: This dataset provides multilingual news articles with user interaction data, offering benchmarks for content-based and cross-lingual recommendations.
Social Media and Streaming:
- MovieLens [97] features user–item movie ratings, serving as a baseline for collaborative filtering and hybrid models.
- Spotify Dataset [98] captures user interactions with playlists, songs, and artists, ideal for music recommendations.
- YouTube Dataset [99] offers insights into video watch behaviors, enabling sequential and content-based recommendation evaluations.
Educational Platforms:
- EdNet [100] contains hierarchical data from online education platforms, enabling personalized learning pathway recommendations.
- ASSISTments [101] focuses on student performance in quizzes, allowing for adaptive learning recommendations.
- KDD Cup 2010 Educational Data Challenge [102] tests knowledge tracing models by evaluating student responses to educational content.
Healthcare and Lifestyle:
- Synthea [103] simulates electronic health records (EHRs) with clinical notes, supporting health-related recommendations.
- HealthTweets [104] consists of health-related tweets, enabling sentiment-aware lifestyle recommendations.
- HeartSteps Dataset [105] tracks physical activity and contextual factors, which are useful for fitness app recommendations.

These datasets collectively test various aspects of LLMs, including their ability to handle sequential data, contextual embeddings, and domain-specific nuances.

5.4.2. Evaluation Metrics and Performance Indicators

Performance evaluation of LLM4Rec involves both traditional metrics and emerging indicators tailored to the specific challenges of large-scale models. Key metrics include

Ranking Metrics:
- Precision@K and Recall@K evaluate the accuracy of the top-K recommendations.
- NDCG@K (Normalized Discounted Cumulative Gain) assesses ranking quality by accounting for the positions of relevant items in the recommendation list [106].
Relevance and Diversity Metrics:
- Intra-List Diversity (ILD) measures the variety of items in recommendation lists.
- Coverage evaluates the system’s ability to recommend items across the entire catalog [92,107].
Engagement Metrics:
- Click-Through Rate (CTR) measures the likelihood of a user clicking on recommended items.
- Dwell Time indicates user satisfaction by tracking how long users interact with recommended content [97].
Contextual Metrics:
- Temporal Adaptability evaluates how well recommendations evolve with changing user preferences.
- Sentiment Sensitivity measures the model’s ability to align recommendations with user sentiments, especially in domains like health and wellness [104].

LLM4Rec systems excel in improving ranking metrics such as NDCG and Recall while also addressing diversity and relevance challenges.
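The ranking and coverage metrics listed above can be computed in a few lines. The sketch below assumes binary relevance (an item is either relevant or not) and uses the common logarithmic discount for NDCG; the function names and the toy data are hypothetical.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def catalog_coverage(recommendation_lists, catalog_size):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = set()
    for items in recommendation_lists:
        recommended.update(items)
    return len(recommended) / catalog_size

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, relevant, 3)
cov = catalog_coverage([["a", "b"], ["b", "c"]], catalog_size=10)
```

In this example both relevant items are retrieved in the top three, so recall is 1, but NDCG is below 1 because the second relevant item sits at rank three rather than rank two.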

5.5. Technical Challenges in LLM4Rec

The integration of large language models into recommender systems (LLM4Rec) introduces a set of complex challenges that span computational scalability, linguistic generalization, ethical fairness, and system responsiveness. Despite growing interest, many early claims in the literature lack empirical support or are derived from misinterpreted findings, which necessitates a more grounded discussion.

Cross-domain transfer continues to be a major obstacle. Although T5 [26] demonstrates strong performance in transfer learning for NLP, its ability to generalize across domains in recommendation settings is limited, with noticeable performance degradation on unfamiliar domains. Likewise, while multilingual models like mBERT and XLM-R offer robustness for high-resource languages, their effectiveness in low-resource settings remains questionable, as shown through experiments on datasets such as Adressa [95]. Moreover, claims often attributed to EdNet [100] lack substantiation; there is no verified evidence that this dataset addresses LLM scalability for recommendations.

Another recurring challenge is the misalignment between the objectives of language modeling and the needs of recommendation. While LLMs such as GPT-4 excel at natural language tasks, they do not inherently model the user–item interactions required for effective recommendations. BERT4Rec [6] partially addresses this gap by modeling interaction sequences as pseudo-language, but achieving high-quality outputs still demands substantial fine-tuning. Domain-specific models such as ClinicalBERT [85], although trained on rich medical text, are not optimized for structured data like lab results, further limiting generalization.

Scalability is another serious constraint. The computational load of models like GPT-3, with its 175 billion parameters [87], makes real-world deployment expensive and often infeasible. DistilBERT [91] offers a lighter alternative with faster inference, but these speed gains come at the cost of reduced performance. The environmental cost also cannot be ignored; training these models incurs a substantial carbon footprint, as highlighted by Strubell et al. [108].

Long-tail sparsity and cold-start problems continue to hinder recommendation quality. Traditional collaborative filtering approaches underperform in sparse domains [81]. Recent hybrid LLM-based strategies, such as those proposed by Ding et al. [109], aim to route queries between small and large models based on expected complexity, but they introduce additional overhead in model orchestration and inference latency.
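The routing idea mentioned above can be illustrated with a minimal sketch. It is not the method of Ding et al. [109]: a hand-written heuristic stands in for a learned router, and `complexity_score`, the constraint word list, the weights, and the threshold are all assumptions made for illustration.

```python
def complexity_score(query: str) -> float:
    """Hypothetical heuristic in [0, 1]: longer queries with more constraints score higher."""
    words = query.lower().split()
    constraint_words = {"but", "not", "without", "except", "under", "between"}
    length_part = min(len(words) / 20.0, 1.0)
    constraint_part = min(sum(w in constraint_words for w in words) / 3.0, 1.0)
    return 0.5 * length_part + 0.5 * constraint_part


def route(query, small_model, large_model, threshold=0.3):
    """Send queries expected to be easy to the small model and the rest to the large one."""
    model = large_model if complexity_score(query) >= threshold else small_model
    return model(query)


# Stand-in models: any callables taking a query string would work here.
small = lambda q: ("small", q)
large = lambda q: ("large", q)

print(route("running shoes", small, large))
print(route("waterproof hiking boots under 100 dollars but not leather and without laces",
            small, large))
```

The orchestration overhead noted in the text shows up here as the extra scoring step on every request. A learned router would add its own inference cost on top of that.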

Speed remains critical in production environments. Real-time applications, such as those operated by YouTube [99] and Spotify [98], rely on highly optimized infrastructure that can deliver responses within milliseconds. Current LLMs—even when distilled or pruned—struggle to meet these latency requirements, making them less suitable for immediate-response recommendation systems.

Bias and fairness also remain unresolved concerns. Studies have shown that models like GPT-3 can reproduce harmful stereotypes [110], and while fairness-aware approaches like counterfactual fairness [111] exist, they are not yet native to most LLM pipelines. Survey work such as that by Mehrabi et al. [112] outlines potential mitigation strategies, but operationalizing fairness remains an open technical and ethical challenge.

Lastly, transparency and privacy are key to building trust. Research by Carlini et al. [113] reveals how sensitive training data can be inadvertently exposed by generative models. Although explainability tools like SHAP and LIME [114] can provide post hoc interpretations, most LLM-based recommenders are still black boxes. Moreover, giving users agency over the recommendation process—such as feedback control or preference adjustments—is rarely incorporated into existing LLM4Rec systems.

5.6. Emerging Techniques

To address the main issues of LLM4Rec systems, namely computational inefficiency, limited adaptability, multimodal complexity, and cross-domain sparsity, researchers have proposed several emerging approaches. These methods aim to improve scalability, accuracy, and personalization while also supporting fairness, privacy, and near-real-time responsiveness.

Several approaches have emerged to increase model efficiency and lower the computational burden. Knowledge distillation, as used in DistilBERT, a smaller variant of BERT, compresses a large model while retaining most of its capabilities; DistilBERT is reported to be about 40% smaller and 60% faster at inference than BERT while retaining roughly 97% of its language-understanding performance [91]. Sparse-attention models such as Longformer alleviate the quadratic cost that self-attention incurs on long sequences [115]. Federated learning allows models to be trained on user devices in a privacy-preserving manner, which also decentralizes the computational workload [116].
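As a rough illustration of knowledge distillation, the sketch below computes the soft-target term of a distillation loss: the KL divergence between temperature-softened teacher and student distributions. It shows only this one term under assumed toy logits. It is not the full training objective of DistilBERT [91], which combines several losses.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three candidate items.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher))  # student matches the teacher
print(distillation_loss([3.0, 2.0, 1.0], teacher))  # student disagrees with the teacher
```

A higher temperature spreads the teacher's probability mass over more items, so the student also learns from the relative ordering of lower-ranked candidates.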

To sustain user engagement over long horizons, reinforcement learning (RL) frameworks have been incorporated into LLM4Rec. RecSim provides a configurable simulation of user behavior in which such agents can be trained [117]. In addition, user feedback and interaction signals allow the model to evolve dynamically, so that the recommendation strategy is adjusted continually [118].
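The feedback loop described above can be sketched with a deliberately simple stand-in. The example swaps full reinforcement learning for an epsilon-greedy bandit and replaces RecSim with a one-line simulated user; it does not use RecSim's API. The class name, click probabilities, and hyperparameters are assumptions for illustration.

```python
import random


class EpsilonGreedyRecommender:
    """Keeps a running mean reward per item and mostly recommends the best one."""

    def __init__(self, n_items, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_items
        self.values = [0.0] * n_items          # running mean reward per item
        self.rng = random.Random(seed)

    def recommend(self):
        if self.rng.random() < self.epsilon:   # explore a random item
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda i: self.values[i])  # exploit

    def update(self, item, reward):
        """Incremental mean update from one feedback signal (e.g., click = 1.0)."""
        self.counts[item] += 1
        self.values[item] += (reward - self.values[item]) / self.counts[item]


def simulate(click_probs, steps=5000, seed=0):
    """Simulated user: clicks item i with probability click_probs[i]."""
    agent = EpsilonGreedyRecommender(len(click_probs), seed=seed)
    user_rng = random.Random(seed + 1)
    for _ in range(steps):
        item = agent.recommend()
        reward = 1.0 if user_rng.random() < click_probs[item] else 0.0
        agent.update(item, reward)
    return agent


agent = simulate([0.1, 0.8, 0.3])
print(agent.counts, [round(v, 2) for v in agent.values])
```

A bandit optimizes immediate reward only. Modeling long-term engagement, as the RL frameworks cited here intend, additionally requires user state and delayed rewards.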

Multimodal learning is increasingly adopted to improve contextual understanding. RLMRec, for instance, aligns text and image embeddings for representation learning [119]. Prompt-based fusion techniques combine behavioral logs, reviews, and item images using structured prompts [4], whereas attention-guided fusion networks use hierarchical attention layers to capture inter-modal relations [120]. Although performance has improved, representation alignment and computational cost remain open issues, which Transformer-based fusion approaches [121] and context-sensitive alignment frameworks [122] aim to address.

To overcome the limitations of traditional recommenders, systems require continual learning and real-time adaptation. Structural prompting allows recommendations to be updated in response to user activity [4]. In RLMRec, user and item embeddings are dynamically aligned with behavioral patterns using continual learning [119]. Moreover, zero-shot ranking enables LLMs to perform new recommendation tasks from conditional prompts without retraining, so they can respond to new tasks immediately [5].
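A minimal sketch of prompt-based zero-shot ranking is given below. It only builds the prompt and parses a reply; no model is called. The prompt wording, the letter labels, and both function names are assumptions for illustration rather than the template used in [5]. Candidate titles are assumed to be unique, and at most 26 candidates are supported.

```python
def build_ranking_prompt(history, candidates):
    """Turn an interaction history and a candidate set into a ranking instruction."""
    history_text = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(history))
    labels = [chr(ord("A") + i) for i in range(len(candidates))]
    candidate_text = "\n".join(f"{l}. {t}" for l, t in zip(labels, candidates))
    return (
        "I have interacted with the following items, in order:\n"
        f"{history_text}\n\n"
        "Rank these candidate items by how likely I am to choose them next:\n"
        f"{candidate_text}\n\n"
        "Answer with the candidate letters only, most likely first, separated by commas."
    )


def parse_ranking(response, candidates):
    """Map a reply such as 'B, A' back to candidate titles, ignoring unknown labels."""
    labels = {chr(ord("A") + i): t for i, t in enumerate(candidates)}
    ranked = []
    for token in response.upper().replace(" ", "").split(","):
        if token in labels and labels[token] not in ranked:
            ranked.append(labels[token])
    # Append anything the model omitted so the output is always a full ranking.
    ranked.extend(t for t in candidates if t not in ranked)
    return ranked


prompt = build_ranking_prompt(["The Matrix", "Inception"], ["Tenet", "Notting Hill", "Dune"])
print(prompt)
print(parse_ranking("C, A", ["Tenet", "Notting Hill", "Dune"]))
```

Because the task is expressed entirely in the prompt, changing the recommendation task means changing the template rather than retraining the model.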

Generalizing across domains and languages remains a central challenge. Meta-learning frameworks such as MAML allow models to adapt to low-resource domains with minimal supervision [123]. Domain-specific models like ClinicalBERT, which are pretrained on medical corpora, perform significantly better in healthcare recommendation tasks [85]. Cross-domain transfer can be achieved by reformulating item metadata into coherent sentences that serve as inputs to pretrained LLMs [124,125]. Cross-lingual feedback and multilingual datasets like Amazon-M2 enhance multilingual recommendation [126,127], and recent alignment approaches tackle the difficulty of integrating information uniformly across languages [128].
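The reformulation of item metadata into sentences can be sketched as a simple template. The field names (`title`, `category`, `brand`, `attributes`) and the sentence pattern are hypothetical; the cited works [124,125] may use different schemas and wording.

```python
def item_to_sentence(item: dict) -> str:
    """Render structured item metadata as one sentence usable as LLM input."""
    parts = [f"{item['title']} is a {item['category']} item"]
    if item.get("brand"):
        parts.append(f"from {item['brand']}")
    if item.get("attributes"):
        parts.append("with " + ", ".join(item["attributes"]))
    return " ".join(parts) + "."


# Hypothetical catalog entry.
print(item_to_sentence({
    "title": "Trail Runner 2",
    "category": "footwear",
    "brand": "Acme",
    "attributes": ["waterproof", "lightweight"],
}))
```

Since the output is plain text, the same pretrained model can consume items from different domains without domain-specific ID embeddings.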

Real-time and mobile environments demand lightweight models and edge deployment. Layerwise Unified Compression (LUC) reduces model size through pruning and quantization [129]. Hardware accelerators such as Expedera’s NPUs support low-power on-device inference [130]. Domain-specific models, for example Codex for code and ClinicalBERT for health, offer efficient alternatives to general-purpose LLMs [85]. Furthermore, parameter-efficient tuning methods such as LoRA (low-rank adaptation) lower the computational cost of domain adaptation [86].
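The saving from low-rank adaptation can be seen in a small sketch. A frozen weight matrix W is augmented by a trainable product B·A of rank r, so the adapted layer computes y = W x + (alpha / r) · B (A x). The matrices, the scaling value, and the layer sizes below are toy assumptions; the code uses plain Python lists rather than any particular LoRA library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    r = len(A)                          # rank = number of rows of A (A is r x d_in)
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # low-rank path (B is d_out x r)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Parameter counts for full fine-tuning versus a rank-r LoRA adapter on one matrix."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                        # rank r = 1
x = [1.0, 2.0]

print(lora_forward(W, A, [[0.0], [0.0]], x, alpha=1.0))  # B = 0: output equals W x
print(lora_forward(W, A, [[1.0], [2.0]], x, alpha=1.0))  # after a hypothetical update of B
print(trainable_params(4096, 4096, 8))
```

With B initialized to zero the adapted layer starts out identical to the pretrained one, and for a 4096 × 4096 matrix a rank-8 adapter trains 65,536 parameters instead of about 16.8 million.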

These emerging approaches, summarized in Table 6 together with possible solutions from the literature, collectively illustrate the increasing sophistication and maturity of LLM4Rec systems. They point toward hyper-personalized recommendations that are tailored to the user’s context and content, optimized for real-time adaptation, and aligned with ethical guidelines.

6. Applications of LLM4Rec Across Domains

Large language models for recommender systems (LLM4Rec) have emerged as transformative tools for delivering personalized, context-aware recommendations across diverse domains. LLMs have demonstrated exceptional adaptability and performance in handling complex recommendation scenarios by leveraging structured and unstructured data. Their integration into e-commerce, media, education, and healthcare has revolutionized personalization.

6.1. E-Commerce and Retail

LLMs such as ChatGPT and GPT-4 can process and understand user intent from natural language queries, making them ideal for generating conversational recommendations that align closely with user preferences [134].

One notable contribution is the LLM-KERec framework, which integrates LLM-generated inferential knowledge graphs to better capture user intent transitions and handle cold-start items in dynamic retail environments [135]. By incorporating complementary domain knowledge into recommendation pipelines, these systems outperform traditional neural recommenders in both accuracy and novelty.

Additionally, LLM-PKG combines product-level language understanding with curated prompt engineering to extract structured knowledge and generate explanations for recommended items. This framework significantly enhances user trust and engagement on e-commerce platforms by making recommendations more interpretable [136].

Conversational recommender systems have also seen notable advances. Recent studies have explored LLMs as collaborative agents in pre-sales dialogues, where either the LLM or the CRS system leads, enhancing both engagement and relevance in product discovery sessions [137]. These systems leverage LLMs’ generative power while retaining task-specific knowledge for domain adaptation.

Moreover, a comprehensive survey by Xiang et al. [138] outlines how Transformer-based LLMs contribute to intelligent recommendation tasks such as sentiment-aware ranking, description generation, and multi-turn query understanding. Their versatility across diverse e-commerce applications underscores their long-term value in personalized retail environments. Table 7 presents a detailed comparison of all the models discussed.

6.2. News and Media Recommendations

In news and media recommendation, models like T5 excel at distilling lengthy articles into clear, concise summaries, helping readers quickly grasp the main points before diving deeper [26]. Taking this further, RecPrompt introduces a self-adjusting prompting strategy that fine-tunes LLM behavior to better match user preferences, leading to more relevant and engaging news recommendations [140].

Multimodal systems are also playing a growing role in this space. For instance, MM-Rec leverages both text and image content to better understand user interests, which is particularly valuable for platforms where visuals are central to the experience [141]. In parallel, generative news recommendation (GNR) systems use LLMs to stitch together related articles, offering users a richer, more coherent view of unfolding stories [142].

Geolocation-aware models are redefining personalization in news recommendation by incorporating users’ spatial contexts. Rather than relying solely on explicit location tags, recent advances leverage LLMs to infer implicit geographical cues from content itself. For example, Katz et al. [143] demonstrate how large language models, augmented with knowledge graphs, can uncover local relevance within news articles. Their findings show that such LLM-based systems significantly improve the delivery of personalized, region-specific news—particularly benefiting users like travelers or those seeking local updates. As access to location-rich signals grows, these context-driven recommendations are poised to become a core component of intelligent news delivery systems. Table 8 presents a detailed comparison of all the models discussed.

6.3. Social Media and Content Personalization

In social media, models like CoLLM [71] combine collaborative filtering with frozen LLMs using a distillation bridge, achieving strong results on sparse datasets while preserving semantic depth.

RecLLM [144] brings interactivity to recommendations through multi-turn dialogues. It adjusts responses in real time based on user feedback, using memory and retrieval modules to refine suggestions in social conversations.

SocialRec [145] incorporates sentiment and community signals to tailor content around group-level dynamics. By clustering users based on emotional tone, it improves engagement with socially aligned recommendations.

Prospect [146] reimagines recommendation as agent-to-agent interaction, allowing recommender and content agents to co-learn in decentralized setups. It supports zero-shot personalization across domains like influencer and creator platforms.

LLM-BRec [147] blends BERT-based session modeling with LLM-driven user profiling to recommend in real time. It captures short-term interests and generalizes them across sessions, improving personalization in dynamic social feeds. Table 9 presents a detailed comparison of all the models discussed.

6.4. Educational Resources and Learning Recommendations

Large language models (LLMs) have shown significant promise in educational applications by enabling personalized, adaptive, and scalable learning recommendations. TutorLLM [148] leverages retrieval-augmented generation (RAG) and knowledge tracing to generate tailored learning resources based on learner performance history. Similarly, studies on LLMs in MOOCs [149] demonstrate that prompt-tuned models can outperform traditional recommenders in course suggestions.

RecMind [80] offers agent-based recommendation capabilities, adapting to evolving learner contexts in unstructured environments. E4SRec [150], tailored for structured learning settings, excels in curriculum-aligned scenarios using the BERT4Rec architecture. OpenP5 [151] enhances LLM-based feedback generation through an open-source framework, supporting fine-tuning and evaluation.

TALLRec [152] introduces a tuning-efficient recommendation pipeline to align Transformer-based models with long-term learning goals. These systems highlight the growing capacity of LLMs to support both structured and flexible learning paths. Table 10 presents a detailed comparison of all the models discussed.

6.5. Health and Lifestyle Recommendations

In health and wellness, LLMs enable personalized recommendations by synthesizing data from electronic health records, wearable devices, and user inputs.

GPT-4-based assistants recommend personalized diet plans, exercise routines, and mindfulness activities [35]. ClinicalBERT processes clinical notes and structured health data to recommend treatments, ensuring actionable insights for both patients and providers [85]. Real-time personalization with ICL dynamically adapts health recommendations based on evolving user needs [153]. XLNet4Rec [34] aligns exercise history with user goals to optimize recommendations for fitness applications, while PALR uses cross-attention mechanisms to enhance personalization [79]. CMS integrates wearable and clinical data to provide actionable healthcare recommendations [154]. Recent research demonstrates that large language models (LLMs), when infused with behavioral science frameworks such as COM-B, can provide effective conversational coaching to promote healthier lifestyles through personalized activity suggestions and empathetic dialogue [155]. LoRec emphasizes robustness in adversarial contexts, ensuring secure and consistent recommendations [156]. Table 11 presents a detailed comparison of all the models discussed.

7. Discussion

7.1. Comparative Analysis of Surveyed Works

This discussion synthesizes insights from 126 peer-reviewed and preprint studies published between 2018 and 2024. The aim is to provide a multidimensional overview of trends in LLM4Rec research, highlighting model types, domain applications, architectural paradigms, and methodological innovations.

7.1.1. Model Type Distribution

A large proportion of reviewed papers (see Table 12) explore Transformer-based models, including foundational encoders like BERT4Rec, generative GPT-style models like GPT4Rec, and emerging multimodal architectures such as CLIP and DALL·E for cross-modal fusion.

7.1.2. Domain-Specific Applications

A significant portion of LLM4Rec research is clustered in domains with dense interaction data, such as e-commerce and healthcare. However, educational and social applications are now increasingly represented due to demand for transparency, user control, and contextualization (see Table 13).

7.1.3. Paradigm Adoption Trends

This survey reveals a growing shift from traditional discriminative pipelines (ranking, CTR prediction) to generative and hybrid paradigms. This change is driven by the need for adaptive, conversational, and narrative-aware recommendation engines.

As seen in Figure 3, the surveyed papers fall into three paradigms:

Some 57% of papers use discriminative models (e.g., BERT4Rec, PALR), primarily designed for structured prediction tasks such as next-item recommendation, click-through rate (CTR) estimation, and user–item ranking. These models typically rely on supervised learning with large labeled interaction datasets. Most employ fine-tuning on domain-specific corpora (e.g., Amazon, MIND) and achieve high performance on metrics like NDCG, Recall@K, and AUC. However, their reliance on labeled training data and limited capacity for dynamic reasoning restricts adaptability across domains.
Some 35% leverage generative systems (e.g., GPT4Rec, RecMind) to handle open-ended, narrative, and conversational recommendation tasks. These models support zero-shot and few-shot learning scenarios, making them particularly useful in domains with sparse supervision or cold-start users/items. Generative models are commonly evaluated using BLEU, ROUGE, and diversity scores in addition to traditional metrics. Their strengths lie in producing personalized explanations, summarizing user history, and engaging in real-time dialogue, which are particularly valuable in media, entertainment, and education use cases.
Some 8% adopt hybrid frameworks that combine discriminative and generative reasoning (e.g., CoLLM, FLAN-Tuned Recs). These models often encode user–item interaction sequences using Transformers (e.g., BERT or XLNet), and then apply generative heads or prompt-based decoding layers to produce natural-language recommendations or justifications. Hybrid systems are also more likely to integrate multimodal data (e.g., visual content, reviews, metadata) and are evaluated with a combination of CTR, diversity, and human evaluation scores. They offer the best of both paradigms but introduce challenges in training pipeline complexity and latency.

7.1.4. Analysis and Emerging Themes

The comparative survey of LLM4Rec systems reveals several emerging patterns and thematic shifts in how large language models are being adapted for personalized recommendation. One of the most prominent developments is the growing importance of prompt engineering and instruction tuning. Models like FLAN and CoLLM exemplify this trend by enabling LLMs to perform domain-specific recommendation tasks without traditional full-scale fine-tuning. Instead, task adaptation is achieved through well-crafted natural language prompts or embedded soft prompts, allowing for rapid deployment in zero- or few-shot scenarios across diverse domains.

Another major trend is the evolution of multimodal and contextually enriched recommendation pipelines. Systems such as RLMRec and RecVAE++ demonstrate the power of integrating textual reviews, product metadata, and visual features to construct more comprehensive user–item representations. These multimodal architectures offer significant performance benefits, particularly in cold-start situations where behavioral signals are sparse. Context fusion techniques—often guided by attention mechanisms or contrastive objectives—allow models to dynamically align modality-specific signals with user preferences, enhancing both relevance and diversity.

Despite these advancements, scalability and latency remain persistent challenges. Generative systems like RecMind and GPT4Rec, although highly expressive and suitable for narrative generation, frequently struggle with the real-time responsiveness required by high-throughput applications such as Amazon or YouTube. The autoregressive nature of these models, coupled with their large parameter footprints, results in slower inference times, limiting their viability in low-latency production environments. Solutions such as model distillation, sparse activation (e.g., GLaM), and hybrid serving pipelines are beginning to address this bottleneck, but practical deployment at scale remains constrained.

The survey also uncovers significant domain gaps in current LLM4Rec research. E-commerce and healthcare dominate the literature, largely due to the availability of rich user interaction logs and text-based reviews. However, underexplored domains—such as educational technologies for low-resource regions, civic engagement platforms, and social good applications—offer untapped potential for the contextual, language-aware strengths of LLMs. These areas would particularly benefit from instruction-tuned models capable of handling sparse supervision, multilingual input, and dynamic user intent.

In summary, this work provides a cross-cutting view of LLM4Rec architectures, ranging from foundational Transformer encoders to scalable and multimodal frameworks. It benchmarks domain and paradigm trends through both quantitative and qualitative synthesis, and it highlights critical methodological gaps that can inform future research. These insights collectively demonstrate that the LLM4Rec landscape is evolving beyond traditional ranking pipelines into a broader ecosystem of explainable, adaptive, and cross-modal recommendation engines (see Table 14).

8. Conclusions and Future Research Directions

This survey offers a comprehensive and domain-agnostic synthesis of large language model-based recommender systems (LLM4Rec), encompassing architectural innovations, learning paradigms, benchmarking strategies, domain-specific deployments, and emerging challenges. We reviewed over 150 works published between 2018 and 2024, covering both peer-reviewed and preprint sources. A structured taxonomy was developed to categorize LLM4Rec systems based on model architecture (discriminative, generative, hybrid), Transformer backbones (e.g., BERT, GPT, T5), training methodologies, and application domains. Our comparative analysis highlights key advancements such as prompt engineering, instruction tuning, multimodal fusion, and retrieval-augmented generation. At the same time, the survey identifies limitations that persist in current systems, including latency bottlenecks, cold-start challenges, domain adaptation issues, and concerns related to fairness, interpretability, and data privacy. The authors contributed by designing the taxonomy, synthesizing performance benchmarks, and identifying gaps in both technical deployment and theoretical understanding.

Looking forward, future research in LLM4Rec should focus on improving model efficiency and deployability through techniques like knowledge distillation, sparse attention, and quantization. Cross-domain and cross-lingual generalization are critical for making recommendation systems more inclusive and globally accessible. Furthermore, the integration of continual learning, privacy-preserving computation (e.g., federated learning), and interpretable decision pipelines will be vital for enabling real-time, responsible, and adaptive personalization.

By consolidating the current landscape and proposing actionable insights, this work serves as a guiding reference for both academic researchers and industry practitioners aiming to advance LLM-powered recommendation systems.

Funding

This research was funded by Natural Sciences and Engineering Research Council of Canada: Discovery Grant.

Acknowledgments

I would like to express my deepest gratitude to my supervisor, Rasha Kashef, for her invaluable guidance, encouragement, and unwavering support throughout the course of this work. Her insightful feedback and expertise have been instrumental in shaping this research, and her mentorship has inspired me to strive for excellence.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 285–295.
Karypis, G. Evaluation of item-based top-N recommendation algorithms. In Proceedings of the 10th ACM International Conference on Information and Knowledge Management (CIKM), Atlanta, GA, USA, 5–10 November 2001; pp. 247–254.
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. Available online: https://aclanthology.org/N19-1423 (accessed on 15 November 2024).
Xu, Z.; Liu, X.; Liu, Z.; Huang, Z.; Xie, X. GPT4Rec: Graph Prompt Tuning for Streaming Recommendation. arXiv 2024, arXiv:2406.08229.
Hou, Y.; Zhang, S.; Lin, T.; Yu, W. Large Language Models are Zero-Shot Rankers for Recommender Systems. arXiv 2023, arXiv:2305.08845.
Sun, F.; Liu, J.; Wu, J.; Pei, C.; Lin, X.; Ou, W.; Jiang, P. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), Beijing, China, 3–7 November 2019; pp. 1441–1450.
Liu, C.; Li, W.; Zhang, Y.; Li, H.; Ji, R. Beyond Inter-Item Relations: Dynamic Adaptive Mixture-of-Experts for LLM-Based Sequential Recommendation. arXiv 2024, arXiv:2408.07427.
Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 2019, 52, 1–38.
Liang, D.; Krishnan, R.G.; Hoffman, M.D.; Jebara, T. Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 689–698.
Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 18–21 June 2009; pp. 452–461. Available online: https://arxiv.org/abs/1205.2618 (accessed on 15 November 2024).
Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71.
Wu, L.; Zheng, Z.; Qiu, Z.; Wang, H.; Gu, H.; Shen, T.; Qin, C.; Zhu, C.; Zhu, H.; Liu, Q.; et al. A Survey on Large Language Models for Recommendation. arXiv 2023, arXiv:2305.19860.
Resnick, P.; Varian, H.R. Recommender Systems. Commun. ACM 1997, 40, 56–58.
Adomavicius, G.; Tuzhilin, A. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Trans. Knowl. Data Eng. 2005, 17, 734–749.
Burke, R. Hybrid Recommender Systems: Survey and Experiments. User Model. User Adapt. Interact. 2002, 12, 331–370.
Koren, Y.; Bell, R.; Volinsky, C. Matrix Factorization Techniques for Recommender Systems. Computer 2009, 42, 30–37.
Min, B.; Ross, H.; Sulem, E.; Pouran Ben Veyseh, A.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey. ACM Comput. Surv. 2023, 56, 1–40.
Alyafeai, Z.; AlShaibani, M.S.; Ahmad, I. A Survey on Transfer Learning in Natural Language Processing. arXiv 2020, arXiv:2007.04239.
Putelli, L.; Gerevini, A.E.; Lavelli, A.; Mehmood, T.; Serina, I. On the Behaviour of BERT’s Attention for the Classification of Medical Reports. In Proceedings of the 3rd Italian Workshop on Explainable Artificial Intelligence (XAI.it 2022), Udine, Italy, 28 November 2022; Volume 3277, pp. 16–30. Available online: http://ceur-ws.org/Vol-3277/paper2.pdf (accessed on 15 November 2024).
Villatoro-Tello, E.; Parida, S.; Kumar, S.; Ghosh, S.; Solorio, T.; González, F.A.; Solano, L.R.; Molina, A.; López, A.; Villaseñor, L.; et al. Applying Attention-Based Models for Detecting Cognitive Processes and Mental Health Conditions. Cogn. Comput. 2021, 13, 1154–1171.
Eang, C.; Lee, S. Improving the Accuracy and Effectiveness of Text Classification Based on the Integration of the BERT Model and a Recurrent Neural Network (RNN_Bert_Based). Appl. Sci. 2024, 14, 8388.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. Available online: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 15 November 2024).
Rathje, S.; Mirea, D.M.; Sucholutsky, I.; Marjieh, R.; Robertson, C.E.; Van Bavel, J.J. GPT is an effective tool for multilingual psychological text analysis. Proc. Natl. Acad. Sci. USA 2024, 121, e2308950121.
Cai, Y.; Deng, Q.; Zhou, Y. Impact of GPT on the Academic Ecosystem. Sci. Educ. 2025, 34, 913–931.
Hua, S.; Jin, S.; Jiang, S. The Limitations and Ethical Considerations of ChatGPT. Data Intell. 2024, 6, 201–239.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J.; Bosma, M.; et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
Fuadi, M.; Wibawa, A.D.; Sumpeno, S. Adaptation of Multilingual T5 Transformer for Indonesian Language. In Proceedings of the 2023 IEEE 9th Information Technology International Seminar (ITIS), Surabaya, Indonesia, 6–7 October 2023; pp. 1–6.
Zayyanu, Z.M. Revolutionising Translation Technology: A Comparative Study of Variant Transformer Models—BERT, GPT, and T5. Comput. Sci. Eng. Int. J. 2024, 14, 15–27.
Tay, Y.; Dehghani, M.; Rao, J.; Fedus, W.; Abnar, S.; Chung, H.W.; Narang, S.; Yogatama, D.; Vaswani, A.; Metzler, D. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers. arXiv 2021, arXiv:2109.10686.
Fields, J.; Chovanec, K.; Madiraju, P. A Survey of Text Classification with Transformers: How Wide? How Large? How Long? How Accurate? How Expensive? How Safe? IEEE Access 2024, 12, 22860–22878.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237.
Moreira, G.D.S.P.; Rabhi, S.; Lee, J.M.; Ak, R.; Oldridge, E. Transformers4Rec: Bridging the Gap between NLP and Sequential/Session-Based Recommendation. In Proceedings of the 15th ACM Conference on Recommender Systems (RecSys ’21), Amsterdam, The Netherlands, 27 September–1 October 2021; pp. 143–153.
Wang, Y.; Zheng, J.; Li, Q.; Wang, C.; Zhang, H.; Gong, J. XLNet-Caps: Personality Classification from Textual Posts. Electronics 2021, 10, 1360.
Vij, N.; Yacoub, A.; Kobti, Z. XLNet4Rec: Recommendations Based on Users’ Long-Term and Short-Term Interests Using Transformer. In Proceedings of the 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 15–17 December 2023; pp. 647–652.
Zhang, H.; Shafiq, M.O. Survey of Transformers and Towards Ensemble Learning Using Transformers for Natural Language Processing. J. Big Data 2024, 11, 25.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
Abhishek, K. News Article Classification using a Transfer Learning Approach. In Proceedings of the 2022 10th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 13–14 October 2022; pp. 1–6.
Cai, F.; Ye, H. Chinese Medical Text Classification with RoBERTa. In Proceedings of the Biomedical and Computational Biology (BECB 2022), Kyoto, Japan, 15–17 December 2022; Volume 13637, pp. 223–236.
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942.
Choi, H.; Kim, J.; Joe, S.; Gwon, Y. Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks. arXiv 2021, arXiv:2101.10642.
Abdal, M.N.; Oshie, M.H.K.; Haque, M.A.; Rahman, S. A Robust Model for Effective Spam Detection Based on ALBERT. In Proceedings of the 2023 6th International Conference on Electrical Information and Communication Technology (EICT), Khulna, Bangladesh, 22–24 December 2023; pp. 1–6.
Zhang, X.; Ma, Y. An ALBERT-based TextCNN-Hatt hybrid model enhanced with topic knowledge for sentiment analysis of sudden-onset disasters. Eng. Appl. Artif. Intell. 2023, 123, 106136.
Petridis, C. Text Classification: Neural Networks VS Machine Learning Models VS Pre-trained Models. arXiv 2024, arXiv:2412.21022.
Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv 2020, arXiv:2003.10555.
Ni, S.; Kao, H.Y. ELECTRA is a Zero-Shot Learner, Too. arXiv 2022, arXiv:2207.08141.
Hao, Y.; Dong, L.; Bao, H.; Xu, K.; Wei, F. Learning to Sample Replacements for ELECTRA Pre-Training. arXiv 2021, arXiv:2106.13715.
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880.
Vinoda, D.; Yadav, P.K. SDBERT: SparseDistilBERT, a faster and smaller BERT model. arXiv 2022, arXiv:2208.10246. [Google Scholar]
Kumar, S.; Deep, S.; Kalra, P. Enhancing Customer Service in Banking with AI: Intent Classification Using DistilBERT. Int. J. Curr. Sci. Res. Rev. 2024, 7, 2706–2713. [Google Scholar] [CrossRef]
Benselloua, A.Y.M.; Messadi, S.A. Effective Malicious PowerShell Scripts Detection Using DistilBERT. In Proceedings of the 2023 IEEE Afro-Mediterranean Conference on Artificial Intelligence (AMCAI), Constantine, Algeria, 20–21 December 2023; pp. 166–171. [Google Scholar] [CrossRef]
Kusal, S.; Patil, S.; Gupta, A.; Saple, H.; Jaiswal, D.; Deshpande, V.; Kotecha, K. Sentiment Analysis of Product Reviews Using Deep Learning and Transformer Models: A Comparative Study. In Proceedings of the Artificial Intelligence: Theory and Applications, Pune, India, 12–14 March 2024; Volume 843, pp. 195–208. [Google Scholar] [CrossRef]
Salmony, M.Y.A.; Faridi, A.R. Bert Distillation to Enhance the Performance of Machine Learning Models for Sentiment Analysis on Movie Review Data. In Proceedings of the 2022 9th International Conference on Computing for Sustainable Global Development, New Delhi, India, 23–25 March 2022; pp. 400–405. [Google Scholar] [CrossRef]
Narayanan, D.; Shoeybi, M.; Casper, J.; Patwary, M.; LeGresley, P.; Korthikanti, V.; Rasley, J.; Rajbhandari, S.; Ruwase, O.; Zadeh, A.Y.; et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv 2021, arXiv:2104.04473. [Google Scholar]
Smith, S.P.; Patwary, M.; Norick, B.; LeGresley, P.; Rajbhandari, S.; Casper, J.; Liu, Z.; Prabhumoye, S.; Zoph, B.; Shoeybi, M.; et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. arXiv 2022, arXiv:2201.11990. [Google Scholar]
Xu, P.; Patwary, M.; Shoeybi, M.; Puri, R.; Fung, P.; Anandkumar, A.; Catanzaro, B. MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 16–20 November 2020; pp. 2831–2845. Available online: https://aclanthology.org/2020.emnlp-main.226/ (accessed on 15 January 2025).
Yang, S.; Li, X.; Liu, C.; Song, K.; Zhang, Y.; Wu, J.; Wang, Z.; Zhang, M.; Chen, Q.; Xu, Y.; et al. Common Sense Enhanced Knowledge-Based Recommendation with Large Language Model. In Proceedings of the 29th International Conference on Database Systems for Advanced Applications (DASFAA 2024), Taipei, Taiwan, 21–24 May 2024; Volume 14854, pp. 406–421. [Google Scholar] [CrossRef]
Zheng, Z.; Chao, W.; Qiu, Z.; Zhu, H.; Xiong, H. Harnessing Large Language Models for Text-Rich Sequential Recommendation. arXiv 2024, arXiv:2403.13325. [Google Scholar]
Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; Wu, H. ERNIE: Enhanced Representation through Knowledge Integration. arXiv 2019, arXiv:1904.09223. [Google Scholar]
Hou, Y.; Fu, G.; Sachan, M. Understanding Knowledge Integration in Language Models with Graph Convolutions. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 1374–1386. Available online: https://aclanthology.org/2022.findings-emnlp.102.pdf (accessed on 15 January 2025).
Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv 2021, arXiv:2006.03654. [Google Scholar]
Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv 2019, arXiv:1905.03197. [Google Scholar]
Keskar, N.S.; McCann, B.; Varshney, L.R.; Xiong, C.; Socher, R. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv 2019, arXiv:1909.05858. [Google Scholar]
Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. LaMDA: Language Models for Dialog Applications. arXiv 2022, arXiv:2201.08239. [Google Scholar]
Du, N.; Rao, A.; Kurian, J.; Catasta, M.; Hou, L.; Al-Rfou, R.; Xu, Y.; Chen, Z.; Narang, S.; Dai, Z.; et al. GLaM: Efficient Scaling of Language Models with Sparse Mixture-of-Experts. arXiv 2022, arXiv:2112.06905. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. arXiv 2021, arXiv:2102.12092. [Google Scholar] [CrossRef]
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. OpenAI Preprint. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 15 January 2025).
Zhang, Y.; Jin, Y. Navigating User Experience of ChatGPT-based Conversational Recommender Systems: The Effects of Prompt Guidance and Recommendation Domain. arXiv 2024, arXiv:2405.13560. [Google Scholar]
Fang, H.; Xu, G.; Long, Y.; Tang, W. An Effective ELECTRA-Based Pipeline for Sentiment Analysis of Tourist Attraction Reviews. Appl. Sci. 2022, 12, 10881. [Google Scholar] [CrossRef]
Zhang, Y.; Feng, Y.; He, X.; Li, Y.; Lu, P.; Shi, C.; Liang, Y.; Zhang, H.; Hu, Y.; Liu, Y.; et al. CoLLM: Integrating Collaborative Embeddings into Large Language Models for Recommendation. arXiv 2023, arXiv:2310.19488. [Google Scholar] [CrossRef]
Wang, M.; Hu, X.; Du, Y. Enhancing Recommender Systems Performance using Knowledge Graph Embedding with Graph Neural Networks. In Proceedings of the 2024 4th International Conference on Neural Networks, Information and Communication Engineering (NNICE), Tokyo, Japan, 18–20 March 2024; pp. 1–6. [Google Scholar] [CrossRef]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. Available online: https://aclanthology.org/P02-1040/ (accessed on 15 January 2025).
Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, 25 July 2004; pp. 74–81. Available online: https://aclanthology.org/W04-1013/ (accessed on 15 January 2025).
Abeysiriwardana, M.; Sumanathilaka, D. A Survey on Lexical Ambiguity Detection and Word Sense Disambiguation. arXiv 2024, arXiv:2403.16129. [Google Scholar]
Hu, J.; Xia, M.; Neubig, G.; Carbonell, J. Domain Adaptation of Neural Machine Translation by Lexicon Induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 298–304. Available online: https://arxiv.org/abs/1906.00376 (accessed on 15 January 2025).
Zhang, S.; Yao, L.; Sun, A.; Tay, Y. A Survey on Deep Learning-based Recommender Systems: From Collaborative Filtering to Content and Knowledge Aware Recommendation. IEEE Trans. Knowl. Data Eng. 2022, 34, 249–270. [Google Scholar] [CrossRef]
Li, H.; Wang, Q.; Liu, Y.; Yu, P.S. UPRec: User-aware Pre-training for Sequential Recommendation. AI Open 2023, 4, 137–144. [Google Scholar] [CrossRef]
Yang, F.; Chen, Z.; Jiang, Z.; Cho, E.; Huang, X.; Lu, Y. PALR: Personalization Aware LLMs for Recommendation. In Proceedings of the First Workshop on Generative Information Retrieval (Gen-IR) at SIGIR, Taipei, Taiwan, 23 July 2023; Available online: https://arxiv.org/abs/2305.07622 (accessed on 15 January 2025).
Wang, Y.; Jiang, Z.; Chen, Z.; Yang, F.; Zhou, Y.; Cho, E.; Fan, X.; Huang, X.; Lu, Y.; Yang, Y. RecMind: Large Language Model Powered Agent For Recommendation. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 4351–4364. [Google Scholar]
McAuley, J.; Pandey, R.; Leskovec, J. Inferring Networks of Substitutable and Complementary Products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Sydney, Australia, 10–13 August 2015; pp. 785–794. [Google Scholar] [CrossRef]
Guo, L.; Jin, J.; Zhang, H.; Zheng, Z.; Yang, Z.; Xing, Z.; Pan, F.; Niu, L.; Wu, F.; Xu, H.; et al. We Know What You Want: An Advertising Strategy Recommender System for Online Advertising. arXiv 2021, arXiv:2105.14188. [Google Scholar]
Li, X.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation Tasks. In Proceedings of the ACL, Online, 1–6 August 2021; pp. 4583–4597. Available online: https://arxiv.org/abs/2101.00190 (accessed on 15 January 2025).
Di Palma, D.; Biancofiore, G.M.; Anelli, V.W.; Narducci, F.; Di Noia, T.; Di Sciascio, E. Evaluating ChatGPT as a Recommender System: A Rigorous Approach. arXiv 2023, arXiv:2309.03613. [Google Scholar]
Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar]
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022; Available online: https://arxiv.org/abs/2106.09685 (accessed on 15 January 2025).
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059. [Google Scholar]
Zhao, Y.; He, X.; Wang, X.; Li, M.; Chua, T.S. Multi-modal Recommendations: Aligning Text and Metadata Embeddings. In Proceedings of the ACM Recommender Systems Conference (RecSys), Copenhagen, Denmark, 16–20 September 2019; Available online: https://example.com/multi-modal-recommendations (accessed on 15 January 2025).
Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
User Behavior Data from Taobao for Recommendation. Available online: https://tianchi.aliyun.com/dataset/649 (accessed on 15 January 2025).
Alibaba E-Commerce User Behavior Dataset. 2020. Available online: https://yongfeng.me/dataset/ (accessed on 2 September 2024).
Wu, F.; Qiao, Y.; Chen, J.-H.; Wu, C.; Qi, T.; Lian, J.; Liu, D.; Xie, X.; Gao, J.; Wu, W.; et al. MIND: A large-scale dataset for news recommendation. In Proceedings of the ACL, Online, 5–10 July 2020; pp. 3597–3606. Available online: https://aclanthology.org/2020.acl-main.331/ (accessed on 15 January 2025).
Gulla, J.A.; Zhang, L.; Liu, P.; Özgöbek, Ö.; Su, X. The Adressa dataset for news recommendation. In Proceedings of the International Conference on Web Intelligence, Leipzig, Germany, 23–26 August 2017; pp. 1042–1048. [Google Scholar] [CrossRef]
Iana, A.; Glavaš, G.; Paulheim, H. MIND Your Language: A Multilingual Dataset for Cross-Lingual News Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024), Washington, DC, USA, 14–18 July 2024; pp. 553–563. [Google Scholar] [CrossRef]
Harper, F.M.; Konstan, J.A. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst. 2016, 5, 19:1–19:19. [Google Scholar] [CrossRef]
Brost, B.; Mehrotra, R.; Niedermayer, T.; Li, C.; McInerney, J.; Bouchard, H.; Lalmas, M.; Pike, M. The Music Streaming Sessions Dataset. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; pp. 328–332. [Google Scholar] [CrossRef]
Covington, P.; Adams, J.; Sargin, E. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), Boston, MA, USA, 15–19 September 2016; pp. 191–198. [Google Scholar] [CrossRef]
Choi, Y.; Lee, Y.; Shin, D.; Cho, J.; Park, S.; Lee, S.; Baek, J.; Bae, C.; Kim, B.; Heo, J. EdNet: A Large-Scale Hierarchical Dataset in Education. In Proceedings of the 21st Int. Conf. Artificial Intelligence in Education (AIED), Ifrane, Morocco, 6–10 July 2020; pp. 69–73. Available online: https://arxiv.org/abs/1912.03072 (accessed on 15 January 2025).
Feng, M.; Heffernan, N.; Koedinger, K. Addressing the ASSISTments Dataset: Analyzing Student Performance for Adaptive Learning Recommendations. In Proceedings of the AIED, Brighton, UK, 6–10 July 2009. [Google Scholar]
Barnes, T.; Stamper, J.; Feng, M. KDD Cup 2010: Educational Data Mining Challenge for Knowledge Tracing Models. In Proceedings of the KDD Cup, Washington, DC, USA, 25 July 2010; Available online: https://pslcdatashop.web.cmu.edu/KDDCup/ (accessed on 15 January 2025).
Walonoski, T.; Walonoski, J.; Kramer, M.; Nichols, J.; Quina, A.; Moesel, C.; Hall, D.; Duffett, C.; Dube, K.; Gallagher, T.; et al. Synthea: Simulated Electronic Health Records Supporting Health-Related Recommendations. J. Am. Med. Inform. Assoc. 2018, 25, 230–237. [Google Scholar] [CrossRef]
Karisani, N.; Agichtein, E. Did you mean lung cancer? Identifying health-related misinformation in social media. In Proceedings of the SIGIR, Ann Arbor, MI, USA, 8–12 July 2018. [Google Scholar] [CrossRef]
Tewari, A.; Murphy, S.A.; Nahum-Shani, I. Personalized HeartSteps: A Reinforcement Learning Algorithm for Optimizing Physical Activity. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2021, 5, 1–24. [Google Scholar] [CrossRef]
Järvelin, K.; Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 2002, 20, 422–446. [Google Scholar] [CrossRef]
Xie, X.; Sun, F.; Yang, X.; Yang, Z.; Gao, J.; Ou, W.; Cui, B. Explore User Neighborhood for Real-time E-commerce Recommendation. arXiv 2021, arXiv:2103.00442. [Google Scholar]
Strubell, E.; Ganesh, A.; McCallum, A. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3645–3650. Available online: https://aclanthology.org/P19-1355/ (accessed on 15 January 2025).
Ding, D.; Mallick, A.; Wang, C.; Sim, R.; Mukherjee, S.; Rühle, V.; Lakshmanan, L.V.S.; Awadallah, A.H. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. arXiv 2024, arXiv:2404.14618. [Google Scholar]
Bolukbasi, T.; Chang, K.W.; Zou, J.; Saligrama, V.; Kalai, A. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. arXiv 2016, arXiv:1607.06520. [Google Scholar]
Kusner, M.J.; Loftus, J.; Russell, C.; Silva, R. Counterfactual fairness. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 4066–4076. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html (accessed on 15 January 2025).
Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. 2021, 54, 1–35. [Google Scholar] [CrossRef]
Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Papernot, N.; et al. Extracting Training Data from Large Language Models. In Proceedings of the 30th USENIX Security Symposium, Virtual Event, 11–13 August 2021; pp. 2633–2650. Available online: https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting (accessed on 15 January 2025).
Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html (accessed on 1 January 2025).
Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
Ie, E.; Hsu, C.W.; Mladenov, M.; Jain, V.; Narvekar, S.; Wang, J.; Wu, R.; Boutilier, C. RecSim: A Configurable Simulation Platform for Recommender Systems. arXiv 2019, arXiv:1909.04847. [Google Scholar]
Zhao, X.; Xia, L.; Zhang, L.; Ding, Z.; Yin, D.; Tang, J. Deep Reinforcement Learning for Page-wise Recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys), Vancouver, BC, Canada, 2–7 October 2018; pp. 95–103. [Google Scholar] [CrossRef]
Ren, X.; Wei, W.; Xia, L.; Su, L.; Cheng, S.; Wang, J.; Yin, D.; Huang, C. Representation Learning with Large Language Models for Recommendation. arXiv 2023, arXiv:2310.15950. [Google Scholar]
Zhou, Y.; Guo, J.; Sun, H.; Song, B.; Yu, F.R. Attention-guided Multi-step Fusion: A Hierarchical Fusion Network for Multimodal Recommendation. arXiv 2023, arXiv:2304.11979. [Google Scholar]
Manzoor, M.A.; Albarri, S.; Xian, Z.; Meng, Z.; Nakov, P.; Liang, S. Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 74. [Google Scholar] [CrossRef]
Faye, A.; Lebbah, M.; Bouchaffara, D. Lightweight Cross-Modal Representation Learning. arXiv 2024, arXiv:2403.04650. [Google Scholar]
Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the ICML, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. Available online: https://proceedings.mlr.press/v70/finn17a.html (accessed on 15 January 2025).
Liu, X.; Wang, R.; Sun, D.; Hakkani-Tur, D.; Abdelzaher, T. Uncovering Cross-Domain Recommendation Ability of Large Language Models. In Proceedings of the Companion ACM Web Conference 2025 (WWW Companion ’25), Sydney, NSW, Australia, 6–10 April 2025. [Google Scholar]
Kolb, T.E. Enhancing Cross-Domain Recommender Systems with LLMs: Evaluating Bias and Beyond-Accuracy Measures. In Proceedings of the 18th ACM Conference on Recommender Systems (RecSys ’24), Bari, Italy, 14–18 October 2024. [Google Scholar]
Lai, W.; Mesgar, M.; Fraser, A. Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 12–17 August 2024. [Google Scholar]
Jin, W.; Mao, H.; Li, Z.; Jiang, H.; Luo, C.; Wen, H.; Han, H.; Lu, H.; Wang, Z.; Li, R.; et al. Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation. arXiv 2023, arXiv:2307.09688. [Google Scholar]
Huang, Y.; Fan, C.; Li, Y.; Wu, S.; Zhou, T.; Zhang, X.; Sun, L. 1+1>2: Can Large Language Models Serve as Cross-Lingual Knowledge Aggregators? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Miami, FL, USA, 6–10 November 2024. [Google Scholar]
Yu, Z.; Wang, H.; Li, T. EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting. arXiv 2024, arXiv:2406.15758. [Google Scholar]
Expedera. Expedera NPUs Run Large Language Models Natively on Edge Devices. Expedera Blog 2024. Available online: https://www.expedera.com/blog/2024/01/08/expedera-npus-run-large-language-models-natively-on-edge-devices/ (accessed on 1 March 2025).
Binns, R.; Veale, M.; Van Kleek, M.; Shadbolt, N. ‘It’s Reducing a Human Being to a Percentage’: Perceptions of Justice in Algorithmic Decisions. In Proceedings of the CHI, Montreal, QC, Canada, 21–26 April 2018. [Google Scholar] [CrossRef]
Pariser, E. The Filter Bubble: What the Internet Is Hiding from You; Penguin Press: New York, NY, USA, 2011. [Google Scholar]
Jobin, A.; Ienca, M.; Vayena, E. The global landscape of AI ethics guidelines. Nat. Mach. Intell. 2019, 1, 389–399. Available online: https://www.nature.com/articles/s42256-019-0088-2 (accessed on 15 January 2025). [CrossRef]
Xu, W.; Xiao, J.; Chen, J. Leveraging Large Language Models to Enhance Personalized Recommendations in E-commerce. arXiv 2024, arXiv:2410.12829. [Google Scholar]
Zhao, Q.; Qian, H.; Liu, Z.; Zhang, G.D.; Gu, L. Breaking the Barrier: Utilizing Large Language Models for Industrial Recommendation Systems through an Inferential Knowledge Graph. arXiv 2024, arXiv:2402.13750. [Google Scholar]
Wang, M.; Guo, Y.; Zhang, D.; Jin, J.; Li, M.; Schonfeld, D.; Zhou, S. Enabling Explainable Recommendation in E-commerce with LLM-powered Product Knowledge Graph. arXiv 2024, arXiv:2412.01837. [Google Scholar]
Liu, Y.; Hu, M.; Sun, Y.; Ren, Y.; Tang, J. Conversational Recommender System and Large Language Model Are Made for Each Other in E-commerce Pre-sales Dialogue. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Available online: https://aclanthology.org/2023.findings-emnlp.643/ (accessed on 1 March 2025).
Xiang, Y.; Yu, H.; Gong, Y.; Huo, S.; Zhu, M. Text Understanding and Generation Using Transformer Models for Intelligent E-commerce Recommendations. arXiv 2024, arXiv:2402.16035. [Google Scholar]
Xu, X.; Zhou, Y.; Liu, Y.; Chen, H. Emerging Synergies Between Large Language Models and Machine Learning in E-commerce Recommendations. arXiv 2024, arXiv:2403.02760. [Google Scholar] [CrossRef]
Liu, D.; Yang, B.; Du, H.; Greene, D.; Hurley, N.; Lawlor, A.; Dong, R.; Li, I. RecPrompt: A Self-tuning Prompting Framework for News Recommendation Using Large Language Models. arXiv 2023, arXiv:2312.10463. [Google Scholar]
Wu, C.; Wu, F.; Qi, T.; Zhang, C.; Huang, Y.; Xu, T. MM-Rec: Visiolinguistic Model Empowered Multimodal News Recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22), Madrid, Spain, 11–15 July 2022; pp. 2560–2564. [Google Scholar] [CrossRef]
Gao, S.; Fang, J.; Tu, Q.; Yao, Z.; Chen, Z.; Ren, P.; Ren, Z. Generative News Recommendation. arXiv 2024, arXiv:2403.03424. [Google Scholar]
Katz, G.; Sitton, H.; Gonen, G.; Kaplan, Y. Beyond the Surface: Uncovering Implicit Locations with LLMs for Personalized Local News. arXiv 2025, arXiv:2502.14660. [Google Scholar]
Friedman, L.; Ahuja, S.; Allen, D.; Tan, Z.; Sidahmed, H.; Long, C.; Xie, J.; Schubiner, G.; Patel, A.; Lara, H.; et al. Leveraging Large Language Models in Conversational Recommender Systems. arXiv 2023, arXiv:2305.07961. [Google Scholar]
Irfan, R.; Khalid, O.; Khan, M.U.S.; Rehman, F.; Khan, A.U.R.; Nawaz, R. SocialRec: A Context-Aware Recommendation Framework With Explicit Sentiment Analysis. IEEE Access 2019, 7, 116295–116308. Available online: https://ieeexplore.ieee.org/abstract/document/8784145 (accessed on 17 December 2024). [CrossRef]
Zhang, K.; Yu, R.; Shen, Y.; Wang, T.; Zheng, Z.; Yu, P.S.; Xiong, H. Prospect: Personalized Recommendation on Large Language Model-based Agent Platforms. arXiv 2024, arXiv:2403.14468. [Google Scholar]
Jalan, R.; Prakash, T.; Pedanekar, N. LLM-BRec: Personalizing Session-based Social Recommendation with LLM-BERT Fusion Framework. In Proceedings of the 2nd Workshop on Generative Information Retrieval (Gen-IR) at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; Available online: https://openreview.net/forum?id=gwHVlTNKsG (accessed on 12 December 2024).
Li, Z.; Yazdanpanah, V.; Wang, J.; Gu, W.; Shi, L.; Cristea, A.I.; Kiden, S.; Stein, S. TutorLLM: Customizing Learning Recommendations with Knowledge Tracing and Retrieval-Augmented Generation. arXiv 2025, arXiv:2502.15709. [Google Scholar]
Ma, B.; Khan, M.A.Z.; Yang, T.; Polyzou, A.; Konomi, S. How Good Are Large Language Models for Course Recommendation in MOOCs? arXiv 2025, arXiv:2504.08208. [Google Scholar]
Li, X.; Zhang, C. E4SRec: An Elegant Effective Efficient Extensible Solution of Large Language Models for Sequential Recommendation. arXiv 2023, arXiv:2301.12345. [Google Scholar]
Hua, W.; Xu, S.; Ge, Y.; Zhang, Y. OpenP5: Open-source Platform for Personalized Prompt Pretraining and Prediction. 2023. Available online: https://github.com/agiresearch/OpenP5 (accessed on 17 December 2024).
Bao, K.; Wang, Q.; He, X. TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation. arXiv 2023, arXiv:2305.00447. [Google Scholar]
Chen, Z.; Li, X.; Fan, X. Real-Time Personalization for LLM-based Recommendation with Customized In-Context Learning. arXiv 2023, arXiv:2311.07985. [Google Scholar]
Ma, Z.; Chen, H.; Ren, Y.; Natarajan, S.; Shah, C.; Agichtein, E.; Sun, J. Transforming Wearable Data into Health Insights using Large Language Model Agents. arXiv 2024, arXiv:2406.06464. [Google Scholar]
Hegde, N.; Vardhan, M.; Nathani, D.; Rosenzweig, E.; Speed, C.; Karthikesalingam, A.; Seneviratne, M. Infusing behavior science into large language models for activity coaching. PLoS Digit. Health 2024, 3, e0000431. [Google Scholar] [CrossRef]
Zhang, X.; Liu, Q.; Li, X. LoRec: Combating Poisons with Large Language Model for Robust Sequential Recommendation. arXiv 2024, arXiv:2403.11860. [Google Scholar] [CrossRef]

Figure 1. Extended workflow of the LLM4Rec literature review process.

Figure 2. LLM4Rec architecture. A modular design integrating user–item embeddings, extended LLM token vocabularies, attention mechanisms (self- and cross-), and prompting strategies (soft and hard) to generate adaptive recommendations or explanations. Source: Author’s own design.

Figure 3. Adoption trends of LLM paradigms in recommendation: blue for discriminative, green for generative, and red-orange for hybrid approaches.

Table 5. Comparison of discriminative and generative paradigms in LLM4Rec. Source: Author’s own design.

Aspect	Discriminative Paradigm	Generative Paradigm
Task Focus	Classification, ranking, prediction	Open-ended recommendation, content generation
Adaptability	Requires domain-specific fine-tuning for best results	Supports zero-shot/few-shot scenarios using pre-trained knowledge
Key Strengths	High precision; effective in structured, supervised environments	Flexible; can handle multimodal and conversational settings
Training Dependency	High; needs labeled interaction data for fine-tuning	Moderate; can be used out-of-the-box or lightly fine-tuned
Evaluation Metrics	NDCG, Recall@K, CTR, AUC	BLEU, ROUGE, Diversity, Human Evaluation, CTR
Use Cases	Sequential recommendation, CTR prediction, ranking	Conversational agents, narrative-driven recommenders, explanations
Examples	BERT4Rec [6], PALR [79]	GPT4Rec [4], RecMind [80]
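
The ranking metrics listed for the discriminative paradigm in Table 5 can be illustrated with a minimal sketch. It assumes binary relevance (an item is either relevant or not) and a single user; the item identifiers and the function names `recall_at_k` and `ndcg_at_k` are hypothetical and are not taken from any of the surveyed systems.

```python
import math


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # Each hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking produced by a recommender and a hypothetical ground truth.
ranked = ["i3", "i7", "i1", "i9", "i4"]
relevant = {"i1", "i4", "i8"}

print("Recall@5:", recall_at_k(ranked, relevant, k=5))
print("NDCG@5:", ndcg_at_k(ranked, relevant, k=5))
```

In practice these per-user values are averaged over all test users. Graded relevance (for example, ratings) would need a gain term in place of the constant 1.0 used here.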

Table 6. Summary of technical challenges and solutions in LLM4Rec. Source: Author’s own design.

Challenge Area	LLM4Rec-Specific Issues	Possible Solutions
Cross-Domain and Cross-Language Adaptation	Domain-specific tuning needed [26]; multilingual degradation in low-resource settings [95]; integration issues between structured and unstructured data [100]	Domain adaptation [26], multilingual tuning with mBERT, XLM-R [95], sparse attention, cross-modal integration [100]
Semantic Gap Between NLP and Recommendation Tasks	LLMs lack inductive bias for sequential patterns and structured metadata [6]; require extensive fine-tuning for user–item modeling [85]	BERT4Rec-style modeling [6], hybrid models with tabular input [85], metadata-aware tokenization
Scalability and Compute Constraints	High training/inference cost (GPT-3/4) [87]; environmental impact [108]; real-time infeasibility of large models [91]	Distillation (DistilBERT) [91], LoRA/adapters [86], sparse attention, carbon-aware optimization [108]
Cold Starts and Data Sparsity	Limited data for new users/items [81], masked pretraining [81]	Meta-learning, transfer learning, few-shot recommendation, collaborative filtering-enhanced prompts
Real-Time Recommendation	Large models unsuitable for ms-latency scenarios [99]; inference delay in LLM pipelines [98]	Lightweight models, caching, asynchronous reranking [98,99], TinyGPT variants
Bias and Fairness	Training corpora bias (gender, race) [110]; fairness metrics not native to LLMs [112]	Adversarial debiasing [111], demographic parity, counterfactual augmentation [131]
Privacy, Transparency, and User Control	Privacy leakage from memorized data [113]; explainability limitations [114]	Differential privacy [113], SHAP/LIME [114], opt-out tools, preference settings
Societal and Ethical Impact	Polarization/echo chambers [132]; environmental and economic disparity [108]	Responsible AI frameworks [133], public audits, equitable compute access initiatives [108]
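
Demographic parity, listed in Table 6 among the possible fairness checks, can be sketched as a simple audit. This is only an illustration under stated assumptions: each user belongs to exactly one group, and a binary outcome records whether the user received at least one recommendation from some target category. The user identifiers, group labels, and function names are hypothetical.

```python
def positive_rate_by_group(outcomes, user_group):
    """Estimate P(outcome = 1 | group) for each user group."""
    totals, positives = {}, {}
    for user, outcome in outcomes.items():
        group = user_group[user]
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(outcome)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(outcomes, user_group):
    """Largest difference in positive rate between any two groups (0 = parity)."""
    rates = positive_rate_by_group(outcomes, user_group)
    return max(rates.values()) - min(rates.values())


# Hypothetical audit data: 1 means the user was shown the target category.
outcomes = {"u1": 1, "u2": 1, "u3": 0, "u4": 1, "u5": 0, "u6": 0}
user_group = {"u1": "A", "u2": "A", "u3": "A", "u4": "B", "u5": "B", "u6": "B"}

print(positive_rate_by_group(outcomes, user_group))
print("Parity gap:", demographic_parity_gap(outcomes, user_group))
```

A gap near zero indicates that the groups receive the target category at similar rates. It does not by itself show that the recommendations are equally relevant for each group.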

Table 7. LLM-enhanced retail recommendation models.

Model	Base Model	Limitations	Scalability	Latency Suitability	Multimodal	Training Cost/Eval
LLM-KERec [135]	GPT-style LLM + KG	Requires inferential KG construction; cold-start handling needs tuning	High	Moderate	No	Moderate/HR@10 = 0.678
LLM-PKG [136]	GPT-3 + product KG	Needs prompt engineering for graph reliability	Moderate	Moderate	Yes	Moderate/NDCG@10 = 0.652
CRS-LLM [137]	ChatGPT + CRS	Task split complexity in multi-agent flows	High	Moderate	No	Moderate/F1@Turn = 0.711
ChatGPT-Rank [134]	ChatGPT (API-based)	Latency limits real-time inference	High	Low	No	High/Recall@20 = 0.689
HybridLLMRec [139]	LLM + GBDT/MLP fusion	Requires ensemble tuning across modalities	High	Moderate	Yes	High/Mixed-metric

Table 8. News and media-focused LLM4Rec Models. Source: Author’s own design.

Model	Base Model	Limitations	Scalability	Latency Suitability	Multimodal	Training Cost/Eval
T5 [26]	T5	High pretraining cost	High	Moderate	No	High/ROUGE-L = 0.387
RecPrompt [140]	LLM (GPT-style)	Prompt tuning requires extensive validation	Moderate	Moderate	No	Moderate/NDCG@10 = 0.356
MM-Rec [141]	ViLBERT + BERT	Requires rich image–text alignment	Moderate	Low	Yes	Moderate/F1 = 0.78
GNR [142]	GPT-2	Limited support for real-time updates	Low	Low	Yes	High/BLEU = 0.22

Table 9. Social media-focused LLM4Rec models. Source: Author’s own design.

Model	Base Model	Limitations	Scalability	Latency Suitability	Multimodal	Training Cost/Eval
CoLLM [71]	Frozen LLM + Collaborative Embeddings	Needs collaborative history; frozen LLM restricts adaptability	High	Moderate	No	High/HR@10 = 0.642
RecLLM [144]	LLM + Retrieval-Augmented Dialogue	Requires high-quality user input for meaningful adaptation	Moderate	Moderate	No	High/Precision@5 = 0.601
SocialRec [145]	Context-Aware + Sentiment Classifier	Sentiment clustering may oversimplify user diversity	High	High	No	Moderate/F1 = 0.732
Prospect [146]	Agent-Based LLM Coordination	Complex multi-agent embedding alignment	High	Moderate	Yes	High/BLEU = 0.31
LLM-BRec [147]	BERT + LLM Fusion	Limited by session context length and user profile noise	Moderate	High	Yes	Moderate/Recall@20 = 0.684

Table 10. Education-focused LLM4Rec models. Source: Author’s own design.

Model	Base Model	Limitations	Scalability	Latency Suitability	Multimodal	Training Cost/Eval
E4SRec [150]	BERT4Rec	Limited to structured curricula	High	Moderate	No	Moderate/HR@10 = 0.681
OpenP5 [151]	GPT-2	Weak in unstructured setups	Moderate	Low	No	Moderate/Accuracy = 0.72
TALLRec [152]	GPT	Struggles with goal shifts	Moderate	Moderate	No	Moderate/Recall@10 = 0.645
TutorLLM [148]	RAG + KT	High inference latency	Moderate	Low	No	Moderate/F1 = 0.683
RecMind [80]	GPT + Agent	Inefficient at scale	Low	Low	No	Moderate/F1 = 0.684
LLMs4MOOCs [149]	GPT + Prompt	Dataset domain bias	Moderate	Moderate	No	Moderate/NDCG@10 = 0.688

Table 11. Healthcare-focused LLM4Rec models. Source: Author’s own design.

Model	Base Model	Limitations	Scalability	Latency Suitability	Multimodal	Training Cost/Eval
ClinicalBERT [85]	BERT	Limited to unstructured clinical notes	Moderate	Moderate	No	Moderate/AUC = 0.768
CMS [154]	BERT + Sensor Fusion	Needs high-quality wearables	Low	Low	Yes	High/F1 = 0.792
PALR [79]	Cross-Attention + BERT	Overfits on small datasets	Moderate	Moderate	No	Moderate/HR@10 = 0.658
BeCoLLM [155]	GPT-style LLM + Behavior Science	Needs behavioral context history	High	High	Yes	High/AUC = 0.812
ICL [153]	GPT-3	Inconsistent for frequent changes	Moderate	High	No	Moderate/F1 = 0.749
LoRec [156]	Transformer	Expensive adversarial training	Low	Low	No	High/AUC = 0.784
XLNet4Rec [34]	XLNet	Requires long exercise history	Moderate	High	No	High/NDCG@10 = 0.662
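
The Eval column in Tables 7 to 11 reports ranking metrics such as HR@10, Recall@20, and NDCG@10. As a reminder of how such values are commonly computed for a single user, the following is a minimal Python sketch. It assumes binary relevance and the usual log2-discounted form of NDCG; the surveyed papers may use other variants, and the function names here are illustrative rather than taken from any cited work.

```python
import math
from typing import Sequence, Set


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if at least one relevant item is in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k with a log2 position discount."""
    # Discounted gain of the relevant items actually retrieved.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ordering: all relevant items (up to k) placed first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (made-up item IDs): one user's ranked list and held-out items.
ranked_items = ["a", "b", "c", "d"]
held_out = {"b", "d"}
print(hit_rate_at_k(ranked_items, held_out, k=3))
print(recall_at_k(ranked_items, held_out, k=3))
print(round(ndcg_at_k(ranked_items, held_out, k=3), 4))
```

Reported figures are normally the mean of these per-user values over a test set, so numbers from different papers are comparable only when the cutoff k, the candidate set, and the data split match.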

Table 12. Distribution of LLM architectures in surveyed works. Source: Author’s own design.

Model Type	Percentage	Examples/Characteristics
Transformer-Based Encoders	38%	BERT4Rec, RoBERTa, and UniSRec; optimized for sequential and ranking tasks.
Generative Models	26%	GPT4Rec and RecMind; support conversational, narrative, and open-ended recommendations.
Multimodal Models	18%	CLIP and RLMRec; integrate visual, behavioral, and textual signals for richer context modeling.
Scalable Models	8%	Megatron-LM and Switch Transformer; leverage tensor and pipeline parallelism for deployment.
Prompt/Instruction-Tuned	10%	FLAN, DeBERTa, and CoLLM; enable rapid adaptation with few-shot or zero-shot prompts.

Table 13. Domain-wise distribution of LLM4Rec applications. Source: Author’s own design.

Domain	Percentage	Datasets and Notable Models
E-commerce	32%	Amazon, Taobao, and AliExpress datasets; use cases include cold-start prediction, multilingual personalization, and review-aware ranking (e.g., CoLLM, ChatGPT4Rec).
Healthcare	18%	ClinicalBERT, Synthea, and HealthTweets; focus on sensitive data handling, trust, and privacy-preserving recommendations.
Media & Entertainment	16%	Spotify, YouTube, and MovieLens; applications in sequential and real-time personalization (e.g., RecMind, GPT4Rec).
Education	12%	EdNet, KDD Cup, and ASSISTments; LLMs model student behavior, adaptive learning, and knowledge tracing.
Social Media	11%	Twitter, Reddit, and Instagram; LLMs support personalized content feeds, engagement prediction, and toxicity filtering.
News & Lifestyle	11%	MIND, Yahoo! News, HeartSteps; LLMs support context-aware, sentiment-driven, and multilingual recommendation tasks.

Table 14. Emerging themes in LLM4Rec. Source: Author’s own design.

Theme	Key Models/Approaches	Observations and Challenges
Prompt Engineering	FLAN, CoLLM	Enables zero-shot/few-shot adaptation with minimal fine-tuning. Highly flexible but sensitive to prompt formulation.
Multimodal Fusion	RLMRec, RecVAE++	Integrates text, visual, and behavioral data. Improves performance in cold-start scenarios. Requires alignment strategies.
Latency and Scalability	RecMind, GPT4Rec	Powerful but slow in real-time settings. Autoregressive decoding increases inference time. Distillation and sparse models emerging.
Underexplored Domains	Civic Tech, EdTech, Public Health	Limited research exists. LLMs could support low-resource, multilingual environments, especially with instruction-tuning.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

