Enhancing Course Recommendation with LLM-Generated Concepts: A Unified Framework for Side Information Integration

Tianyuan Yang; Baofeng Ren; Chenghao Gu; Feike Xu; Boxuan Ma; Shin’ichi Konomi

doi:10.3390/bdcc9120311

,

and

¹

Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka 819-0395, Japan

²

Faculty of Arts and Science, Kyushu University, Fukuoka 819-0395, Japan

^*

Author to whom correspondence should be addressed.

Big Data Cogn. Comput.2025, 9(12), 311;https://doi.org/10.3390/bdcc9120311

Version Notes

Order Reprints

Abstract

Massive Open Online Courses (MOOCs) have gained increasing popularity in recent years, highlighting the growing importance of effective course recommendation systems (CRS). However, the performance of existing CRS methods is often limited by data sparsity and suffers under cold-start scenarios. One promising solution is to leverage course-level conceptual information as side information to enhance recommendation performance. We propose a general framework for integrating LLM-generated concepts as side information into various classic recommendation algorithms. Our framework supports multiple integration strategies and is evaluated on two real-world MOOC datasets, with particular focus on the cold-start setting. The results show that incorporating LLM-generated concepts consistently improves recommendation quality across diverse models and datasets, demonstrating that automatically generated semantic information can serve as an effective, reusable, and scalable source of side knowledge for educational recommendations. This finding suggests that LLMs can function not merely as content generators but as practical data augmenters, offering a new direction for enhancing robustness and generalizability in course recommendation.

Keywords:

large language models; educational data mining; cold-start recommendation; data augmentation; concept generation

1. Introduction

With the rapid growth of Massive Open Online Courses (MOOCs) as a scalable and accessible alternative to traditional classroom-based education, learning opportunities have become more democratized than ever before. However, the overwhelming number of available courses has made it increasingly difficult for learners to identify those most aligned with their individual goals and interests [1]. To address this challenge, course recommendation systems have emerged as a necessary tool to reduce information overload and assist students in making informed and efficient course selections. Despite their effectiveness, most existing systems rely heavily on historical user–course interactions, which makes them particularly vulnerable to data sparsity and cold-start problems, especially for new users or newly introduced courses [2].

Incorporating side information, such as course concepts or topical knowledge, has been shown to improve both the accuracy and interpretability of recommendation systems, while also helping to mitigate challenges related to data sparsity and cold-start scenarios [3,4,5,6,7]. These approaches have demonstrated strong performance in modeling user preferences when rich auxiliary information is available. However, manually extracting such information is labor-intensive, time-consuming, and difficult to scale. Although some automatic extraction techniques have been explored, they often fall short in terms of quality and reliability [8,9].

Recent developments in educational AI have sparked growing interest in applying large language models (LLMs), such as GPT, to support learning-related tasks [10,11]. Thanks to their broad general knowledge and strong contextual reasoning capabilities, LLMs can summarize content, infer key topics, and generate concept-level representations from unstructured course descriptions. Motivated by these capabilities, our previous work [10,12] explored the use of LLMs to automatically extract core concepts from course materials. Through both qualitative and quantitative analyses, we found that the generated concepts were semantically coherent, aligned closely with course content, and in some cases even surpassed the quality of concepts derived from expert annotations. Despite this promise, the use of LLM-generated concepts remains largely underexplored in the context of recommender systems. In particular, it is still unclear whether such automatically generated concepts can serve as effective side information to improve recommendation performance, especially in data-sparse or cold-start scenarios. Although prior studies have shown that incorporating side information can improve recommendation accuracy, it remains unclear whether automatically generated concepts, rather than manually crafted or domain-specific features, can serve as effective side signals in course recommendation. This uncertainty is particularly critical in education, where scalable and transferable sources of side information are often limited. In this work, we take a different perspective: instead of focusing on architectural modifications, we systematically examine whether LLM-generated course concepts can enhance recommendation quality when integrated into classical and modern baselines. Specifically, we explore multiple fusion strategies to inject these semantic concepts into different families of recommendation models, and we analyze which integration methods yield the largest gains, why they work, and under what conditions they are most effective. By shifting the focus from model redesign to evaluating the utility of LLM-derived concepts as side information, our study provides new empirical evidence on the role of LLMs in educational recommendation and offers practical insights for improving accuracy in both cold-start and sparse-data scenarios. To this end, we raise a central research question: How and to what extent can LLM-generated course concepts, injected as model-agnostic side information via multiple integration strategies, improve recommendation quality across classical and modern recommender systems, particularly under sparsity and cold-start conditions?

In this work, we propose a general and lightweight framework for integrating LLM-generated course concepts as side information into various classic recommendation algorithms. Unlike prior approaches that require model-specific modifications, our framework is architecture-agnostic and supports multiple fusion strategies, allowing seamless adaptation to a wide range of recommender models. Importantly, our use of LLMs is confined to the data preparation stage, which ensures minimal computational overhead and high applicability in real-world settings. To evaluate the effectiveness and generalizability of our approach, we conduct extensive experiments on two publicly available MOOC datasets. We apply our framework to a variety of baseline models and observe consistent performance improvements, particularly in cold-start scenarios where side information is most beneficial. Additionally, we compare the impact of different LLMs on concept generation and downstream recommendation quality, providing further insights into how model choice influences effectiveness. Our results demonstrate that even without model architectural changes, LLM-generated concepts can serve as effective side information to enhance recommendation accuracy in a low-cost, scalable manner.

To summarize, our main contributions are as follows:

We propose a novel data augmentation approach that leverages LLM-generated course concepts as side information to improve recommendation performance. Our method operates solely at the data preparation stage, without involving the inference phase, ensuring low computational overhead.
We design a general and flexible integration framework that supports multiple fusion strategies and can be applied across a wide range of recommendation algorithms without modifying their architectures.
We conduct comprehensive experiments on two publicly available MOOC datasets and demonstrate that our approach consistently improves performance, particularly in cold-start scenarios.
We perform a comparative analysis of concepts generated by different LLMs and examine their downstream impact on recommendation effectiveness, providing practical insights into model selection and data quality.

2. Related Work

2.1. Course Recommendation

Early developments in course recommendation systems primarily relied on content-based methods and collaborative filtering (CF) approaches [13,14,15,16,17]. For example, Polyzou et al. [18] applied random walk-based techniques to capture sequential dependencies between courses, while Wagner et al. [19] leveraged traditional machine learning to identify course selection patterns and reduce dropout risks. With the rise of deep learning, a number of neural approaches have been proposed to improve recommendation performance. Gong et al. [20] introduced an attentional graph convolutional network to incorporate concept relationships. Zhang et al. [3] and Pardos et al. [21] explored hierarchical and connectionist models for better preference learning. Jiang et al. [22] modeled goal-based recommendation processes, while Yu et al. [23] proposed a hierarchical reinforcement learning framework that expanded course concepts in a multi-stage manner. Gao et al. [5,24] developed relation-graph and sequence-based models to capture complex interactions between students, exercises, and course content. More recently, there has been growing interest in improving explainability in course recommendation. Yang et al. [6] introduced KEAM, which constructs user profiles from course-specific knowledge graphs and leverages them to deliver explainable course recommendations. Building on this, MAECR [25] further incorporates multi-perspective meta-paths and dual-side modeling to capture both student preferences and course suitability.

Despite these advances, course recommendation remains uniquely challenging. Issues such as data sparsity, cold-start scenarios, and the lack of structured, interpretable representations of course content continue to hinder performance and user trust. While knowledge graphs and hand-crafted side information have been explored, they often require manual effort and are difficult to generalize across domains. In particular, few studies have investigated the use of automatically generated side information, such as course concepts produced by LLMs, to enhance course recommendations. These gaps motivate our work.

2.2. Concept Extraction

Identifying key concepts within educational content has long been considered essential for supporting student understanding and guiding course selection. The task of concept extraction has attracted considerable attention, particularly in the context of MOOCs and digital learning resources, which involves automatically identifying and representing important knowledge elements within a course. Early efforts in this area employed a variety of semi-supervised [26], embedding-based [8], and graph-driven approaches [27] to extract course concepts. Changuel et al. [28] leveraged prerequisite annotations to construct concept sequences. Yu et al. [29] incorporated external knowledge bases and interactive refinement. While effective to a certain extent, these methods often suffer from scalability issues, over-reliance on textual inputs, limited generalizability, and high computational costs due to complex model designs. More recent work has begun exploring the integration of LLMs for enhancing concept identification [10,11,30]. These models offer the ability to infer semantically coherent and context-aware concepts from raw course descriptions without relying on structured annotations. Our previous research [12] has shown that LLM-generated concepts can achieve high semantic quality and strong alignment with course content.

However, most existing studies treat concept extraction as a standalone task and do not explore how LLM-generated concepts can be used in downstream educational tools. To the best of our knowledge, no prior work has investigated the use of these automatically generated concepts as side information to improve course recommendation systems—a gap we aim to address in this study.

2.3. Large Language Models in Education

LLMs such as GPT, Gemini, and Claude, pre-trained on vast text corpora, exhibit strong general knowledge and reasoning capabilities. These models have achieved remarkable performance in tasks such as machine translation, text summarization, and question answering [12,31]. Their emergence has also opened new opportunities in education, including automated content generation, personalized learning experiences, and enhanced educational tools [32].

A growing body of research has explored the application of LLMs in educational contexts. For instance, GPT has been studied for generating course-aligned concepts to improve interoperability [11], assisting in qualitative codebook development [33], and producing feedback for counselor training [34]. Lohr et al. [35] employed retrieval-augmented generation (RAG) to generate semantically annotated quiz questions tailored to specific courses, while Kieser et al. [36] demonstrated that ChatGPT-4 can simulate realistic student responses for concept inventories, enabling scalable data augmentation in physics education. These studies show that LLMs can generate pedagogically relevant content that supports teaching and assessment. In parallel, several works have begun integrating LLMs into course recommendation systems [37]. Coursera-REC [38] combines LLMs with RAG to generate personalized MOOC recommendations with natural-language explanations. Meanwhile, CR-LCRP [39] utilizes multi-granularity data augmentation within a heterogeneous learner–course network to improve recommendations under sparse conditions.

Despite these advancements, most existing work treats LLMs as generation engines for full recommendations or textual feedback, rather than as tools for producing structured, reusable side information. In particular, the use of LLM-generated course concepts—as semantic signals that can be systematically integrated into conventional recommendation models—remains largely unexplored. Our work fills this gap by leveraging LLMs to generate high-quality concept-level features that are integrated into a recommendation framework, enabling data-efficient enhancement without architectural modification.

2.4. LLM-Based Data Augmentation in Recommender Systems

Data augmentation has emerged as a critical strategy for improving model robustness and mitigating sparsity in recommender systems by enriching training data with additional signals [40,41]. Recently, LLMs have emerged as a promising tool for data augmentation due to their generative capabilities and access to rich contextual knowledge. For example, ColdLLM [42] and LLMRec [43] generate synthetic user–item interactions or expand graph-based relations to strengthen learning under sparse conditions. Wang et al. [44] rely on LLMs to produce pairwise comparison preferences or aspect-level attributes that enhance downstream models through preference modeling or representation learning. These approaches typically focus on simulating user behaviors or preferences to create additional interaction signals.

While these methods have shown promising results, they predominantly use LLMs to simulate behavior-level data, such as clicks, preferences, or rankings, thereby augmenting interaction information. In contrast, few studies have investigated the use of LLMs to generate structured side information that can be directly injected into recommendation models. Our work addresses this gap by employing LLMs to extract course-level concepts from textual descriptions and using these concepts as semantic features to improve recommendation performance. This approach offers a lightweight and architecture-agnostic form of augmentation that is especially valuable in educational recommendation scenarios.

3. Proposed Framework

In this section, we present a concept-based data augmentation framework designed to enhance course recommendations by incorporating side information automatically generated by large language models (LLMs). The framework is motivated by the observation that concept-level information can provide valuable semantic signals beyond interaction data, especially in scenarios where user behavior is sparse or unavailable. Instead of relying on manually annotated features or domain-specific extraction pipelines, our method utilizes LLMs to infer relevant course concepts from available course-related information. These concepts are then encoded into dense semantic vectors using a pre-trained encoder and injected into various recommendation models through a set of lightweight, architecture-independent integration strategies. The overall framework is designed to be general, efficient, and compatible with a wide range of existing algorithms. This section introduces the core components of our approach, including an overview of the framework, the concept generation process, and the integration mechanisms.

3.1. Overview of the Framework

As illustrated in Figure 1, the framework consists of three core components: Concept Generation, Concept Embedding, and Side Information Integration. It is worth noting that the user embeddings shown in Figure 1 are obtained from the users’ historical interaction data. Each baseline recommender model (e.g., MF, NeuMF, LightGCN, FM, ItemKNN) computes these embeddings through its own standard parameterization process. For instance, matrix factorization learns user latent vectors by decomposing the user–item interaction matrix, while graph-based models like LightGCN aggregate neighbor signals to form user representations. In our framework, these user embeddings remain unchanged and serve as the basis for integrating the additional concept-level side information through the strategies described in Section 3.3.

Figure 1. Overview of the proposed framework for integrating LLM-generated course concepts into recommendation models.

In the first stage, LLMs are used to infer a set of relevant concepts for each course based on its available information. These concepts are intended to reflect the key topics or knowledge areas covered by the course. In the second stage, the generated concepts are transformed into semantic vector representations using a pre-trained embedding model, enabling them to capture rich contextual meaning. In the final stage, the resulting concept embeddings are integrated into various baseline recommendation models through multiple model-agnostic fusion strategies, including linear addition, attention-based weighting, gating mechanisms, concatenation, and similarity-based integration. Notably, this integration process does not require any architectural modifications to the underlying models.

An important feature of our framework is that the LLM is used only during the data preparation stage. Once the concepts and their embeddings are obtained, they can be reused across different recommendation algorithms without repeated LLM inference, resulting in low computational overhead and strong scalability. We detail the concept generation and embedding procedure in Section 3.2, and introduce the integration strategies in Section 3.3.

3.2. LLM-Based Concept Generation & Embedding

3.2.1. Concept Generation with LLMs

To generate course-level concept information, following our previous work [12], we utilize LLMs to produce structured semantic concepts based on course-related input. This process is guided by a prompt design that systematically instructs the LLM on the generation task. As illustrated in Figure 2, each prompt for both datasets is composed of three components: task description, format indicator, and information injection.

Figure 2. Prompt design for guiding LLM-based course concept generation across different datasets, consisting of task description, format indicator, and information injection.

The task description defines the objective of the generation. In our case, we explicitly specify that the LLM is expected to generate a set of course-relevant knowledge concepts. This instruction helps orient the model toward identifying topical or thematic keywords that capture the core content of the course. The format indicator defines the expected structure of the output, including the number of concepts and the formatting style. Standardizing the output format is crucial for ensuring that the generated results can be parsed and processed consistently. To enforce format compliance, we implement a retry mechanism that automatically re-prompts the LLM if the response deviates from the predefined structure. In our implementation, we require the LLM to generate at least 30 concepts for each course. The information injection component provides course-specific details that serve as context for the generation task. The content of this section varies depending on the dataset. In particular, the MOOCCube dataset contains rich metadata, including the course title, a full course description, and a list of manually annotated concepts. These components are concatenated and provided to the LLM as contextual input. In contrast, the XuetangX dataset contains only course titles, which serve as the sole input. This difference is not due to design choice but reflects the inherent data availability across the two datasets. Formally, let

x_{i}

denote the course-related input (e.g., course title, course description), and let

P (x_{i})

represent the constructed prompt, as illustrated in Figure 2. The LLM processes this prompt to generate the

c o u r s e_{i}

’s concept set

C_{i} = {c_{1}, c_{2}, \dots, c_{k}}

, which can be systematically obtained as follows:

C_{i} = LLM (P (x_{i})),

where the output set

C_{i}

is required to contain at least 30 well-structured concept terms, in accordance with the format constraints specified in the prompt template. To maintain generality and avoid fine-tuning, we adopt a one-shot prompting strategy across all settings. The one-shot design allows the model to infer the structure of the desired output based on a single example included in the prompt. This choice balances performance and simplicity while ensuring that the generation process remains model-agnostic and scalable. Using this prompt formulation, the LLM produces a rich and diverse set of candidate concepts for each course, forming a semantically meaningful basis for downstream embedding and integration. To ensure completeness, we briefly clarify how we validated and used the generated concepts. Building on our prior study [12], which systematically evaluated GPT-generated course concepts on the same two datasets (MOOCCube and XuetangX), we rely on previously established evidence that such concepts are semantically coherent, aligned with course topics, and—in several cases—closer to learner-perceived relevance than legacy “ground-truth” tags. Importantly, ref. [12] also demonstrated that even when the input is limited to concise course titles (as in XuetangX), minimal-context prompting still yields consistent and high-quality concept sets. In this work, we intentionally avoid heavy post-processing: apart from format checking and de-duplication, the concepts are used as generated. We do not assume that LLM-generated concepts are clean, perfectly aligned, or low-noise; in fact, large language models may produce substantial numbers of imperfect, off-topic, or hallucinated terms. Human-readability alone does not imply semantic correctness or interpretability, and concept-level structures produced by LLMs do not constitute reliable reasoning paths. Although our experiments show that the aggregate semantic signal can still yield measurable improvements in accuracy, this should not be interpreted as evidence that hallucination is negligible. Rather, the generated concepts in this study function strictly as auxiliary semantic features, whose reliability is further discussed and mitigated through the strategies presented in Section Hallucination Risks and Mitigation Strategies. We do not evaluate, claim, or imply interpretability of LLM-generated concepts.

3.2.2. Semantic Embedding of Concepts

Once the concept set

C = {c_{1}, c_{2}, \dots, c_{n}}

has been generated for a given course, each concept needs to be transformed into a numerical vector that can be processed by recommendation models. This step bridges the gap between symbolic natural language concepts and dense feature representations commonly used in machine learning systems. By embedding the concepts into a continuous semantic space, we enable their integration as structured side information across various model architectures. A wide range of techniques has been developed for textual and knowledge-based embeddings. Traditional word-level embeddings such as GloVe [45] and Word2Vec [46] capture co-occurrence statistics from large corpora. Contextual language models, such as BERT [47], provide dynamic representations that are sensitive to input context. Additionally, knowledge graph embedding methods like TransE [48], TransR [49], and Concept2Vec [50] are capable of modeling structured relationships between entities. Beyond these, general-purpose text embedding models such as InferSent [51], Universal Sentence Encoder [52], and Sentence-BERT [53] have demonstrated strong performance in mapping variable-length texts into semantic vectors suitable for downstream applications such as retrieval, classification, and inference tasks.

In this work, we employ the OpenAI text-embedding-3-large model to encode each concept into a fixed-dimensional vector. This model offers a general-purpose semantic encoder that maps input strings, such as words or short phrases, into 3072-dimensional dense embeddings. Specifically, we input the name of each concept into the encoder to obtain a word-level semantic representation. All embeddings are computed using the default settings of the encoder, without any fine-tuning. Prior to encoding, the input text is lowercased and whitespace-normalized, and duplicate concepts within each course are removed. Each processed concept

c_{i} \in C

is then transformed into a 3072-dimensional semantic vector, denoted as:

e_{i} = f (c_{i}), e_{i} \in R^{3072}

where

f (\cdot)

represents the embedding function implemented by the text-embedding-3-large encoder model. To obtain a course-level embedding, we aggregate all concept embeddings associated with the course. Let

E = {e_{1}, \dots, e_{n}}

be the set of embeddings for concepts in

C

. We compute the aggregated representation as:

e^{c o u r s e} = Pool (\{e_{1}, \dots, e_{n}\}),

(1)

where

Pool (\cdot)

is a pooling function such as element-wise mean, sum, or max. In our main experiments, we adopt mean pooling due to its simplicity, effectiveness, and model-agnostic nature. The resulting course-level concept embedding

e^{c o u r s e}

provides a compact and semantically enriched representation of the course. It can be easily incorporated into various recommendation models as additional side information. All concept embeddings are precomputed and cached, allowing for efficient reuse across models and experimental configurations, thereby supporting scalability and low inference cost. Importantly, since the course-level embedding is derived from a diverse set of semantic concepts, it encapsulates rich topical information that complements interaction-based signals, enabling downstream models to make more informed and context-aware recommendations.

3.3. Side Information Integration Strategies

After deriving course-level representations by aggregating LLM-generated concept embeddings, we aim to incorporate this external semantic signal into various components of the recommendation pipeline. To achieve this, we propose a unified and model-agnostic framework for side information integration, specifically tailored to concept-level knowledge in this study. The proposed framework is designed to be modular and extensible, allowing it to seamlessly augment existing recommendation models without requiring architectural modifications. It supports multiple integration strategies, each enabling a different way to inject semantic knowledge into the pipeline while preserving compatibility with standard model structures. Detailed information fusion strategies are shown in Figure 3. We categorize the fusion strategies into three types based on how the concept-level embeddings participate in the recommendation process:

Figure 3. Illustration of the three concept integration strategies: representation-level fusion, concatenation-based fusion, and structural-level fusion.

Representation-level Fusion, where concept embeddings are combined with course embeddings to generate a new semantic representation used in prediction;
Concatenation-based Fusion, where concept embeddings are treated as additional input features and concatenated with user and item embeddings before modeling;
Structural Fusion, where concept information is used to enhance similarity measures or graph structures without directly modifying item representations.

Furthermore, for each baseline model, we adopt multiple fusion variants to assess the effectiveness of different strategies under a unified experimental setting. It is important to note that each fusion category encompasses several concrete implementations (e.g., linear, gated, or attention-based fusion under the representation-level strategy), reflecting the flexibility and extensibility of our integration framework.

3.3.1. Representation-Level Fusion

Representation-level fusion refers to the integration of LLM-derived concept embeddings into the course representation space in a way that directly replaces or transforms the original item embedding used in prediction. Specifically, for a given user

u_{i}

and course

c_{j}

, let

e_{u_{i}}

denote the user embedding learned from interaction data,

e_{c_{j}}

the ID-based course embedding also learned from interaction data, and

e_{j}^{c o u r s e}

the concept-level embedding of course

c_{j}

, which is obtained via pooling over all its associated concept vectors in Concept Embedding component as mentioned in Equation (1). A fusion function

F

is applied to combine

e_{c_{j}}

and

e_{j}^{c o u r s e}

, resulting in a fused representation

{\hat{e}}_{c_{j}}

. The final prediction score is then computed as:

{\hat{e}}_{c_{j}} = F (e_{c_{j}}, e_{j}^{c o u r s e}),

{\hat{y}}_{i, j} = f (e_{u_{i}}, {\hat{e}}_{c_{j}}),

where

f (\cdot)

is the model-specific interaction function, such as a dot product or neural scoring module. We explore several instantiations of the fusion function

F

, each providing different inductive biases and levels of adaptability. The most straightforward approach is linear fusion, where a scalar weight

α \in [0, 1]

controls the contribution of the two embeddings:

{\hat{e}}_{c_{j}} = α e_{c_{j}} + (1 - α) e_{j}^{c o u r s e} .

While effective, this method applies the same mixing ratio across all items and dimensions. To allow more expressive control, we adopt a gated fusion mechanism inspired by gated neural networks. A dimension-wise gating vector

z_{i} \in {[0, 1]}^{d}

is computed as:

z_{j} = σ (W_{1} e_{c_{j}} + W_{2} e_{j}^{c o u r s e} + b),

{\hat{e}}_{c_{j}} = z_{j} ⊙ e_{c_{j}} + (1 - z_{j}) ⊙ e_{j}^{c o u r s e},

where

σ (\cdot)

is the sigmoid activation function and ⊙ denotes element-wise multiplication. This formulation enables per-dimension control over the contribution of each information source. Furthermore, we implement an attention-based fusion mechanism that dynamically assigns soft weights to the two embeddings based on their joint representation, as previously defined in this section. A query vector

q_{j}

is constructed to attend to the two information sources. Depending on the model architecture,

q_{j}

can be derived from either the user embedding

e_{u_{i}}

(e.g., in collaborative models) or the item embedding

e_{c_{j}}

itself (e.g., in graph-based models). In general, we define it as follows:

q_{j} = W_{q} \cdot x_{j},

where

x_{j} \in R^{d}

is a context embedding and

W_{q}

is a trainable parameter. The attention mechanism uses two key–value pairs constructed as follows:

k_{1} = W_{k 1} \cdot e_{c_{j}}, k_{2} = W_{k 2} \cdot e_{j}^{c o u r s e},

v_{1} = W_{v 1} \cdot e_{c_{j}}, v_{2} = W_{v 2} \cdot e_{j}^{c o u r s e},

where all transformation matrices are learnable, scaled dot-product attention is used to compute the weights:

α_{j} = \frac{exp (\frac{q_{j}^{⊤} \cdot k_{j}}{\sqrt{d}})}{\sum_{m = 1}^{2} exp (\frac{q_{j}^{⊤} \cdot k_{m}}{\sqrt{d}})}, j \in {1, 2} .

The final fused course representation is then given by:

{\hat{e}}_{c_{j}} = α_{1} \cdot v_{1} + α_{2} \cdot v_{2} .

This dynamic fusion strategy enables the model to softly adjust its reliance on structural versus semantic information depending on the interaction context, offering a more flexible and adaptive representation space. Overall, representation-level fusion serves as a general and model-agnostic mechanism that enables seamless semantic enrichment of course embeddings, making it readily applicable to a wide range of recommendation architectures without modifying their interaction functions.

3.3.2. Concatenation-Based Fusion

Concatenation-based fusion incorporates concept-level information by appending it to existing user and item representations prior to prediction. Unlike representation-level fusion, which replaces or transforms the original course embedding, this strategy preserves the base embeddings and treats the concept embedding as an auxiliary feature vector. The combined vector is then passed to the downstream scoring module, such as a factorization machine or multilayer perceptron. This fusion strategy is model-agnostic and applies uniformly across a wide range of recommendation architectures. The core idea is to concatenate the concept-level embedding

e_{j}^{c o u r s e}

with either the ID-based course embedding

e_{c_{j}}

, the user embedding

e_{u_{i}}

, or both. When the embedding dimensions differ, the concept vector can be projected or preprocessed to align with the target latent space. The fused input can then be formally expressed as:

x_{i, j} = Concat (e_{u_{i}}, e_{c_{j}}, e_{j}^{c o u r s e}),

{\hat{y}}_{i, j} = f (x_{i, j}),

where

f (\cdot)

is the model-specific prediction function. Beyond using continuous embeddings, this strategy also allows incorporating concept-level information in structured or categorical formats, particularly in models that are designed to handle high-dimensional or heterogeneous input features. For instance, factorization-based methods and hybrid recommenders often support flexible feature injection. In this context, we explore several complementary variants that encode concepts as:

(i): One-hot vectors, where each concept is assigned a unique binary feature, and the entire concept set is represented as a sparse binary vector. This enables the model to learn individual interactions between users, items, and specific concepts, and is particularly suited to models that support high-dimensional sparse input, such as factorization machines.
(ii): Signal-separated inputs, where the concept embedding is not simply concatenated with item embeddings but passed through a separate modeling pathway (e.g., an independent FM component). This allows the model to capture interactions that are specific to the concept signal without conflating it with ID-based features.
(iii): Cluster-level categorical features, where semantically similar concepts are grouped into clusters and represented as discrete IDs. These cluster IDs are then embedded and concatenated with existing input vectors, allowing the model to capture group-level semantics while reducing dimensionality and noise.

All of these follow the same fusion principle, enriching the input space by directly concatenating semantic features, while offering flexibility in how concept information is encoded. In summary, concatenation-based fusion provides a unified and extensible mechanism for incorporating concept-level signals into recommendation models. By treating semantic knowledge as additional input features, this strategy enables the model to learn more expressive and flexible representations without modifying the interaction structure.

3.3.3. Structural Fusion

Structure-level fusion introduces concept-level information by enhancing the underlying structures used in recommendation, such as item similarity matrices or graph connectivity. Instead of altering embedding representations, this strategy modifies how information flows between items, thereby capturing semantic relationships not evident from interaction data alone. We begin by constructing a semantic-aware similarity matrix

\tilde{S}

that combines interaction and concept-level semantic information. For any pair of courses

c_{j}

and

c_{k}

, the fused semantic-aware similarity can be defined as:

{\tilde{S}}_{j, k} = λ \cdot {Sim}_{i n t e r a c t i o n} (c_{j}, c_{k}) + (1 - λ) \cdot {Sim}_{c o n c e p t} (c_{j}, c_{k}),

(2)

where

{Sim}_{i n t e r a c t i o n}

is derived from user interaction signals (e.g., cosine similarity between implicit feedback vectors), and

{Sim}_{c o n c e p t}

is computed based on the cosine similarity between their concept-level embeddings

e_{j}^{c o u r s e}

and

e_{k}^{c o u r s e}

. The balance parameter

λ \in [0, 1]

controls the relative importance of behavioral versus semantic similarity. The resulting similarity matrix

\tilde{S}

serves as a unified structural representation. In neighborhood-based methods, it can guide similarity-weighted prediction. In graph-based models, it forms the basis of an augmented item–item adjacency matrix

\tilde{A}

which supports more informed message propagation. To ensure sparsity and preserve computational efficiency, we optionally filter out weak semantic connections by introducing a thresholding mechanism. The final adjacency matrix

\tilde{A}

can be obtained as follows:

{\tilde{A}}_{j, k} = \{\begin{matrix} {\tilde{S}}_{j, k}, & if {Sim}_{concept} (c_{j}, c_{k}) \geq θ, \\ 0, & otherwise . \end{matrix}

where

θ

is a hyperparameter controlling semantic edge density, semantic connections are retained only when concept similarity exceeds a predefined threshold. The resulting matrix

\tilde{A}

is then used to guide message passing or similarity-weighted aggregation in the downstream recommendation model. This enriched structure allows the model to propagate signals through both observed interactions and concept-level semantic relations. As a result, items that share similar topics or knowledge domains become structurally connected, which improves generalization in sparse or cold-start settings.

Taken together, we introduce three strategies for integrating concept-level side information: representation-level, concatenation-based, and structure-level fusion. Each strategy operates at a different stage of the modeling pipeline, including representation, input, and structure. All are supported under a unified, model-agnostic framework that ensures compatibility across collaborative filtering, factorization-based, and graph-based methods. This flexibility allows the framework to serve as a general optimization layer for diverse recommendation models. Importantly, the use of LLM-derived concept embeddings provides rich semantic signals at low cost, without requiring labeled data or domain-specific engineering, which makes the approach both scalable and practical for real-world educational applications.

4. Experimental Setup

4.1. Datasets

In the context of this work, we focus on the scenario of course recommendation within an MOOC environment. We evaluate our framework on two publicly available MOOC datasets: MOOCCube [54] and XuetangX [3], which differ in scale and the richness of course metadata. This diversity enables us to assess the robustness of our concept-based augmentation approach across both high- and low-resource scenarios. MOOCCube is a large-scale dataset focused on online education, containing 706 real-world MOOC courses, 199,199 users, and 672,853 user–course interaction records. Each course is accompanied by a title, a full course description, and a set of manually annotated concepts. These metadata elements make MOOCCube a high-resource dataset suitable for testing the full capacity of our concept generation and integration pipeline. XuetangX, by contrast, is derived from a real-world MOOC platform and includes 82,535 users, 1302 courses, and 458,453 interaction records. However, it provides only course titles without additional metadata such as descriptions or expert-curated concepts. This makes XuetangX a representative low-resource setting for evaluating the generalizability of our method under limited textual input. To ensure data quality and consistency, we apply several preprocessing steps to both datasets. Specifically, we remove courses with fewer than 10 user interactions and filter out users with fewer than 5 interactions to stabilize model learning and evaluation. Additionally, we exclude courses whose content (e.g., graduation projects or university-specific theses) may introduce semantic ambiguity. After filtering, the final dataset statistics are summarized in Table 1. Since the richness of course metadata differs between the two datasets, we adopt different prompt designs for LLM-based concept generation. For MOOCCube, we include the course title, description, and annotated concepts in the prompt as contextual input. For XuetangX, where only the course title is available, we use it as the sole input to the LLM. This design allows us to evaluate the effectiveness of our concept generation strategy under varying information conditions.

Table 1. Statistics of the two datasets.

4.2. Evaluation Metrics

To estimate the effectiveness of each model, we rely on six widely recognized metrics following previous research [3,55,56]: Precision@K, Recall@K, Hit Ratio (HR@K), Normalized Discounted Cumulative Gain (NDCG@K), and Mean Reciprocal Rank (MRR@K) of Top-K recommendations. It’s important to mention that the higher metric values signify superior performance.

Precision@K is a measure for computing the fraction of relevant items out of all the recommended items, which can be obtained as:

Precision @ K = \frac{|L_{u} (K) \cap T_{u}^{+}|}{| L_{u} (K) |},

where

L_{u} (K)

being the recommendation list based on Top-K,

T_{u}^{+}

being all relevant items in test dataset for user.

Recall@K is defined as the proportion of relevant items that are retrieved, which can be obtained as:

Recall @ K = \frac{|L_{u} (K) \cap T_{u}^{+}|}{T_{u}^{+}},

where

L_{u} (K)

being the recommendation list based on Top-K,

T_{u}^{+}

being all relevant items in test dataset for user. It is to answer the coverage question, among all those considered relevant items.

Hit Ratio (HR@K) is a recall-based metric that measures the percentage of the ground truth instances that are successfully recommended in the top-K recommendation, which is defined as:

H R @ K = \frac{NumberOfHits @ K}{G T},

where GT means ground truth, the total number of items actually of interest to all users in the test set.

Normalized Discounted Cumulative Gain(NDCG@K) is a measure of ranking quality, extended by the Discounted Cumulative Gain (DCG). NDCG@K is computed as:

N D C G @ K = \frac{1}{I D C G @ K} \sum_{p = 1}^{K} \frac{2^{r_{p}} - 1}{{log}_{2} (1 + p)},

where p is the position of an item in the recommendation list and

I D C G @ K

indicates the score obtained by an ideal ranking of

L_{u} (K)

.

Mean Reciprocal Rank (MRR@K) evaluates the average of the reciprocal ranks of the first relevant item in the Top-K recommendation list for each user. It reflects how early the model is able to retrieve the first relevant item. Specifically, if no relevant item appears within the Top-K recommendations, the reciprocal rank is considered to be zero. The metric is computed as:

MRR @ K = \frac{1}{| U |} \sum_{u = 1}^{| U |} \frac{1}{{rank}_{u}},

where

{rank}_{u}

denotes the position of the first relevant item in the Top-K recommendation list

L_{u} (K)

for user u, and

| U |

is the total number of users.

4.3. Baselines

To evaluate the effectiveness of our proposed framework in the course recommendation task, we benchmark its performance against a set of representative baseline models. Each baseline is evaluated both in its original form and with the integration of our LLM-generated concept embeddings, enabling a clear comparison that highlights the contribution of our side information fusion strategies. We select one representative algorithm from each of six major families of recommendation approaches, ensuring a diverse and comprehensive evaluation. We compared and extended the different baseline methods given below:

Collaborative Filtering—Item-based KNN [57]: a classic memory-based collaborative filtering method that models user and item based on item similarity obtained by interaction information.
Matrix Factorization—MF [58], learns latent representations of users and items by decomposing the user–item interaction matrix.
Factorization Machines—LightFM [59], a hybrid recommender system combining collaborative filtering and content-based technique, these two baselines differ in the loss function.
Deep Learning-based Methods—Neural Matrix Factorization (NeuMF) [60], combines generalized matrix factorization (GMF) and multilayer perceptrons (MLP) to jointly model both linear and nonlinear user–item interactions, resulting in more expressive preference learning.
Graph Neural Network-based Models—LightGCN [61], propagates user and item embeddings over a bipartite interaction graph using simplified graph convolutions to enhance collaborative filtering.
Knowledge-enhanced Models—KEAM [7], integrates course knowledge graphs into an autoencoder-based architecture whose intermediate layer is instantiated with concept nodes and course–concept links. While this design yields a structured representation, it must not be interpreted as an explanation mechanism. Human-readable nodes are a necessary but by no means sufficient condition for explainability, and concept-layer activations do not represent a faithful reasoning process unless explicitly evaluated. In this work, KEAM is treated purely as a conventional recommender that consumes concept embeddings, without attributing any form of interpretability to its architecture or outputs.

These baselines span a diverse set of modeling paradigms, covering both traditional and deep learning-based approaches, as well as models that utilize structured knowledge. For each baseline, we implement two versions: a vanilla model and a version augmented with LLM-generated concept embeddings using our proposed fusion strategies. Importantly, the architecture of each model remains unchanged; the integration is performed solely at the input level. This design enables a fair and architecture-agnostic evaluation of our framework, demonstrating its flexibility and effectiveness across a wide range of recommendation methods.

4.4. Training Details

All models were implemented using PyTorch 2.1.2 and trained on a single NVIDIA RTX 4060 Ti GPU. For each dataset, we randomly split the user–course interaction records into training, validation, and test sets, following an 80/20 ratio. We ensured that each user retained at least one interaction in either the training or test set to guarantee valid evaluation. Unless otherwise noted, all models use an embedding size of 64 for both users and items. To ensure fair comparisons, we performed hyperparameter tuning for each model and selected the best configuration on the validation set. The optimization was conducted using either the Adam or the SGD optimizer, with the better-performing choice adopted for each baseline. The learning rate and other hyperparameters were selected through grid search. Most models were trained with the Bayesian Personalized Ranking (BPR) loss. For each positive interaction, we sampled one negative course that the user had not interacted with. An exception is the matrix factorization model, which was trained using the mean squared error (MSE) loss with

L_{2}

regularization, consistent with its original formulation. Training was run for up to 50 epochs, with early stopping based on the validation hit ratio (HR). The batch size was set to 2048 across all models.

To incorporate external semantic knowledge, we used concept-level course embeddings generated by the GPT family of large language models. In particular, the majority of experiments relied on GPT-4.1 to extract and represent course concepts in a high-dimensional semantic space. The resulting 3072-dimensional embeddings were projected into the same 64-dimensional latent space as ID-based embeddings via a learnable transformation layer. This projection enabled the semantic information to be seamlessly fused into various recommendation models under different fusion strategies. All models were trained under a unified protocol to isolate the effects of fusion strategies from other confounding factors.

4.5. Model-Specific Integration Settings

We apply the proposed concept-level integration framework to a set of representative recommendation models, each belonging to a different architectural family. For each model, we adopt one or more fusion strategies defined in Section 3.3, including representation-level, concatenation-based, and structure-level fusion. Table 2 summarizes the specific fusion variants adopted for each baseline model. For the Collaborative Filtering (ItemKNN) model, we adopt the structure-level fusion strategy by enhancing the similarity matrix with concept-aware semantic signals. Specifically, we compute a hybrid similarity score as a weighted combination of interaction-based similarity and concept-based similarity, as described in Equation (2). The mixing coefficient

α

is empirically set to 0.3, 0.5, or 0.7, and the best-performing value is selected on the validation set. In the Matrix Factorization (MF) model, we implement both representation-level and concatenation-based fusion. Representation-level fusion includes linear, gated, and attention-based mechanisms, each producing a fused course representation before prediction. Additionally, we include a concatenation variant where the concept embedding is appended to the ID-based embeddings and processed through a neural scoring function. For the Factorization Machines (FM) model, we focus on concatenation-based fusion, aligning with its original formulation that emphasizes direct integration of side information. However, given the high dimensionality of LLM-generated concept embeddings, we employ dimensionality reduction via projection and explore several fusion variants, including one-hot encoding, cluster-level categorical features, signal-separated channels, and DNN-based transformations. These allow the model to ingest semantic features while preserving parameter efficiency. The NeuMF model mirrors the fusion design of MF and includes all four representations and concatenation variants: linear, gated, attention-based, and direct concatenation. The fused input is then propagated through multilayer perceptron layers to learn nonlinear interactions. In LightGCN, we apply all three fusion strategies. Representation-level fusion includes linear, gated, and attention-based mechanisms for embedding refinement. Concatenation-based fusion appends concept embeddings to the ID-based inputs before graph convolution. Structure-level fusion is achieved by augmenting the course–course graph with additional edges between conceptually similar items, where concept similarity exceeds a cosine threshold of 0.85. Finally, in the KEAM model, which builds student profiles based on knowledge graphs, we replace the original domain-specific knowledge base with concept graphs constructed from LLM-generated course concepts. Replacing a manually defined domain knowledge base with LLM-generated concepts enables KEAM to operate on datasets lacking predefined concept structures (e.g., XuetangX). However, such substitution also introduces additional hallucination risks, since both LLM-generated concepts and automatically constructed graphs may contain semantic noise or structural inconsistencies. We therefore treat the resulting concept layer as a flexible semantic enrichment mechanism rather than a validated knowledge graph or an interpretable reasoning substrate. The mitigation strategies outlined in Section Hallucination Risks and Mitigation Strategies describe how these risks can be moderated in practice. The original KEAM model heavily relies on rich course annotations from the dataset, a limitation common to most knowledge graph-based recommenders. By leveraging LLM-generated concepts, we alleviate this dependency and provide a scalable alternative that retains semantic depth while substantially improving generalizability across sparse or low-resource settings.

Table 2. Summary of side information integration strategies applied to each baseline model.

5. Results

5.1. Overall Performance Comparison

To evaluate the effectiveness of our concept-level augmentation framework, we report the top-performing fusion results for each baseline model across both the MOOCCube and XuetangX datasets. Table 3, Table 4, Table 5 and Table 6 comprehensively summarize the performance comparison of standard baselines and their enhanced versions under different top-K settings (K = 5, 10, 15, 20) on both datasets. Additionally, Figure 4 visualizes the relative gains in HR@10, NDCG@10, and MRR@10 metrics, highlighting the consistent improvements brought by integrating LLM-generated concepts.

Table 3. Performance of Recommendation Models on MOOCCube Dataset at K = 5 and K = 10.

Table 4. Performance of Recommendation Models on XuetangX Dataset at K = 5 and K = 10.

Table 5. Performance of Recommendation Models on MOOCCube Dataset at K = 15 and K = 20.

Table 6. Performance of Recommendation Models on XuetangX Dataset at K = 15 and K = 20.

Figure 4. Performance comparison of baseline and concept-augmented models on the MOOCCube and XuetangX datasets. The Y-axis shows the absolute metric values (HR@10, NDCG@10, and MRR@10). The improvement percentages above each pair indicate the relative gains over the baseline.

On the MOOCCube dataset, all baseline models clearly benefit from incorporating the generated concept information. Notably, ItemKNN, MF, and LightGCN achieve remarkable performance gains. Specifically, ItemKNN shows improvements of HR@10, NDCG@10, and MRR@10 scores by +33.7%, +29.9%, and +23.4%, respectively, clearly demonstrating that simple collaborative-filtering approaches substantially benefit from enriched semantic signals. Similarly, MF exhibits substantial gains of HR@10 score by +45.7%, NDCG@10 score by +95.7%, and MRR@10 score by +125.3%, highlighting how latent factor-based methods benefit considerably from external semantic enrichment. The improvement observed in LightGCN (HR@10 score by +18.6%, NDCG@10 score by +21.5%, MRR@10 score by +18.3%) indicates that even models leveraging graph structures can effectively integrate and utilize external concept information to enhance prediction quality. Additionally, other evaluated models such as NeuMF and KEAM also exhibit meaningful performance gains, demonstrating the broad applicability of our proposed concept-level augmentation approach in improving recommendation quality. We note that our evaluation focuses on performance improvement rather than interpretability assessment. NeuMF, a neural network-based collaborative filtering model, achieves improvements of the HR@10 score by +7.3%, NDCG@10 score by +9.4%, and MRR@10 score by +10.5%. This suggests that deep neural models, which inherently capture complex nonlinear interactions, can further leverage semantic signals from the generated concepts, thereby enhancing recommendation effectiveness. KEAM, as a knowledge-enhanced model, inherently utilizes external knowledge through concept integration. Nevertheless, even KEAM demonstrates measurable improvements (+1.2% in HR@10 score, +2.2% in NDCG@10, and MRR@10 scores). This highlights that LLM-generated concepts provide valuable complementary signals beyond manually annotated metadata, even when rich information is already available.

On the XuetangX dataset, our results indicate consistent performance improvements across the majority of baseline models after integrating the LLM-generated concept embeddings, despite the dataset’s limited metadata (only course titles available). Specifically, pronounced relative improvements are observed in traditional collaborative filtering methods, such as ItemKNN and MF, as well as in more advanced models, including LightGCN and KEAM. For instance, at HR@10 score, ItemKNN improves by +37.4%, MF by +14.7%, LightGCN by +5.8%, and KEAM notably by +45.1%. Similar trends emerge at NDCG@10, where the improvements are particularly striking: ItemKNN (+58.5%), MF (+54.9%), LightGCN (+38.4%), and KEAM (+50.7%). Additionally, MRR@10 metrics exhibit robust performance gains, further underscoring the general effectiveness of our concept-level augmentation strategy. However, we observe a performance decrease in the FM model across all evaluated metrics. We attribute this negative outcome primarily to the integration strategy employed by FM, which relies solely on concatenation-based fusion. Given XuetangX’s inherently sparse context (course titles only), concatenating multiple concept embeddings directly could introduce excessive semantic noise and redundancy, negatively impacting FM’s ability to effectively model interactions. In contrast, more structured or flexible models—such as LightGCN, which utilizes graph-based embeddings, and KEAM, enhanced with knowledge graph-based representations—show greater resilience to noise and effectively leverage the semantic richness provided by the generated concepts. The remarkable performance uplift seen in KEAM is particularly noteworthy, which is a knowledge-enhanced model already designed to incorporate structured external knowledge, and still meaningfully benefits from our concept-generation approach, especially on the XuetangX dataset. On XuetangX, KEAM initially suffers from limited metadata to construct a high-quality knowledge graph, leading to weak baseline performance. However, after introducing LLM-generated concepts, KEAM achieves notable relative improvements (HR@10, NDCG@10 and MRR@10 scores by +45.1%, +50.7% and +54.0%, respectively). This clearly demonstrates that our framework effectively bridges the semantic gap, substantially improving recommendation performance even when existing metadata is sparse or entirely missing. These results highlight a critical insight: the choice of integration strategy is paramount, especially in resource-limited datasets like XuetangX. Although our augmentation framework universally offers meaningful semantic enrichment, carefully matching fusion methods to the model architecture is essential to ensure optimal performance gains.

Another critical advantage of our framework is its practical efficiency and scalability. Since the LLM-generated concepts are produced only once during the data preparation stage, they require minimal computational cost and are fully reusable across different recommendation models or future runs. Importantly, our integration approach does not involve any changes to the underlying model architectures, ensuring seamless adaptability and ease of implementation in diverse real-world scenarios. This low-cost generation and high reusability, coupled with consistent and substantial performance gains, position our approach as an attractive and practical augmentation strategy for large-scale educational recommendation systems. In summary, these experimental results confirm not only the semantic quality and relevance of the generated concept features but also validate the effectiveness, flexibility, and practical utility of our proposed integration strategies. Our framework successfully demonstrates a robust and generalizable solution to enhancing educational recommendation performance across diverse modeling approaches and metadata availability conditions.

5.2. Effectiveness of Different Integration Strategies

To deeply investigate the effectiveness of integrating LLM-generated course concepts, we systematically evaluated various integration strategies across four representative recommendation frameworks: Matrix Factorization (MF), Factorization Machines (FM), Neural Matrix Factorization (NeuMF), and Light Graph Convolutional Networks (LightGCN). Specifically, we examined four general strategies, Linear, Attention-based, Concatenation (Concat), and Gated fusion, alongside model-specific encodings for FM (One-hot, Sign-Discrete, Cluster, DNN) and a Graph-enhanced integration method for LightGCN. The comprehensive results at K = 10 are illustrated in Figure 5 and Figure 6.

Figure 5. Performance of integration strategies applied to MF and NeuMF models on the MOOCCube and XuetangX datasets, evaluated at K = 10 using HR@10, NDCG@10, and MRR@10.

Figure 6. Performance of integration strategies applied to FM and LightGCN models on the MOOCCube and XuetangX datasets, evaluated at K = 10 using HR@10, NDCG@10, and MRR@10.

From the Integration Strategy Perspective, clear performance patterns emerged. For simpler matrix-based models like MF, the Concatenation strategy demonstrated clear effectiveness, particularly on the MOOCCube dataset, achieving the highest HR@10 (0.426), NDCG@10 (0.194), and MRR@10 (0.174). This indicates that directly augmenting item embeddings with rich semantic information substantially enhances their representation capabilities. Conversely, the Attention-based fusion showed stronger results on the XuetangX dataset (HR@10 = 0.343), likely due to its adaptive weighting mechanism, which effectively handles limited semantic contexts. In contrast, Linear and Gated fusions underperformed in the simpler MF architecture, highlighting their inability to adequately model complex semantic interactions in this context. For FM-based models, deeper semantic interactions captured through neural network-based encoding (FM + DNN) notably outperformed simpler encodings like One-hot or Cluster-based methods. On the MOOCCube dataset, FM + DNN achieved superior results, with HR@10 (0.432), NDCG@10 (0.195), and MRR@10 (0.206), clearly surpassing other variants. A similar but less prominent advantage emerged on XuetangX, reaffirming that capturing sophisticated nonlinear semantic interactions through neural architectures notably enhance the FM framework, especially when semantic signals are relatively rich. For complex hybrid architectures such as NeuMF and LightGCN, adaptive integration strategies consistently provided superior results. NeuMF exhibited robust performance gains using Attention and Gated fusions, with Gated fusion notably achieving competitive HR@10 (0.426) and the highest MRR@10 (0.160) on MOOCCube. Likewise, LightGCN strongly favored the Gated fusion approach, achieving remarkable results (HR@10 = 0.595 on MOOCCube; HR@10 = 0.531 on XuetangX). Notably, the Graph-enhanced strategy, which directly integrates concepts into the underlying graph structure, markedly underperformed (e.g., HR@10 = 0.494 on MOOCCube). This unexpected finding suggests that direct manipulation of graph structures might introduce semantic redundancy or noise, negatively affecting graph embedding quality.

Analyzing from the Dataset Perspective, we specifically focused on data sparsity, a common issue in recommendation scenarios. Both datasets lacked rich, explicitly annotated semantic course information, relying instead solely on the limited semantic context provided by LLM-generated concepts. Nevertheless, clear performance improvements were observed on the MOOCCube dataset. For instance, MF’s HR@10 improved substantially from a baseline of 0.292 to 0.426 with Concatenation fusion, clearly demonstrating the efficacy of automatically generated concepts in alleviating sparsity. Similarly, complex models like LightGCN also benefited considerably from semantic augmentation, indicating the general robustness of adaptive integration methods. The XuetangX dataset presented even greater sparsity challenges, given its shorter course titles. Despite this limitation, substantial performance gains were observed, particularly when using dynamic fusion strategies. LightGCN’s Gated fusion notably improved HR@10 from the baseline of approximately 0.501 to 0.531. These results underscore the suitability of adaptive embedding fusion methods for effectively exploiting sparse semantic signals.

From the Baseline Methodological Perspective, we explicitly compared our enhanced models against traditional baselines without semantic embeddings. Models enhanced with LLM-generated concepts consistently outperformed their baselines across multiple metrics. The MF and FM models on MOOCCube showed considerable improvements, clearly underscoring the benefits of semantic enrichment. On XuetangX, dynamic fusion methods (Attention and Gated) also demonstrated substantial improvements over baseline performances, validating the general effectiveness of semantic integration even under severely sparse conditions.

These findings offer several critical insights. First, simple models such as MF substantially benefit from straightforward semantic embedding concatenation, effectively overcoming inherent limitations related to sparsity. Second, FM-based frameworks require deeper, nonlinear semantic interactions, highlighting the value of neural-based encodings. Third, adaptive fusion methods like Gated and Attention strategies consistently outperform simpler integration approaches in complex architectures such as NeuMF and LightGCN, emphasizing the importance of tailored integration methods according to model complexity. Moreover, the interplay between dataset semantic richness and integration strategy effectiveness was particularly notable. Richer contexts like MOOCCube notably amplify semantic embedding integration effects, while sparse datasets like XuetangX illustrate the necessity and robustness of adaptive methods. An intriguing yet counterintuitive result was that directly enhancing the graph structure with semantic embeddings (Graph-enhanced fusion) underperformed, highlighting potential drawbacks in semantic redundancy and noise introduction, thus indicating a need for carefully controlled semantic integration. In summary, our findings provide clear empirical evidence that integrating LLM-generated semantic embeddings enhances recommendation effectiveness across diverse models and dataset contexts. These insights not only inform theoretical developments in semantic feature integration but also offer practical guidelines for deploying effective recommendation systems in educational settings.

5.3. Influence of Different LLMs

To investigate how different large language models (LLMs) and prompt configurations for concept generation affect recommendation, we conducted a systematic evaluation across 20 LLM + prompt combinations. Specifically, we selected four representative LLMs, which are GPT3.5, GPT4o, GPT4omini, and GPT4.1, and paired each with five prompting strategies: ZeroShot, OneShot, Description, Concept, and All Information. These prompt types, inspired by recent studies [12], vary in the amount and type of contextual information provided to the LLM. These prompting strategies are aligned with the P1–P5 templates proposed in prior work [12], allowing us to assess how incremental context affects the model’s generation behavior. For example, ZeroShot uses only the course title without examples or additional input, while OneShot includes a single in-context example. Concept adds original human-annotated concepts, Description provides the course description, and All Information integrates all available metadata to form the most complete context. Importantly, all LLMs were queried using the same task instructions and format templates to ensure fairness in generation quality comparison as shown in Section 3.2. Our goal in this section is not to benchmark large language models themselves, but to investigate whether LLM-generated concepts can serve as effective side information and how the quality of such concepts affects recommendation performance. To control confounding factors such as tokenizer differences, prompt syntax, and output formatting, we selected four models from the same family—GPT-3.5, GPT-4o-mini, GPT-4o, and GPT-4.1—representing a clear spectrum of reasoning capability and cost. This design enables a controlled examination of whether concept quality, rather than architectural or vendor differences, drives the observed performance gains. Using models from one provider also ensures consistent prompting and reproducibility under a unified API environment. Future work will extend this analysis to open-source and domain-specific LLMs (e.g., Llama 3, Mistral, Qwen) to assess cross-model robustness and generalizability. All experiments were conducted on the MOOCCube dataset. We selected 100 target courses and used each of the 20 LLM+prompt configurations to generate course-level concepts. To create a realistic recommendation evaluation setting, we retained only users who had interacted with at least one of the selected courses. For each user, one interaction was randomly held out for testing. This yielded a test set comprising 13,153 users and 30,969 interactions. Following the concept generation stage, we encoded each course’s concept set using the same semantic embedding process described in Section 3.2.2, and obtained a 3072-dimensional embedding vector for each course. These embeddings were then projected into a shared latent space and injected into the LightGCN model for evaluation via the fusion strategies introduced earlier. This setup allows us to directly assess how the choice of LLM and prompt affects both the semantic quality of generated concepts and the resulting recommendation performance.

To better understand the effect of different LLM and prompt choices on downstream recommendation quality, we perform a multi-dimensional analysis from three perspectives: (1) the influence of LLM architecture, (2) the impact of prompt design, and (3) the overall efficacy of integrating LLM-generated concepts. This allows us to comprehensively evaluate not only the raw performance differences but also the underlying patterns and implications for future design of educational recommendation systems enhanced by large language models. Our results reveal that the choice of LLM architecture strongly influences downstream recommendation performance. As shown in Figure 7 and Figure 8, GPT4.1 consistently outperforms other models across all evaluation metrics, achieving the highest Recall@10 score of 0.4972 with the ZeroShot prompt. GPT4o and GPT4omini follow closely, with best Recall@10 scores of 0.4881 and 0.4939, respectively. GPT3.5 lags, with its highest score capped at 0.4885. The overall ranking of LLMs remains stable across metrics, indicating the robustness of this trend. These findings suggest that more advanced LLMs, particularly GPT4.1, possess a stronger ability to extract meaningful and generalizable concepts from limited course information, which translates into better user–item matching in recommendations. Notably, even the performance of GPT4omini, a lightweight variant, is highly competitive, indicating that parameter scale is not the sole determinant of utility. This opens up opportunities for applying smaller models in resource-constrained educational settings.

Figure 7. Heatmap visualization of recommendation performance on the MOOCCube dataset, showing HR@10, NDCG@10, and MRR@10 scores for combinations of four LLMs (GPT3.5, GPT4omini, GPT4o, GPT4.1) and five prompt types (ZeroShot, OneShot, Description, Concept, AllInfo).

Figure 8. Performance comparison of different LLM models and prompt configurations on MOOCCube dataset.

Prompt design also plays a critical role in shaping recommendation quality. When aggregating results across all LLMs, we observe the following average HR@10 scores per prompt type: ZeroShot (0.4919), OneShot (0.4880), Description (0.4821), Concept (0.4783), and All Information (0.4815). Surprisingly, the minimal-context prompts (ZeroShot and OneShot) outperform richer-context prompts (Description, Concept, and AllInfo) across nearly all LLMs. This trend holds not only in HR@10 but also in NDCG@10 and MRR@10, reflecting that simpler prompts often yield more effective concepts for downstream recommendation tasks. One possible explanation is that excessive or heterogeneous input—such as full descriptions or noisy human-annotated concepts—may introduce redundancy, ambiguity, or distractive signals. On the other hand, concise prompts like course titles might force LLMs to rely more on pre-trained knowledge and abstraction capabilities, thus producing more transferable and generalized concept representations. Among all configurations, GPT4.1 with ZeroShot prompt delivers the highest HR@10 of 0.4972, and this trend is consistent across other models like GPT4o and GPT4omini. At first glance, this seems counterintuitive—one would expect that providing more context should help the model generate better outputs. However, this observation resonates strongly with prior findings from human evaluation in previous research in educational scenarios [12], where human raters also preferred concepts generated using simpler prompts (e.g., P1, analogous to ZeroShot). Our results thus echo and extend this insight: less is more, especially when the downstream task requires semantic abstraction and transferability, rather than memorization or domain-specific fit. This further suggests that human-aligned semantic quality and machine-aligned recommendation utility may co-evolve under similar prompting patterns, reinforcing the value of prompt minimalism.

Importantly, we find that all 20 LLM + prompt configurations outperform the no-concept baseline, demonstrating that LLM-generated concepts are beneficial across the board. Even the lowest-performing combination, GPT4.1 with All Info prompt, achieves a Recall@10 of 0.4704, which is still substantially above the LightGCN baseline without concept information. This validates our core hypothesis: LLM-generated concepts, even when prompted with minimal input, can effectively serve as side information to improve personalized recommendations. The consistent lift across different models and prompts highlights the robustness of this approach and its potential scalability to other datasets or domains. In summary, our findings underscore the strong influence of both LLM selection and prompt design on the quality of generated concepts and their downstream utility. More powerful LLMs such as GPT4.1 consistently yield better results, and prompts with minimal context (e.g., ZeroShot and OneShot) often outperform more elaborate ones. The fact that ZeroShot configurations perform best—mirroring human preference trends observed in prior work—highlights the non-trivial dynamics of prompting. Importantly, all combinations tested offer tangible gains over the baseline, validating the practical value of LLM-generated concepts for enhancing personalized course recommendations.

5.4. Cold-Start Scenario Recommendation

Cold-start is one of the core challenges in recommender systems, where models struggle to make accurate predictions due to the lack of sufficient user interaction history. In our study, we specifically address the user cold-start problem and investigate whether integrating GPT-generated course concepts can alleviate this issue. We consider two levels of cold-start. In the MOOCCube dataset, we select users with no more than 5 historical interactions (173,249 users), while in the XuetangX dataset, we consider users with no more than 3 interactions (31,269 users). For each user, we hold out one interaction as the test instance. This setup allows us to evaluate the robustness of our framework under both moderate and extreme sparsity conditions. The concept fusion strategy adopted for each baseline is the best-performing variant identified in previous experiments. The results under the cold-start scenario are presented in Table 7 and Table 8, while Figure 9 illustrates the relative improvements of the recommendation models after incorporating GPT-generated course concepts. To better understand the effectiveness of our proposed framework under cold-start conditions, we conduct a multi-perspective analysis across two datasets, various model types, and performance comparison with the general (non-cold-start) setting, as shown in Section 5.1. To facilitate clearer cross-section comparison and avoid requiring readers to flip back to Section 5.1, we additionally summarize in Table 9 and Table 10 the top-10 performance results (HR@10, NDCG@10, and MRR@10) of all six baseline models under both general and cold-start conditions. These summary tables provide a concise view of how concept integration influences performance across different levels of data sparsity.

Table 7. Cold-Start Recommendation Performance on MOOCCube Dataset at K = 5, 10, 15, and 20.

Table 8. Cold-Start Recommendation Performance on XuetangX Dataset at K = 5, 10, 15, and 20.

Figure 9. Performance comparison of baseline and concept-augmented models under the cold-start setting at K = 10. The bars show absolute performance values, while the numbers above indicate the relative improvements (%) compared with the baseline.

Table 9. Comparison of Top-10 Recommendation Performance between General and Cold-Start Settings on XuetangX Dataset.

Table 10. Comparison of Top-10 Recommendation Performance between General and Cold-Start Settings on MOOCCube Dataset.

First, we examine the differences between the two datasets in terms of cold-start severity and observed gains. MOOCCube represents a moderate cold-start scenario, with users having ≤5 interactions, while XuetangX presents a more extreme case, limiting user histories to ≤3 interactions. Despite these challenges, our framework demonstrates remarkable adaptability. On MOOCCube, all models benefit from concept integration, with consistent performance improvements across metrics and top-K values. Notably, models such as MF and LightGCN achieve substantial gains, e.g., MF improves HR@10 from 0.1996 to 0.3242 and MRR@10 from 0.0577 to 0.1356. On the more challenging XuetangX dataset, the improvements remain robust, especially for models capable of flexibly leveraging semantic information. For instance, LightGCN increases HR@10 from 0.4565 to 0.5437, while KEAM jumps from 0.2317 to 0.5103. These findings highlight the strong generalization ability of our framework, even when only minimal contextual information (e.g., course titles) is available. Second, we analyze how different model types benefit from LLM-generated concepts under the cold-start scenario. Traditional collaborative filtering models like ItemCF and MF exhibit the most pronounced relative improvements. For example, ItemCF’s NDCG@10 improves by +130.5% on MOOCCube and +61.0% on XuetangX, showing that even simple models can greatly benefit from external semantic signals. On the other hand, models such as NeuMF and LightGCN also show measurable but more moderate improvements, as they already capture complex user–item interactions internally. Importantly, the performance of FM deserves closer attention. Unlike in Section 5.1, where FM shows stable gains on MOOCCube, in the XuetangX cold-start setting, FM’s performance drops after incorporating concept embeddings (e.g., MRR@10 from 0.0590 to 0.0546). This counterintuitive result is likely due to the lack of rich contextual signals in XuetangX and the FM model’s reliance on concatenation-based fusion, which may introduce noise when concept signals are limited or overly redundant. Third, comparing cold-start results to those in Section 5.1 under general settings, we find both similarities and distinctions. In both cases, concept integration consistently improves performance, confirming the utility of semantic enrichment. However, the gains are often more pronounced in cold-start scenarios, particularly for models originally hampered by sparse interaction data. MF, for instance, exhibits a relative improvement of over +60% in HR@10 under cold-start, compared to a smaller gain under full data settings. This indicates that the impact of concept information is magnified when behavioral signals are scarce, further reinforcing the value of our augmentation strategy in sparse recommendation environments.

In summary, our lightweight and model-agnostic framework shows clear advantages in handling cold-start scenarios by leveraging GPT-generated course concepts as semantic side information. It not only boosts accuracy across a wide range of models and datasets but also maintains low computational cost, as concept generation is performed once during data preprocessing. The ability to improve performance without architectural changes makes our approach highly practical for real-world deployment. Ultimately, these results demonstrate that our framework is well-suited for enhancing course recommendations in both rich and sparse environments, offering a scalable solution to one of the most persistent challenges in recommender systems. It is important to note that the proposed framework introduces minimal additional computational burden. The LLM is used only once during the data preprocessing stage to generate course concepts, after which all downstream training and inference rely solely on the conventional recommender model without invoking the LLM again. This design allows the generated concept embeddings to be stored, reused, and shared across different models or datasets without repeated computation. Moreover, the integration strategies (e.g., linear addition, concatenation, gating, or attention) operate on low-dimensional vectors within the same latent space, preserving the original time and space complexity of the base recommenders. Consequently, our framework maintains high efficiency while achieving substantial accuracy gains, making it suitable for real-world educational platforms with limited computational resources.

6. Discussion

This study demonstrates that integrating GPT-generated course concepts into recommendation models consistently improves performance, especially under cold-start scenarios. Traditional models such as MF and ItemCF benefit most from semantic enrichment, as these architectures lack intrinsic mechanisms for contextual reasoning. The observed improvements are more pronounced in the MOOCCube dataset, where richer course metadata and diverse user behaviors enable stronger alignment between generated concepts and actual user preferences. In contrast, the XuetangX dataset poses a more extreme challenge, with minimal textual information and shorter user histories, which expose the limitations of direct fusion strategies. For instance, the FM model underperforms when combined with generated concepts on XuetangX, likely due to noise amplification from concatenation-based integration, suggesting that adaptive or attention-based fusion mechanisms may be necessary to handle sparse or noisy semantic inputs.

Beyond performance gains, these findings reveal several broader implications. First, LLM-generated concepts can act as a universal semantic interface between language models and recommender systems, providing structured meaning representations that are transferable across models and datasets. This indicates that concept-level augmentation can bridge the gap between symbolic course knowledge and collaborative signals, potentially improving model interpretability and robustness. Second, the consistent benefits observed across both simple and complex recommenders suggest that the proposed framework captures a general principle: semantic abstraction compensates for behavioral sparsity. Even without additional training or architectural modifications, pre-generated semantic signals can serve as a lightweight yet powerful substitute for dense interaction data.

Despite promising results, several limitations remain. First, GPT-generated concepts are inherently influenced by the quality and scope of input course descriptions. They may contain redundant or noisy information, particularly when input data is sparse or ambiguous. Second, the current fusion strategies are manually selected for each model, raising concerns about generalizability and scalability across other architectures or domains. Third, while the approach works well for text-rich cold-start scenarios, it may be less effective in domains where semantic descriptions are limited or unavailable. Another limitation concerns the lack of a direct baseline that embeds the full course descriptions using the same model as the proposed approach. While such a comparison would help isolate the specific contribution of concept abstraction, it is difficult to implement consistently across datasets. The XuetangX dataset does not include textual descriptions, and the descriptions in MOOCCube are often lengthy, heterogeneous, and noisy, containing HTML tags or instructor biographies that weaken the semantic signal. In addition, the OpenAI text-embedding-3-large model used in this work imposes input-length constraints, which make direct encoding of many raw descriptions infeasible. Preliminary checks suggested that directly embedding these unprocessed texts leads to unstable and redundant representations, whereas GPT-generated concepts yield concise and semantically coherent vectors that better capture the essential meaning of each course. The improvement observed when incorporating these concept vectors into a knowledge-graph-based model such as KEAM further indicates that the abstraction step provides complementary semantic information rather than duplicating what the encoder already learns. Future work will include a systematic comparison of different levels of textual representation, from raw descriptions to LLM-generated concepts, in order to quantify the semantic contribution of the abstraction process.

The current study primarily evaluates recommendation accuracy because all baseline models, including MF, NeuMF, LightGCN, FM, and KEAM, are optimized for predictive precision rather than for enhancing diversity or coverage. This focus ensures methodological consistency and facilitates direct comparison with existing work in course recommendation, where accuracy-based metrics such as Precision, Recall, HR, nDCG, and MRR are widely adopted as standard benchmarks. In large-scale MOOC settings, the number of courses is typically far smaller than the number of learners, which differs fundamentally from conventional item recommendation and makes accuracy a more reliable indicator of system performance. Evaluating catalog coverage or diversity in this context further requires a well-defined taxonomy of course categories that is not yet available in the current datasets. Future work will extend the experimental design by introducing diversity-oriented baselines and constructing structured course categories, enabling a more comprehensive assessment that includes diversity, coverage, novelty, and interpretability in addition to accuracy.

Nevertheless, the proposed framework offers practical implications. Its lightweight, plug-and-play nature makes it particularly suitable for real-world educational platforms, especially for newly registered users or newly launched courses where interaction data is lacking. The one-time offline generation of concepts ensures low computational overhead, while semantic enrichment provides immediate benefit without altering model structures. This makes the approach highly deployable in production systems facing frequent cold-start issues.

Future work could explore the development of end-to-end frameworks that unify GPT-based concept generation, representation learning, and recommendation into a single jointly optimized pipeline. Additionally, automating the selection of fusion strategies or incorporating multimodal signals, such as course videos, speech transcripts, and visual materials, may further enhance recommendation performance. To improve the reliability and pedagogical alignment of generated concepts, incorporating expert feedback or human-in-the-loop refinement mechanisms is essential. Moreover, enabling feedback loops from both students and instructors could support the construction of adaptive, need-based concept augmentation tools that better serve diverse educational contexts. Furthermore, this work does not include a dedicated evaluation of interpretability. Although the concept-level structure of our approach provides a natural foundation for explainable recommendations, future work will focus on systematically assessing this aspect through both quantitative metrics (e.g., concept attribution, explanation fidelity) and qualitative user studies. Such analyses will help verify whether the concept-based reasoning indeed enhances users’ understanding and trust in course recommendations.

Hallucination Risks and Mitigation Strategies

While LLM-generated concepts offer a lightweight and effective semantic abstraction for course recommendation, they also introduce inherent risks associated with hallucination. Prior work has shown that LLMs tend to fill knowledge gaps with fabricated or imprecise information, particularly when domain-specific evidence is limited or missing. LLMs often compensate for incomplete internal knowledge by generating plausible but incorrect statements, a behavior that becomes more prominent when prompts do not fully capture the underlying expert mental model [62]. In our context, such hallucinations may manifest as off-topic or overly generic course concepts, spurious relationships between concepts, or inconsistent higher-level semantic structures. Importantly, hallucination risks are not unique to LLMs alone. Automatically constructed knowledge graphs are also known to contain erroneous or ambiguous triples, schema inconsistencies, and noisy relation patterns, even when no language model is involved. As highlighted in recent surveys on knowledge graph reliability, errors may arise during entity extraction, relation prediction, schema alignment, or multi-source fusion, and can subsequently propagate to downstream reasoning tasks [63]. Consequently, replacing a manually curated knowledge base with an automatically generated concept graph, while advantageous for scalability, expands the potential surface for semantic noise. This limitation aligns with observations from the NAACL study, which emphasizes that neither standalone LLM reasoning nor automatically generated KGs are sufficient to guarantee factual correctness without additional constraints or validation mechanisms.

To mitigate these risks, prior research proposes several classes of approaches. First, retrieval-augmented generation (RAG) and KG-RAG incorporate external trusted sources to constrain model outputs and reduce free-form hallucination by grounding generation in explicit evidence [62]. Second, post-generation validation—such as consistency checking, majority-vote self-consistency, and logic-based filtering—can identify contradictions or unverifiable claims in model-generated content. Third, expert-in-the-loop frameworks such as the Expert Mental Model (EMM) approach demonstrate that structured human knowledge, even when partially elicited, can substantially reduce hallucination by guiding or verifying model reasoning steps. Finally, knowledge-graph–centric work emphasizes the importance of triple validation, conflict detection, cross-source verification, and uncertainty annotation to improve the reliability of automatically formed graphs [63].

Building on these insights, we integrate several complementary mitigation strategies into our concept-level augmentation framework. First, we adopt a confidence-weighted representation for each concept and concept–course link. LLM-native confidence scores (e.g., token-level log-probabilities), frequency of concept recurrence across multiple sampled generations, and semantic similarity to the original course description collectively define a confidence weight. Let

e_{c}

denote the embedding of a generated concept and

w_{c}

its confidence score. We use a rescaled embedding

e_{c}^{'} = α \cdot w_{c} \cdot e_{c}

during model training, where

α

is a tunable coefficient. Low-confidence concepts thus contribute minimally to the learned representation, while high-confidence concepts retain their influence. Second, we apply a multi-sample consensus filter to reduce idiosyncratic hallucinations. For each course, concepts are generated multiple times independently, and only those appearing with sufficient frequency are retained. This self-consistency mechanism follows recent findings that agreement across independent LLM samples correlates strongly with semantic reliability [62]. Frequency statistics can also be incorporated directly into the confidence weight

w_{c}

. Third, we incorporate text-grounded semantic verification. Each concept is compared with the original course description using embedding similarity, and concepts whose semantic distance exceeds a threshold are down-weighted or discarded. This grounding step ensures that retained concepts remain anchored in observable evidence rather than free-form LLM inference. Fourth, we apply simple structural validation to the generated concept graph. Cyclic dependencies, isolated nodes, and implausible links—common indicators of KG hallucination [63]—are pruned prior to integration. This regularization serves as a structural safeguard to suppress noisy or logically inconsistent relations. Finally, model performance itself provides an implicit behavioral regularizer: if low-confidence concepts fail to improve accuracy or degrade performance, their weights can be reduced or removed in subsequent iterations. Together, these mechanisms offer a practical pathway for reducing both LLM- and KG-induced hallucinations while preserving the benefits of concept-level semantic abstraction. Although we do not perform a standalone hallucination audit in this study, the proposed mechanisms are straightforward to incorporate into future iterations of our framework, and future work will investigate more advanced techniques—such as fine-grained fact-checking, human-in-the-loop refinement, and adaptive concept filtering—to further enhance reliability.

7. Conclusions

In this paper, we propose a lightweight and generalizable framework to enhance recommendation performance by integrating GPT-generated course concepts. Our approach enriches item representations with semantic signals derived from course descriptions, enabling more effective recommendations, especially in cold-start scenarios. By systematically evaluating multiple integration strategies across diverse models and datasets, we demonstrated that the incorporation of generated concepts leads to consistent performance improvements, with particularly pronounced gains for traditional collaborative filtering models and under sparse data conditions.

The framework is modular and efficient, requiring no modifications to existing model architectures and incurring minimal computational cost, as concept generation is performed offline. Extensive experiments on two large-scale MOOC datasets confirm the robustness of our method under both general and cold-start settings. Notably, the framework also holds practical value for real-world educational platforms, offering an effective solution for addressing new user and new course cold-start problems.

Our findings suggest that large language models can serve not only as generators of educational content but also as effective data augmenters for recommender systems. Through extensive experiments, we demonstrate that integrating LLM-generated concepts consistently enhances recommendation quality, particularly for traditional collaborative filtering and matrix factorization models that lack semantic signals. The benefit is most pronounced under sparse and cold-start conditions, where concept-level side information provides crucial contextual cues absent from interaction data. We also find that a minimal prompting design, using concise course titles without additional metadata, yields the most transferable and generalizable concept representations. These results indicate that LLMs can reliably transform unstructured educational text into structured semantic knowledge, enabling scalable and low-cost improvements to recommendation systems. This work, therefore, lays a concrete foundation for adaptive and semantically enriched educational recommendation technologies that can operate effectively even in data-scarce environments.

Author Contributions

Conceptualization, T.Y. and B.M.; methodology, T.Y.; software, B.R.; validation, T.Y., F.X. and C.G.; formal analysis, C.G.; investigation, F.X.; resources, T.Y.; data curation, C.G.; writing—original draft preparation, T.Y.; writing—review and editing, B.M. and S.K.; visualization, B.R.; supervision, B.M. and S.K.; project administration, B.M. and S.K.; funding acquisition, B.M. and S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JST SPRING, Grant Number JPMJSP2136, JSPS KAKENHI Grant Numbers JP20H00622 and JP24K20903.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study utilized publicly available datasets, including MOOCCube and XuetangX, which can be accessed at http://moocdata.cn/ (accessed on 14 October 2025). The GPT-generated course concepts and their associated embeddings used in this study are available from the corresponding author upon reasonable request for non-commercial research purposes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ma, B.; Lu, M.; Taniguchi, Y.; Konomi, S. CourseQ: The impact of visual and interactive course recommendation in university environments. Res. Pract. Technol. Enhanc. Learn. 2021, 16, 18. [Google Scholar] [CrossRef] [PubMed]
Ma, B.; Yang, T.; Ren, B. A Survey on Explainable Course Recommendation Systems. In Proceedings of the Distributed, Ambient and Pervasive Interactions; Streitz, N.A., Konomi, S., Eds.; Springer: Cham, Switzerland, 2024; pp. 273–287. [Google Scholar]
Zhang, J.; Hao, B.; Chen, B.; Li, C.; Chen, H.; Sun, J. Hierarchical reinforcement learning for course recommendation in MOOCs. Proc. AAAI Conf. Artif. Intell. 2019, 33, 435–442. [Google Scholar] [CrossRef]
Ma, H.; Zhu, J.; Yang, S.; Liu, Q.; Zhang, H.; Zhang, X.; Cao, Y.; Zhao, X. A prerequisite attention model for knowledge proficiency diagnosis of students. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 4304–4308. [Google Scholar]
Gao, W.; Liu, Q.; Huang, Z.; Yin, Y.; Bi, H.; Wang, M.C.; Ma, J.; Wang, S.; Su, Y. RCD: Relation map driven cognitive diagnosis for intelligent education systems. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 501–510. [Google Scholar]
Yang, T.; Ren, B.; Ma, B.; He, T.; Gu, C.; Konomi, S. Boosting Course Recommendation Explainability: A Knowledge Entity Aware Model Using Deep Learning. In Proceedings of the 32nd International Conference on Computers in Education, ICCE2024, Quezon, Philippines, 25–29 November 2024; Asia-Pacific Society for Computers in Education: Taoyuan, Taiwan, 2024; pp. 360–366. [Google Scholar]
Yang, T.; Ren, B.; Ma, B.; Khan, M.A.Z.; He, T.; Konomi, S. Making Course Recommendation Explainable: A Knowledge Entity-Aware Model using Deep Learning. In Proceedings of the 17th International Conference on Educational Data Mining, Atlanta, GA, USA, 14–17 July 2024; pp. 658–663. [Google Scholar] [CrossRef]
Pan, L.; Wang, X.; Li, C.; Li, J.; Tang, J. Course concept extraction in moocs via embedding-based graph propagation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, 27 November–1 December 2017; pp. 875–884. [Google Scholar]
Lu, M.; Wang, Y.; Yu, J.; Du, Y.; Hou, L.; Li, J. Distantly Supervised Course Concept Extraction in MOOCs with Academic Discipline. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; pp. 13044–13059. [Google Scholar] [CrossRef]
Yang, T.; Ren, B.; Gu, C.; Ma, B.; Konomi, S. Leveraging ChatGPT for Automated Knowledge Concept Generation. In Proceedings of the CELDA2024: International Conference on Cognition and Exploratory Learning in the Digital Age, Zagreb, Croatia, 26–28 October 2024; International Association for Development of the Information Society (IADIS): Lisbon, Portugal, 2024; pp. 75–82. [Google Scholar]
Ehara, Y. Measuring Similarity between Manual Course Concepts and ChatGPT-generated Course Concepts. In Proceedings of the 16th International Conference on Educational Data Mining, Bengaluru, India, 11–14 July 2023; pp. 474–476. [Google Scholar]
Yang, T.; Ren, B.; Gu, C.; He, T.; Ma, B.; Konomi, S. Leveraging LLMs for Automated Extraction and Structuring of Educational Concepts and Relationships. Mach. Learn. Knowl. Extr. 2025, 7, 103. [Google Scholar] [CrossRef]
Jing, X.; Tang, J. Guess you like: Course recommendation in MOOCs. In Proceedings of the International Conference on Web Intelligence, Leipzig, Germany, 23–26 August 2017; pp. 783–789. [Google Scholar]
Morsomme, R.; Alferez, S.V. Content-Based Course Recommender System for Liberal Arts Education. In Proceedings of the International Conference on Educational Data Mining (EDM 2019), Montreal, QC, Canada, 2–5 July 2019. [Google Scholar]
Morsy, S.; Karypis, G. Will this Course Increase or Decrease Your GPA? Towards Grade-aware Course Recommendation. J. Educ. Data Min. 2019, 11, 20–46. [Google Scholar]
Naren, J.; Banu, M.Z.; Lohavani, S. Recommendation system for students’ course selection. In Proceedings of the Smart Systems and IoT: Innovations in Computing: Proceeding of SSIC 2019; Springer: Singapore, 2020; pp. 825–834. [Google Scholar]
Chen, X.; Yin, C.; Chen, H.; Rong, W.; Ouyang, Y.; Chai, Y. Course Recommendation System Based on Course Knowledge Graph Generated by Large Language Models. In Proceedings of the 2024 IEEE International Conference on Teaching, Assessment and Learning for Engineering (TALE), Bengaluru, India, 9–12 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8. [Google Scholar]
Polyzou, A.; Nikolakopoulos, A.N.; Karypis, G. Scholars Walk: A Markov Chain Framework for Course Recommendation. In Proceedings of the International Conference on Educational Data Mining (EDM 2019), Montreal, QC, Canada, 2–5 July 2019. [Google Scholar]
Wagner, K.; Merceron, A.; Sauer, P.; Pinkwart, N. Can the Paths of Successful Students Help Other Students with Their Course Enrollments? In Proceedings of the 16th International Conference on Educational Data Mining, Bengaluru, India, 11–14 July 2023; pp. 171–182. [Google Scholar] [CrossRef]
Gong, J.; Wang, S.; Wang, J.; Feng, W.; Peng, H.; Tang, J.; Yu, P.S. Attentional graph convolutional networks for knowledge concept recommendation in moocs in a heterogeneous view. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 79–88. [Google Scholar]
Pardos, Z.A.; Fan, Z.; Jiang, W. Connectionist recommendation in the wild: On the utility and scrutability of neural networks for personalized course guidance. User Model. User-Adapt. Interact. 2019, 29, 487–525. [Google Scholar] [CrossRef]
Jiang, W.; Pardos, Z.A.; Wei, Q. Goal-based course recommendation. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge, Tempe, AZ, USA, 4–8 March 2019; pp. 36–45. [Google Scholar]
Yu, J.; Wang, C.; Luo, G.; Hou, L.; Li, J.; Tang, J.; Huang, M.; Liu, Z. Expanrl: Hierarchical reinforcement learning for course concept expansion in MOOCs. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 4–7 December 2020; pp. 770–780. [Google Scholar]
Gao, M.; Luo, Y.; Hu, X. Online course recommendation using deep convolutional neural network with negative sequence mining. Wirel. Commun. Mob. Comput. 2022, 2022, 9054149. [Google Scholar] [CrossRef]
Yang, T.; Ren, B.; Gu, C.; Ma, B.; He, T.; Konomi, S. Towards better course recommendations: Integrating multi-perspective meta-paths and knowledge graphs. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3–7 March 2025; pp. 137–147. [Google Scholar]
Foster, J.M.; Sultan, M.A.; Devaul, H.; Okoye, I.; Sumner, T. Identifying core concepts in educational resources. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, Washington, DC, USA, 10–14 June 2012; pp. 35–42. [Google Scholar]
Manrique, R.; Grévisse, C.; Marino, O.; Rothkugel, S. Knowledge graph-based core concept identification in learning resources. In Proceedings of the Joint International Semantic Technology Conference; Springer: Cham, Switzerland, 2018; pp. 36–51. [Google Scholar]
Changuel, S.; Labroche, N.; Bouchon-Meunier, B. Resources Sequencing Using Automatic Prerequisite–Outcome Annotation. ACM Trans. Intell. Syst. Technol. (TIST) 2015, 6, 1–30. [Google Scholar] [CrossRef]
Yu, J.; Wang, C.; Luo, G.; Hou, L.; Li, J.; Liu, Z.; Tang, J. Course Concept Expansion in MOOCs with External Knowledge and Interactive Game. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4292–4302. [Google Scholar]
Reales, D.; Manrique, R.; Grévisse, C. Core Concept Identification in Educational Resources via Knowledge Graphs and Large Language Models. SN Comput. Sci. 2024, 5, 1029. [Google Scholar] [CrossRef]
Qiao, S.; Ou, Y.; Zhang, N.; Chen, X.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; Chen, H. Reasoning with Language Model Prompting: A Survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 5368–5393. [Google Scholar] [CrossRef]
Wu, L.; Zheng, Z.; Qiu, Z.; Wang, H.; Gu, H.; Shen, T.; Qin, C.; Zhu, C.; Zhu, H.; Liu, Q.; et al. A survey on large language models for recommendation. World Wide Web 2024, 27, 60. [Google Scholar] [CrossRef]
Barany, A.; Nasiar, N.; Porter, C.; Zambrano, A.F.; Andres, A.L.; Bright, D.; Shah, M.; Liu, X.; Gao, S.; Zhang, J.; et al. ChatGPT for education research: Exploring the potential of large language models for qualitative codebook development. In Proceedings of the International Conference on Artificial Intelligence in Education, Recife, Brazil, 8–12 July 2024; Springer: Cham, Switzerland, 2024; pp. 134–149. [Google Scholar]
Lin, J.; Chen, E.; Han, Z.; Gurung, A.; Thomas, D.R.; Tan, W.; Nguyen, N.D.; Koedinger, K.R. How Can I Improve? Using GPT to Highlight the Desired and Undesired Parts of Open-ended Responses. In Proceedings of the 17th International Conference on Educational Data Mining, Atlanta, GA, USA, 14–17 July 2024; pp. 236–250. [Google Scholar]
Lohr, D.; Berges, M.; Chugh, A.; Kohlhase, M.; Müller, D. Leveraging Large Language Models to Generate Course-Specific Semantically Annotated Learning Objects. J. Comput. Assist. Learn. 2025, 41, e13101. [Google Scholar] [CrossRef]
Kieser, F.; Wulff, P.; Kuhn, J.; Küchemann, S. Educational data augmentation in physics education research using ChatGPT. Phys. Rev. Phys. Educ. Res. 2023, 19, 020150. [Google Scholar] [CrossRef]
Zhang, J.; Bao, K.; Zhang, Y.; Wang, W.; Feng, F.; He, X. Large language models for recommendation: Progresses and future directions. In Proceedings of the Companion Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 1268–1271. [Google Scholar]
Rao, J.; Borchers, C.; Lin, J. Coursera-REC: Explainable MOOCs Course Recommendation Using RAG-Facilitated LLMs. 2024. Available online: https://osf.io/preprints/edarxiv/dnf7r_v1 (accessed on 22 June 2025).
Yu, X.; Mao, Q.; Wang, X.; Yin, Q.; Che, X.; Zheng, X. CR-LCRP: Course recommendation based on Learner–Course Relation Prediction with data augmentation in a heterogeneous view. Expert Syst. Appl. 2024, 249, 123777. [Google Scholar] [CrossRef]
Ding, K.; Xu, Z.; Tong, H.; Liu, H. Data augmentation for deep graph learning: A survey. ACM SIGKDD Explor. Newsl. 2022, 24, 61–77. [Google Scholar] [CrossRef]
Wang, J.; Le, Y.; Chang, B.; Wang, Y.; Chi, E.H.; Chen, M. Learning to augment for casual user recommendation. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 2183–2194. [Google Scholar]
Huang, F.; Bei, Y.; Yang, Z.; Jiang, J.; Chen, H.; Shen, Q.; Wang, S.; Karray, F.; Yu, P.S. Large Language Model Simulator for Cold-Start Recommendation. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, Hannover, Germany, 10–14 March 2025; pp. 261–270. [Google Scholar]
Wei, W.; Ren, X.; Tang, J.; Wang, Q.; Su, L.; Cheng, S.; Wang, J.; Yin, D.; Huang, C. Llmrec: Large language models with graph augmentation for recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida, Mexico, 4–8 March 2024; pp. 806–815. [Google Scholar]
Wang, J.; Lu, H.; Caverlee, J.; Chi, E.H.; Chen, M. Large language models as data augmenters for cold-start item recommendation. In Proceedings of the Companion Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 726–729. [Google Scholar]
Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Moschitti, A., Pang, B., Daelemans, W., Eds.; pp. 1532–1543. [Google Scholar] [CrossRef]
Mikolov, T.; Yih, W.T.; Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 746–751. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; pp. 4171–4186. [Google Scholar] [CrossRef]
Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. Adv. Neural Inf. Process. Syst. 2013, 2, 2787–2795. [Google Scholar]
Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; Zhu, X. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29. [Google Scholar]
Qiu, J.; Wang, S. Learning the concept embeddings of ontology. In Proceedings of the Advanced Data Mining and Applications: 16th International Conference, ADMA 2020, Foshan, China, 12–14 November 2020; Proceedings 16. Springer: Cham, Switzerland, 2020; pp. 127–134. [Google Scholar]
Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; Bordes, A. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 670–680. [Google Scholar]
Cer, D.; Yang, Y.; Kong, S.Y.; Hua, N.; Limtiaco, N.; St. John, R.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; Blanco, E., Lu, W., Eds.; pp. 169–174. [Google Scholar] [CrossRef]
Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; pp. 3982–3992. [Google Scholar] [CrossRef]
Yu, J.; Luo, G.; Xiao, T.; Zhong, Q.; Wang, Y.; Feng, W.; Luo, J.; Wang, C.; Hou, L.; Li, J.; et al. MOOCCube: A Large-scale Data Repository for NLP Applications in MOOCs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; pp. 3135–3142. [Google Scholar] [CrossRef]
Yang, Y.; Zhang, C.; Song, X.; Dong, Z.; Zhu, H.; Li, W. Contextualized Knowledge Graph Embedding for Explainable Talent Training Course Recommendation. ACM Trans. Inf. Syst. 2023, 42, 1–27. [Google Scholar] [CrossRef]
Frej, J.; Shah, N.; Knezevic, M.; Nazaretsky, T.; Käser, T. Finding Paths for Explainable MOOC Recommendation: A Learner Perspective. In Proceedings of the 14th Learning Analytics and Knowledge Conference, Kyoto, Japan, 18–22 March 2024; pp. 426–437. [Google Scholar]
Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 285–295. [Google Scholar]
Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
Kula, M. Metadata Embeddings for User and Item Cold-start Recommendations. In Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems Co-Located with 9th ACM Conference on Recommender Systems (RecSys 2015), Vienna, Austria, 16–20 September 2015; CEUR-WS.org, CEUR Workshop Proceedings. Bogers, T., Koolen, M., Eds.; Volume 1448, pp. 14–21. [Google Scholar]
He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182. [Google Scholar]
He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 639–648. [Google Scholar]
Agrawal, G.; Kumarage, T.; Alghamdi, Z.; Liu, H. Can Knowledge Graphs Reduce Hallucinations in LLMs?: A Survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; pp. 3947–3960. [Google Scholar] [CrossRef]
Kazlaris, I.; Antoniou, E.; Diamantaras, K.; Bratsas, C. From Illusion to Insight: A Taxonomic Survey of Hallucination Mitigation Techniques in LLMs. AI 2025, 6, 260. [Google Scholar] [CrossRef]

Figure 1. Overview of the proposed framework for integrating LLM-generated course concepts into recommendation models.

Figure 2. Prompt design for guiding LLM-based course concept generation across different datasets, consisting of task description, format indicator, and information injection.

Figure 3. Illustration of the three concept integration strategies: representation-level fusion, concatenation-based fusion, and structural-level fusion.

Figure 4. Performance comparison of baseline and concept-augmented models on the MOOCCube and XuetangX datasets. The Y-axis shows the absolute metric values (HR@10, NDCG@10, and MRR@10). The improvement percentages above each pair indicate the relative gains over the baseline.

Figure 5. Performance of integration strategies applied to MF and NeuMF models on the MOOCCube and XuetangX datasets, evaluated at K = 10 using HR@10, NDCG@10, and MRR@10.

Figure 6. Performance of integration strategies applied to FM and LightGCN models on the MOOCCube and XuetangX datasets, evaluated at K = 10 using HR@10, NDCG@10, and MRR@10.

Figure 7. Heatmap visualization of recommendation performance on the MOOCCube dataset, showing HR@10, NDCG@10, and MRR@10 scores for combinations of four LLMs (GPT3.5, GPT4omini, GPT4o, GPT4.1) and five prompt types (ZeroShot, OneShot, Description, Concept, AllInfo).

Figure 8. Performance comparison of different LLM models and prompt configurations on MOOCCube dataset.

Figure 9. Performance comparison of baseline and concept-augmented models under the cold-start setting at K = 10. The bars show absolute performance values, while the numbers above indicate the relative improvements (%) compared with the baseline.

Table 1. Statistics of the two datasets.

Dataset	MOOCCube	XuetangX
Users	33,838	35,753
Courses	645	764
Concepts	22,829	–
Interactions	263,522	301,269
Interaction Density	1.2074%	1.1029%

Table 2. Summary of side information integration strategies applied to each baseline model.

Model	Representation-Level	Concatenation-Based	Structure-Level
ItemKNN	✗	✗	similarity-weighted
MF	Linear, Gated, Attention	Concat	✗
FM	✗	One-hot, Cluster, DNN, Signal	✗
NeuMF	Linear, Gated, Attention	Concat	✗
LightGCN	Linear, Gated, Attention	Concat	graph augmentation
KEAM: utilizes a concept graph constructed from LLM-generated course concepts.

Table 3. Performance of Recommendation Models on MOOCCube Dataset at K = 5 and K = 10.

Model	K = 5					K = 10
Model	P@5	R@5	HR@5	NDCG@5	MRR@5	P@10	R@10	HR@10	NDCG@10	MRR@10
ItemKNN	0.0525	0.1464	0.2326	0.1118	0.1299	0.0344	0.1890	0.3005	0.1280	0.1391
w/Concept	0.0610	0.1725	0.2767	0.1324	0.1553	0.0472	0.2631	0.4018	0.1663	0.1717
MF	0.0341	0.0973	0.1617	0.0655	0.0724	0.0328	0.1839	0.2923	0.0989	0.0914
w/Concept	0.0753	0.2104	0.3251	0.1642	0.1930	0.0524	0.2890	0.4258	0.1935	0.2059
FM	0.0492	0.1442	0.2304	0.0950	0.1016	0.0419	0.2395	0.3651	0.1319	0.1209
w/Concept	0.0751	0.2097	0.3245	0.1639	0.1929	0.0531	0.2932	0.4320	0.1947	0.2065
NeuMF	0.0609	0.1719	0.2763	0.1269	0.1464	0.0469	0.2622	0.3978	0.1608	0.1622
w/Concept	0.0662	0.1862	0.2989	0.1413	0.1658	0.0511	0.2835	0.4269	0.1759	0.1792
LightGCN	0.0893	0.2470	0.3782	0.2012	0.2393	0.0625	0.3433	0.5014	0.2372	0.2556
w/Concept	0.1099	0.3071	0.4602	0.2447	0.2844	0.0767	0.4227	0.5948	0.2883	0.3024
KEAM	0.1136	0.3195	0.4756	0.2659	0.3156	0.0778	0.4323	0.6033	0.3083	0.3327
w/Concept	0.1162	0.3264	0.4836	0.2727	0.3229	0.0791	0.4387	0.6104	0.3150	0.3399

Table 4. Performance of Recommendation Models on XuetangX Dataset at K = 5 and K = 10.

Model	K = 5					K = 10
Model	P@5	R@5	HR@5	NDCG@5	MRR@5	P@10	R@10	HR@10	NDCG@10	MRR@10
ItemKNN	0.0456	0.1133	0.1634	0.0650	0.0548	0.0427	0.2280	0.2751	0.1079	0.0687
w/Concept	0.0858	0.2325	0.2865	0.1463	0.1248	0.0550	0.2983	0.3780	0.1710	0.1371
MF	0.0440	0.1141	0.1854	0.0798	0.0873	0.0396	0.2046	0.2987	0.1171	0.1074
w/Concept	0.0800	0.2035	0.2577	0.1680	0.1749	0.0514	0.2673	0.3425	0.1814	0.1694
FM	0.0398	0.1173	0.1784	0.0740	0.0739	0.0545	0.3083	0.3980	0.1465	0.1057
w/Concept	0.0328	0.0820	0.1543	0.0529	0.0577	0.0249	0.1229	0.2235	0.0704	0.0711
NeuMF	0.0271	0.0666	0.1203	0.0484	0.0576	0.0202	0.1039	0.1800	0.0614	0.0634
w/Concept	0.0447	0.1128	0.1893	0.0777	0.0852	0.0427	0.2308	0.3233	0.1178	0.0963
LightGCN	0.1016	0.2581	0.3338	0.2116	0.2272	0.0625	0.3433	0.5014	0.2372	0.2556
w/Concept	0.1229	0.3375	0.4273	0.2986	0.3184	0.0768	0.4161	0.5306	0.3283	0.3322
KEAM	0.0768	0.2018	0.2651	0.1605	0.1694	0.0495	0.2581	0.3439	0.1819	0.1797
w/Concept	0.1097	0.2939	0.3978	0.2443	0.2632	0.0699	0.3733	0.4990	0.2741	0.2767

Table 5. Performance of Recommendation Models on MOOCCube Dataset at K = 15 and K = 20.

Model	K = 15					K = 20
Model	P@15	R@15	HR@15	NDCG@15	MRR@15	P@20	R@20	HR@20	NDCG@20	MRR@20
ItemKNN	0.0290	0.2376	0.3666	0.1432	0.1444	0.0237	0.2570	0.3958	0.1488	0.1460
w/Concept	0.0401	0.3323	0.4873	0.1878	0.1784	0.0353	0.3869	0.5499	0.2031	0.1819
MF	0.0308	0.2565	0.3915	0.1222	0.1002	0.0279	0.3093	0.4587	0.1343	0.1000
w/Concept	0.0425	0.3506	0.5012	0.2122	0.2112	0.0361	0.3955	0.5516	0.2237	0.2120
FM	0.0358	0.3033	0.4475	0.1471	0.1189	0.0315	0.3510	0.5073	0.1635	0.1280
w/Concept	0.0427	0.3519	0.5015	0.2129	0.2119	0.0360	0.3943	0.5520	0.2243	0.2139
NeuMF	0.0393	0.3282	0.4767	0.1816	0.1690	0.0340	0.3763	0.5345	0.1945	0.1712
w/Concept	0.0425	0.3519	0.5105	0.1973	0.1858	0.0368	0.4035	0.5705	0.2118	0.1892
LightGCN	0.0504	0.4115	0.5809	0.2586	0.2618	0.0433	0.4692	0.6407	0.2748	0.2652
w/Concept	0.0609	0.4999	0.6744	0.3125	0.3087	0.0513	0.5578	0.7292	0.3288	0.3118
KEAM	0.0611	0.5044	0.6778	0.3310	0.3386	0.0510	0.5571	0.7268	0.3459	0.3413
w/Concept	0.0616	0.5093	0.6821	0.3371	0.3455	0.0515	0.5642	0.7335	0.3526	0.3484

Table 6. Performance of Recommendation Models on XuetangX Dataset at K = 15 and K = 20.

Model	K = 15					K = 20
Model	P@15	R@15	HR@15	NDCG@15	MRR@15	P@20	R@20	HR@20	NDCG@20	MRR@20
ItemKNN	0.0362	0.2907	0.3725	0.1273	0.0764	0.0296	0.3159	0.4114	0.1344	0.0786
w/Concept	0.0425	0.3425	0.4437	0.1849	0.1422	0.0362	0.3884	0.5068	0.1977	0.1458
MF	0.0321	0.2530	0.3624	0.1312	0.1109	0.0283	0.2943	0.4105	0.1421	0.1127
w/Concept	0.0407	0.3140	0.4120	0.1893	0.1677	0.0351	0.3607	0.4702	0.2067	0.1747
FM	0.0441	0.3628	0.4506	0.1651	0.1101	0.0365	0.3987	0.4947	0.1741	0.1114
w/Concept	0.0258	0.1983	0.3088	0.0982	0.0832	0.0223	0.2280	0.3467	0.1046	0.0836
NeuMF	0.0180	0.1397	0.2346	0.0736	0.0699	0.0206	0.2118	0.3238	0.0938	0.0746
w/Concept	0.0394	0.3130	0.4109	0.1457	0.1096	0.0332	0.3523	0.4648	0.1567	0.1126
LightGCN	0.0504	0.4115	0.5809	0.2586	0.2618	0.0433	0.4692	0.6407	0.2748	0.2652
w/Concept	0.0577	0.4646	0.5925	0.3437	0.3371	0.0474	0.5045	0.6397	0.3551	0.3397
KEAM	0.0381	0.2970	0.3967	0.1941	0.1839	0.0317	0.3285	0.4395	0.2031	0.1863
w/Concept	0.0533	0.4241	0.5614	0.2901	0.2816	0.0439	0.4624	0.6067	0.3010	0.2842

Table 7. Cold-Start Recommendation Performance on MOOCCube Dataset at K = 5, 10, 15, and 20.

Model	K = 5			K = 10			K = 15			K = 20
Model	HR	NDCG	MRR	HR	NDCG	MRR	HR	NDCG	MRR	HR	NDCG	MRR
ItemCF	0.0614	0.0570	0.0556	0.1933	0.0971	0.0707	0.2495	0.1118	0.0750	0.3086	0.1259	0.0783
w/Concept	0.2576	0.1992	0.1800	0.3333	0.2238	0.1902	0.3855	0.2375	0.1943	0.4283	0.2477	0.1967
MF	0.1064	0.0618	0.0473	0.1996	0.0902	0.0577	0.2824	0.1129	0.0652	0.3450	0.1266	0.0674
w/Concept	0.2363	0.1519	0.1242	0.3242	0.1800	0.1356	0.3811	0.1947	0.1396	0.4302	0.2069	0.1431
FM (BPR)	0.2259	0.1442	0.1171	0.3012	0.1689	0.1276	0.3494	0.1809	0.1303	0.3765	0.1862	0.1308
w/Concept	0.2363	0.1522	0.1245	0.3242	0.1801	0.1357	0.3859	0.1964	0.1406	0.4303	0.2069	0.1431
NCF	0.2172	0.1371	0.1107	0.3140	0.1690	0.1244	0.3763	0.1848	0.1284	0.4248	0.1962	0.1311
w/Concept	0.2228	0.1439	0.1179	0.3271	0.1775	0.1317	0.3923	0.1948	0.1369	0.4438	0.2069	0.1398
LightGCN	0.3062	0.2128	0.1821	0.3984	0.2406	0.1921	0.4631	0.2577	0.1971	0.5184	0.2707	0.2002
w/Concept	0.3687	0.2534	0.2154	0.4796	0.2893	0.2303	0.5510	0.3083	0.2359	0.6028	0.3205	0.2388
KEAM	0.3398	0.2585	0.2315	0.4095	0.2811	0.2409	0.4446	0.2904	0.2436	0.4689	0.2962	0.2450
w/Concept	0.3480	0.2656	0.2382	0.4182	0.2884	0.2477	0.4529	0.2976	0.2505	0.4777	0.3035	0.2519

Table 8. Cold-Start Recommendation Performance on XuetangX Dataset at K = 5, 10, 15, and 20.

Model	K = 5			K = 10			K = 15			K = 20
Model	HR	NDCG	MRR	HR	NDCG	MRR	HR	NDCG	MRR	HR	NDCG	MRR
ItemCF	0.1542	0.0910	0.0704	0.2531	0.1245	0.0851	0.2981	0.1364	0.0886	0.3407	0.1464	0.0910
w/Concept	0.2681	0.1671	0.1337	0.3709	0.2005	0.1476	0.4350	0.2175	0.1526	0.4981	0.2324	0.1562
MF	0.1014	0.0600	0.0465	0.1720	0.0819	0.0549	0.2339	0.0976	0.0589	0.2807	0.1080	0.0607
w/Concept	0.1474	0.0890	0.0699	0.2159	0.1117	0.0797	0.2401	0.1052	0.0660	0.3062	0.1344	0.0876
FM (BPR)	0.0632	0.0439	0.0375	0.1322	0.0761	0.0590	0.1645	0.0911	0.0690	0.1929	0.1004	0.0737
w/Concept	0.1140	0.0737	0.0604	0.1551	0.0782	0.0546	0.2036	0.0931	0.0604	0.2491	0.1090	0.0716
NCF	0.1110	0.0694	0.0558	0.1440	0.0806	0.0610	0.1889	0.0913	0.0628	0.2405	0.1030	0.0653
w/Concept	0.1108	0.0692	0.0557	0.1902	0.0947	0.0661	0.2555	0.1120	0.0712	0.3113	0.1252	0.0744
LightGCN	0.3683	0.2664	0.2327	0.4565	0.2914	0.2407	0.5303	0.3110	0.2466	0.5749	0.3215	0.2491
w/Concept	0.4300	0.3329	0.3037	0.5437	0.3694	0.3155	0.6163	0.3886	0.3212	0.6656	0.4002	0.3240
KEAM	0.1530	0.1043	0.0883	0.2317	0.1302	0.0992	0.2834	0.1438	0.1033	0.3261	0.1539	0.1057
w/Concept	0.4087	0.3103	0.2777	0.5103	0.3431	0.2911	0.5681	0.3584	0.2957	0.6074	0.3677	0.2979

Table 9. Comparison of Top-10 Recommendation Performance between General and Cold-Start Settings on XuetangX Dataset.

Model	General Recommendation			Cold-Start Recommendation
Model	HR@10	NDCG@10	MRR@10	HR@10	NDCG@10	MRR@10
ItemKNN	0.2751	0.1079	0.0687	0.2531	0.1245	0.0851
w/Concept	0.3780	0.1710	0.1371	0.3709	0.2005	0.1476
MF	0.2987	0.1171	0.1074	0.1720	0.0819	0.0549
w/Concept	0.3425	0.1814	0.1694	0.2159	0.1117	0.0797
FM (BPR)	0.3980	0.1465	0.1057	0.1322	0.0761	0.0590
w/ Concept	0.2235	0.0704	0.0711	0.1551	0.0782	0.0546
NeuMF	0.1800	0.0614	0.0634	0.1440	0.0806	0.0610
w/Concept	0.3233	0.1178	0.0963	0.1902	0.0947	0.0661
LightGCN	0.5014	0.2372	0.2556	0.4565	0.2914	0.2407
w/Concept	0.5306	0.3283	0.3322	0.5437	0.3694	0.3155
KEAM	0.3439	0.1819	0.1797	0.2317	0.1302	0.0992
w/Concept	0.4990	0.2741	0.2767	0.5103	0.3431	0.2911

Table 10. Comparison of Top-10 Recommendation Performance between General and Cold-Start Settings on MOOCCube Dataset.

Model	General Recommendation			Cold-Start Recommendation
Model	HR@10	NDCG@10	MRR@10	HR@10	NDCG@10	MRR@10
ItemKNN	0.3005	0.1280	0.1391	0.1933	0.0971	0.0707
w/Concept	0.4018	0.1663	0.1717	0.3333	0.2238	0.1902
MF	0.2923	0.0989	0.0914	0.1996	0.0902	0.0577
w/Concept	0.4258	0.1935	0.2059	0.3242	0.1800	0.1356
FM (BPR)	0.3651	0.1319	0.1209	0.3012	0.1689	0.1276
w/Concept	0.4320	0.1947	0.2065	0.3242	0.1801	0.1357
NeuMF	0.3978	0.1608	0.1622	0.3140	0.1690	0.1244
w/Concept	0.4269	0.1759	0.1792	0.3271	0.1775	0.1317
LightGCN	0.5014	0.2372	0.2556	0.3984	0.2406	0.1921
w/Concept	0.5948	0.2883	0.3024	0.4796	0.2893	0.2303
KEAM	0.6033	0.3083	0.3327	0.4095	0.2811	0.2409
w/ Concept	0.6104	0.3150	0.3399	0.4182	0.2884	0.2477

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Enhancing Course Recommendation with LLM-Generated Concepts: A Unified Framework for Side Information Integration

Abstract

1. Introduction

2. Related Work

2.1. Course Recommendation

2.2. Concept Extraction

2.3. Large Language Models in Education

2.4. LLM-Based Data Augmentation in Recommender Systems

3. Proposed Framework

3.1. Overview of the Framework

3.2. LLM-Based Concept Generation & Embedding

3.2.1. Concept Generation with LLMs

3.2.2. Semantic Embedding of Concepts

3.3. Side Information Integration Strategies

3.3.1. Representation-Level Fusion

3.3.2. Concatenation-Based Fusion

3.3.3. Structural Fusion

4. Experimental Setup

4.1. Datasets

4.2. Evaluation Metrics

4.3. Baselines

4.4. Training Details

4.5. Model-Specific Integration Settings

5. Results

5.1. Overall Performance Comparison

5.2. Effectiveness of Different Integration Strategies

5.3. Influence of Different LLMs

5.4. Cold-Start Scenario Recommendation

6. Discussion

Hallucination Risks and Mitigation Strategies

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics