A Structure-Aware and Attention-Enhanced Explainable Learning Resource Recommendation Approach for Smart Education Within Smart Cities

Bu, Tianxue; Zheng, Hao; Zhao, Fen

doi:10.3390/electronics14234561

Open AccessArticle

A Structure-Aware and Attention-Enhanced Explainable Learning Resource Recommendation Approach for Smart Education Within Smart Cities

by

Tianxue Bu

^1,2,

Hao Zheng

^1,2,* and

Fen Zhao

^2,*

¹

School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing 210023, China

²

School of Artificial Intelligence, Nanjing Xiaozhuang University, Nanjing 211171, China

^*

Authors to whom correspondence should be addressed.

Electronics 2025, 14(23), 4561; https://doi.org/10.3390/electronics14234561

Submission received: 16 October 2025 / Revised: 13 November 2025 / Accepted: 18 November 2025 / Published: 21 November 2025

(This article belongs to the Special Issue Advances in AI-Augmented E-Learning for Smart Cities)

Download

Browse Figures

Versions Notes

Abstract

With the rapid advancement in smart city infrastructures, the demand for personalized and explainable educational services has become increasingly prominent. To address the challenges of information overload and the lack of interpretability in traditional learning resource recommendation, this paper proposes a Structure-aware and Attention-enhanced explainable learning resource Recommendation approach (StAR) for smart education. StAR constructs a reinforcement learning framework grounded in a knowledge graph to model learner–resource interactions. First, a multi-head attention mechanism encodes path states and extracts key semantic features, enhancing the model’s ability to represent complex learning contexts. Then, a dual-layer action pruning strategy compresses the action space and improves reasoning efficiency. Finally, a structure-aware reward function guides the generation of semantically coherent and interpretable recommendation paths. Experiments on two real-world educational datasets, COCO and MoocCube, demonstrate that StAR outperforms several baseline models, achieving improvements of 14.2% and 12.6% in NDCG and Recall on COCO, and 5.2% and 4.2% on MoocCube, respectively. The results validate the effectiveness of StAR in enhancing recommendation accuracy, reasoning efficiency, and interpretability, offering a promising AI-enhanced solution for personalized learning in smart cities.

Keywords:

smart education; smart cities; e-learning; knowledge graph; reinforcement learning; explainable AI

1. Introduction

With the rapid advancement in information technology, online education has become an integral component of the digital transformation in smart cities, creating new opportunities for promoting educational equity and lifelong learning [1]. However, the proliferation of online learning resources often leads to “information overload”, making the intelligent and personalized recommendation of learning materials a critical challenge in the context of smart education [2]. Unlike conventional recommendation scenarios, learning resource recommendation requires a multi-dimensional evaluation framework. It must consider learners’ knowledge backgrounds, learning preferences, and prerequisite course relationships to achieve truly adaptive instruction [3]. Despite the growth in online education platforms, many still employ standardized models that fail to address the increasing demand for personalization and diversity [4]. Therefore, the development of interpretable and intelligent recommendation methods is essential—not only to alleviate information overload but also to foster personalized learning and support the digital transformation of education in smart cities [5].

In recent years, collaborative filtering (CF), content-based filtering (CBF), and sequence-based methods have been widely adopted in the recommendation domain, but these methods struggle to effectively capture learners’ knowledge backgrounds, learning goals, and prerequisite relationships among courses in educational settings, and they also lack interpretability, thereby reducing the credibility of the recommendations [6,7,8]. To address these issues, researchers have introduced knowledge graphs, leveraging knowledge representation learning and graph neural networks to capture high-order associations within the graph, thereby improving the accuracy and interpretability of recommendations [9]. For instance, Xian et al. [10] proposed the Policy-Guided Path Reasoning (PGPR) model, which utilizes multi-hop path reasoning on knowledge graphs to generate interpretable decision paths for personalized recommendations. Subsequently, Frej et al. [11] extended this approach by proposing the Unrestricted Policy-Guided Path Reasoning (UPGPR), which removes manual constraints on reasoning paths and introduces a refined reward mechanism to expand the exploration space and improve cross-scenario generalization. Moreover, due to the large number of entities and complex relations in knowledge graphs, reinforcement learning (RL) methods often encounter an overwhelming action space during path exploration, necessitating efficient action pruning strategies to enhance model efficiency [12,13]. Meanwhile, reward signals in multi-hop path reasoning are often sparse, making it a critical research challenge to effectively leverage the graph’s structural information to provide meaningful intermediate feedback for guiding decision-making [14,15].

To address the aforementioned challenges, this paper proposes an explainable learning resource recommendation method—Structure-aware and Attention-enhanced Recommendation (StAR), an AI-augmented approach designed for smart education in smart cities. The proposed method enhances state representation via multi-head attention (MHA), regulates the action space through a dual-layer action pruning strategy, and designs a structure-aware reward function to guide path exploration, thereby significantly improving both the accuracy and interpretability of recommendation outcomes.

The main contributions of this paper are as follows:

We employ MHA to encode path states, effectively extracting critical semantic features. This enhances the agent’s ability to represent complex environments and addresses the limitations of existing models in state representation.
We design a dual-layer action pruning strategy, applied at both the environment and policy levels, to compress the action space and maintain reasoning efficiency without sacrificing recommendation quality.
We develop a structure-aware reward function that integrates structural and semantic information from the knowledge graph, alleviating reward sparsity and steering the agent toward logically consistent and interpretable paths.
We conduct extensive experiments on two real-world educational datasets, COCO and MoocCube, and demonstrate that StAR outperforms several strong baselines across multiple evaluation metrics, confirming its effectiveness for learning resource recommendation in smart education scenarios.

2. Related Work

Traditional recommendation methods, such as CF, CBF, and sequence-based approaches, have achieved remarkable success in domains like e-commerce and multimedia services [7,16,17]. However, their direct application to educational contexts exposes several fundamental limitations. For instance, Sarwar et al. [18] introduced the Item-based Collaborative Filtering (Item-CF) method, which effectively leverages historical user-item interaction data for recommendations, but its performance heavily depends on data richness and tends to degrade in educational settings characterized by sparse user behavior. The CBF model proposed by Pazzani et al. [19] typically relies on shallow features such as course titles and descriptions, making it difficult to accurately model the intricate knowledge structures and prerequisite dependencies among courses. Moreover, traditional recommendation techniques generally lack transparency and fail to provide explanations aligned with educational reasoning. For example, Sinha et al. [20] demonstrated that the absence of clear recommendation rationales and logical justification significantly undermines user trust in the system and negatively affects learners’ willingness to accept and adopt the recommendations. In addition, regarding sequential recommendation, Rendle et al.’s [21] Factorizing Personalized Markov Chains (FPMC) model effectively captures short-term user behavior sequences but fails to accurately reflect learners’ long-term learning goals and path consistency, thereby limiting its practical value in educational contexts. Consequently, these conventional approaches are unable to simultaneously meet the dual demands of recommendation accuracy and educational interpretability in smart, personalized e-learning scenarios.

To address the limitations of traditional methods, researchers have begun incorporating knowledge graphs into recommendation systems to enhance the accuracy and interpretability of recommendation results. Knowledge graph-based recommendation methods are mainly categorized into embedding-based and path-based approaches. Embedding-based methods utilize knowledge graph embedding techniques (e.g., TransE [22]) to represent entities and relations as low-dimensional vectors, thereby integrating semantic information. For example, Zhang et al. [23] proposed a collaborative knowledge base embedding method (CKE), which integrates structured knowledge, textual content, and image information to enhance user and item representation learning. Wang et al. [24] proposed KGAT, a recommendation model that incorporates attention mechanisms into knowledge graphs. It captures high-order connectivity via recursive propagation and uses attention to weigh different relations, breaking the independence assumption between users and items and significantly improving recommendation accuracy. However, these methods often treat knowledge graphs as auxiliary information sources and fail to fully exploit their rich structural features and semantic associations, resulting in limited model interpretability. In contrast, path-based methods aim to improve both accuracy and interpretability by mining connection paths within the knowledge graph, such as Wang et al.’s [25] Knowledge-aware Path Recurrent Network (KPRN). Although these methods provide strong interpretability, they often depend on manually designed path templates, lacking flexibility and adaptability to dynamic user preferences. To overcome these limitations, recent research has explored automated path mining and RL techniques to enable more flexible and efficient recommendation strategies. These approaches are particularly promising for smart education systems, where personalized and explainable learning pathways are essential.

Distinct from the aforementioned approaches, this work focuses on the challenges of learning resource recommendation within complex educational knowledge graph structures. To address the issues of insufficient state representation, expansive action space, and sparse reward signals, we propose StAR, a Structure-aware and Attention-enhanced Explainable Recommendation framework. By jointly optimizing state modeling, action pruning, and reward design, StAR establishes a reinforcement learning-based mechanism that balances interpretability and reasoning efficiency. This framework contributes to the development of AI-augmented, explainable, and personalized e-learning systems that align with the vision of smart education in smart cities.

3. Methodology

To address three limitations in existing recommenders—insufficient state modeling, expansive action space, and sparse rewards—we present StAR, an explainable learning-resource recommendation framework for smart education. StAR leverages a structured knowledge graph that encodes semantic relationships among learners, resources, and concepts, thereby enabling interpretable, adaptive recommendations aligned with smart-city educational ecosystems. The core idea is to train an RL agent to navigate the knowledge-graph environment so that recommendation becomes a transparent path-reasoning process. Specifically, we formulate the task as a Markov decision process (MDP) with state, action, reward, and policy components [26,27].

As illustrated in Figure 1, the specific workflow of StAR is as follows:

Environment construction and state initialization. The educational knowledge graph (entities and relations) defines the RL environment; each episode initializes the state from the target learner’s context.
State representation optimization. An MHA module captures long-range dependencies in the path history and highlights salient semantics, yielding a more expressive state.
Action space pruning. To mitigate action-space explosion, we first perform environment-level pruning using TransE similarities and stochastic dropout to form an exploratory candidate set ${\tilde{A}}_{t} (u)$ , and then apply policy-level Top-K pruning over Actor–Critic scores to obtain the executable set ${\hat{A}}_{t}$ .
Reward function design and path exploration. We combine node centrality, path relevance, final-node similarity, and path novelty, plus a mild self-loop penalty, to provide dense, informative guidance for coherent path generation.

The agent, based on the optimized action set and enhanced state representation, continuously selects optimal actions under the guidance of the structure-aware reward mechanism, gradually constructing semantically coherent recommendation paths to achieve accurate learning resource recommendations. Meanwhile, the generated recommendation paths can be traced back to provide explicit recommendation rationales, significantly improving transparency and user trust.

3.1. Task Definition

In the explainable learning resource recommendation task investigated in this paper, we define the set of users as

U = {u_{1}, u_{2}, \dots, u_{| U |}}

and the set of courses as

C = {c_{1}, c_{2}, \dots, c_{| C |}}

, where

| U |

and

| C |

denote the number of users and courses, respectively. To support the structured representation of learning resources, we construct a knowledge graph

G = (E, R)

, where each edge is represented as a triple

(h, r, t)

, with

h, t \in E

representing entities and

r \in R

denoting the relation type. The knowledge graph

G

comprises six types of entities and five types of relations, as shown in Table 1. We employ the TransE method for embedding learning to provide semantic support for state representation and action pruning. In this environment, the recommendation task is modeled as a Markov Decision Process (MDP) defined as follows:

State: The state at time t is represented as $S_{t} = (u, e_{t}, h_{t})$ , where $u \in U$ denotes the user, $e_{t} \in E$ represents the current entity, and $h_{t}$ encapsulates the historical context.
Action Space: The action space at time t is defined as $A_{t} = {(r_{i}, e_{i}^{'})}$ , where $r_{i} \in R$ is a relation type and $e_{i}^{'} \in E$ is the next entity.
Reward: The reward function $R_{t}$ is designed to guide the agent toward high-quality path exploration.
Policy: The policy $π (a_{t} ∣ S_{t})$ governs the selection of action $a_{t} \in A_{t}$ given the state $S_{t}$ .

The objective is to learn an optimal policy

π^{*}

, such that the reasoning path from the user node leads to the accurate recommendation of a course

c \in C

, while maintaining strong interpretability.

3.2. State Representation Optimization Based on Multi-Head Attention

In RL-based path reasoning tasks, the quality of state representation directly affects the decision-making performance of the policy network [28,29]. Traditional approaches often rely on single-head attention or simple vector concatenation for state modeling, which can capture local information but often fail to model long-range dependencies and diverse semantic associations within paths, thereby limiting their ability to fully encode contextual state information. To address this, we introduce a state representation optimization method based on MHA to capture long-range dependencies within the state vector and more accurately represent the learner’s current status. MHA enables parallel extraction of semantic features from different dimensions of the state across multiple subspaces, and assigns differentiated attention weights to key nodes in historical paths, alleviating problems of information redundancy and uniform weighting. Compared with single-head attention, the multi-head mechanism offers advantages in modeling semantic diversity and improving model stability, making it well-suited for complex state expression in knowledge graph-based path reasoning. Specifically, the MHA computation is defined as

MultiHead (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{h}) W^{o}

(1)

where Q, K, and V denote the query, key, and value matrices, respectively, h is the number of heads, and

W^{O}

is the output projection matrix. Each attention head is computed based on the following expression:

{head}_{i} = Attention (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})

(2)

The attention function, where

d_{k}

represents the dimension of the key vectors, is defined accordingly as

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(3)

As illustrated in Figure 2, integrating user embeddings, the current node, and historical path information, we construct the state representation vector

S_{t}

, which is then encoded using MHA. The MHA captures key features within the state

S_{t}^{'}

, with each attention head independently processing distinct semantic subspace features of the state input. This approach effectively captures critical nodes and contextual dependencies within the path, resulting in an enhanced state representation

S_{t}^{'}

, which more accurately reflects the learner’s state characteristics. This process is mathematically expressed as

S_{t}^{'} = MultiHead (S_{t} W^{Q}, S_{t} W^{K}, S_{t} W^{V})

(4)

3.3. Dual-Layer Action Pruning Strategy

In knowledge graph path recommendation, the large number of node connections leads to an expansive action space, making it hard for RL agents to explore high-quality paths efficiently [30,31]. To address this issue, we propose a dual-layer action pruning strategy that operates in coordination at both the environment and policy levels, with the pseudocode implementation shown in Algorithm 1.

This approach adopts a “coarse-to-fine” process to effectively reduce the action space and improve decision quality: the environment layer filters semantically relevant actions to form an exploratory candidate set; the policy layer then uses a policy network to further select high-quality actions with strong discriminative power. These two layers work together to improve search efficiency and significantly boost both recommendation accuracy and interpretability.

Environment-level action pruning: The goal of environment-level pruning is to preliminarily reduce the action space and improve the efficiency of path search. First, previously visited nodes are removed to prevent path loops and redundant transitions. Then, node embeddings learned via the TransE model are employed to compute the semantic similarity between the current node and each candidate node:

$sim (i, j) = cos (h_{i}, h_{j})$

(5)

where $h_{i}$ and $h_{j}$ represent the embedding vectors of node i and node j, respectively. Then, the top-N nodes are selected based on their similarity scores, and a Dropout mechanism is applied to randomly sample from them, forming a semantically relevant and exploratory initial action set ${\tilde{A}}_{t} (u)$ .
Policy-level action pruning: Policy-level pruning further enhances the precision of action selection. An Actor-Critic policy network is employed to score the initial candidate action set:

$s_{i} = f_{θ} (S_{t}, a_{i}), π (a_{i} ∣ S_{t}) = softmax (s_{i})$

(6)

${\hat{A}}_{t} = {TopK}_{a_{i} \in {\tilde{A}}_{t} (u)} (π (a_{i} ∣ S_{t}))$

(7)

where $S_{t}$ represents the current state, $f_{θ} (\cdot)$ denotes the policy network (Actor) that computes the state-action score function, and $π (a_{i} ∣ S_{t})$ indicates the probability of selecting action $a_{i}$ under state $S_{t}$ . Ultimately, the top K actions with the highest probabilities are selected to form the final executable action set ${\hat{A}}_{t}$ , thereby effectively reducing interference from low-value actions.

Algorithm 1 Pseudocode of the double-layer action pruning strategy.
Require: User embedding u, current state $S_{t}$ , current node $e_{t}$ , candidate action set $A_{t}$ , historical path $h_{t}$ , entity embedding e, pruning threshold N, dropout probability $ρ$ , policy network $f_{θ}$ , number of actions to retain K
Ensure: Final executable action set ${\hat{A}}_{t}$
// Environment Layer Pruning
1: ${\tilde{A}}_{t} (u) \leftarrow \emptyset$
2: $L \leftarrow \emptyset$	▹ Temporary list
3: for $(r_{i}, e_{i}^{'}) \in A_{t}$ do
4: if $e_{i}^{'} \in h_{t}$ then
5: continue	▹ Remove historical nodes
6: end if
7: $s_{i} \leftarrow cos (u, e [e_{i}^{'}])$	▹ Compute semantic similarity
8: $L \leftarrow L \cup {(r_{i}, e_{i}^{'}, s_{i})}$
9: end for
10: $A_{t}^{'} \leftarrow top - N actions from L sorted by s_{i} (descending)$
11: for $(r_{j}, e_{j}^{'}, s_{j}) \in A_{t}^{'}$ do
12: if rand() > $ρ$ then
13: ${\tilde{A}}_{t} (u) \leftarrow {\tilde{A}}_{t} (u) \cup {(r_{j}, e_{j}^{'})}$	▹ Randomly retain to form preliminary action set
14: end if
15: end for
// Policy Layer Pruning
16: $P \leftarrow \emptyset$	▹ Action score list
17: for $a_{i} \in {\tilde{A}}_{t} (u)$ do
18: $s_{i} \leftarrow f_{θ} (S_{t}, a_{i})$	▹ Policy network scoring
19: $P \leftarrow P \cup {(a_{i}, s_{i})}$
20: end for
21: Apply softmax normalization to scores in P, obtaining $π (a_{i} ∣ S_{t})$
22: ${\hat{A}}_{t} \leftarrow top - K actions from P based on π (a_{i} ∣ S_{t})$
23: return ${\hat{A}}_{t}$

3.4. Design of Structure-Aware Reward Function

In RL-based learning resource recommendation, the reward function critically affects both recommendation accuracy and exploration efficiency [32,33]. Existing methods typically use cosine similarity between the user and the final item as the reward signal, neglecting the structural and semantic information inherent in the knowledge graph, which limits the interpretability and precision of the recommendation. To address this, we propose a structure-aware multi-dimensional reward function, which comprehensively incorporates the structural properties and semantic relations of the knowledge graph, to more effectively guide path generation and policy learning.

As illustrated in Figure 3, the structure-aware reward function consists of four complementary dimensions: Node centrality reward encourages the agent to prioritize nodes with strong information aggregation ability, thereby enhancing the representativeness and structural coverage of the path. Path relevance reward emphasizes semantic coherence between nodes to improve logical consistency and interpretability. The final node similarity reward ensures a high degree of alignment between the recommended target and the user’s interests, thereby achieving personalized recommendations. Path novelty reward encourages the exploration of low-frequency entities to improve the diversity and knowledge expansion of recommendations. In addition, a self-loop penalty mechanism is introduced as an auxiliary term to penalize repeated visits and reduce the generation of redundant paths.

Node Centrality Reward

R_{c}

. The node centrality reward is designed to encourage the agent to prioritize traversing high-value nodes that occupy critical positions in the knowledge graph. In a knowledge graph, these central nodes are often highly connected and aggregate key information, such as core concepts shared across multiple popular courses. This method adopts node degree as the metric for measuring centrality, and applies logarithmic normalization to suppress extreme values, as defined below:

R_{c} = \sum_{e_{i} \in P, r_{i} \neq r_{0}} min (1, \frac{log (1 + d (e_{i}))}{10})

(8)

where

P

denotes the set of nodes visited prior to the current node in the path,

d (e_{i})

represents the degree of node

e_{i}

, and

r_{i} \neq r_{0}

indicates that self-loops are excluded.

Path Relevance Reward

R_{r}

. This reward is used to measure the semantic coherence between adjacent nodes in the path. Semantically coherent paths tend to be more logical and interpretable. For example, a path from “Linear Algebra” to “Matrix Decomposition” is more logical than one linking to “Literary Appreciation.” The reward is defined as

R_{r} = \sum_{i = 0}^{T - 2} I (r_{i + 1} \neq r_{0}) \cdot max (0, cos (e_{i}, e_{i + 1}))

(9)

where T represents the total number of nodes in the path, and

e_{i}

denotes the embedding of the i-th node in the path.

Final Node Similarity Reward

R_{f}

.

R_{f}

is designed to evaluate the alignment between the recommended course and the user’s interests, and serves as a key indicator for achieving personalized recommendations. It calculates the cosine similarity between the user embedding and the course embedding, with a lower bound to avoid negative influence, defined as

R_{f} = max (0, cos (u + r_{inter}, P))

(10)

where u represents the user embedding,

r_{inter}

denotes the interaction embedding, and P indicates the course embedding. If the recommended course p appears in the training interaction records of user u, the reward is further multiplied by a weighting factor to enhance the model’s memory of historical preferences, resulting in the following formulation:

R_{f} = \{\begin{matrix} max (0, cos (u + r_{inter}, p)) \cdot 1.2, & if p \in T_{train} (u) \\ max (0, cos (u + r_{inter}, p)), & otherwise \end{matrix}

(11)

Path Novelty Reward

R_{n}

. To avoid over-dependence on central nodes and reduce path homogeneity, we introduce a path novelty reward, encouraging the agent to explore low-degree and less popular nodes, thereby improving both diversity and knowledge coverage. This mechanism serves as a counterpart to centrality reward, helping to balance recommendation stability and exploratory behavior. The reward is defined as

R_{n} = \sum_{e_{i} \in P, r_{i} \neq r_{0}} min (0.5, \frac{1}{\sqrt{d (e_{i})}})

(12)

To avoid overemphasis on extremely rare nodes in path decisions, we cap the novelty reward per node at 0.5. This setup helps the model balance its focus on the core graph structure while enabling exploration of less-visited regions, thus improving recommendation diversity and knowledge discovery.

Additionally, to reduce ineffective paths, this paper introduces a self-loop penalty mechanism, which prevents the agent from remaining at the same node and making redundant visits, defined as follows:

{Penalty}_{loop} = 1 - λ \cdot N_{loop}, λ = 0.1

(13)

where

N_{loop}

represents the number of loops or self-repetitions in the path, and

λ = 0.1

is a weighting factor that adjusts the penalty magnitude. However, due to the environment-level and policy-level pruning strategies, redundant loops are already extremely rare in practice. Therefore, the self-loop penalty term mainly serves as a safeguard to handle degenerate cases rather than a frequently activated constraint. In this paper,

λ

is set to 0.1, which represents a conservative and empirically stable value that effectively discourages redundant self-loops while maintaining model stability and accuracy.

By integrating the above components, the final structure-aware reward function is defined as follows:

R = max (0, (w_{c} \cdot R_{c} + w_{r} \cdot R_{r} + w_{f} \cdot R_{f} + w_{n} \cdot R_{n}) \cdot {Penalty}_{loop})

(14)

where the weighting factors are

w_{c} = 0.2

,

w_{r} = 0.3

,

w_{f} = 0.4

,

w_{n} = 0.1

. This reward function integrates centrality, relevance, user preference, and novelty, while penalizing loops to encourage diverse and effective path exploration in the RL process.

4. Experiments

4.1. Datasets and Experimental Settings

To evaluate the efficacy of StAR in smart education scenarios, we conduct experiments on two real educational datasets: the open-source MoocCube [34] and the licensed COCO [35]. A unified data preprocessing pipeline is designed for both datasets, including entity and relation extraction, filtering samples with fewer than 10 interactions, splitting into training/validation/test sets, constructing course knowledge graphs, and embedding learning using TransE, along with concept extraction and alignment using the SkillNer 1.0.3 [36] toolkit, providing high-quality structured knowledge to support model training. The statistics of the datasets are shown in Table 2.

Experiments were conducted on Ubuntu 20.04.6 LTS, using an NVIDIA RTX 4090 GPU, Python 3.10, and CUDA 12.6, and the model was trained with PyTorch 2.9.0. We use an Actor-Critic framework with 4-head MHA to improve state representation. Key training hyperparameters are shown in Table 3. We adopt an interaction-wise, mutually exclusive split with a fixed random seed, in which each user-course interaction is assigned to exactly one subset (80% for training, 10% for validation, and 10% for testing). The RL agent is trained under an on-policy Actor-Critic (AC) framework, where the advantage is computed as

A_{t} = R_{t} - V (s_{t})

. The discount factor (

γ = 0.99

) and entropy coefficient (0.01) remain constant throughout training. Actions are sampled from a softmax (Boltzmann) policy during training and selected via deterministic top-k beam search during testing. Training is conducted for a fixed number of epochs with a linear learning-rate decay schedule and without early stopping, and all random seeds (default = 23) are fixed to ensure reproducibility. To enhance the stability of the embedding space, we apply L2 regularization to all entity and relation embeddings during TransE pretraining and adopt a smoothed negative-sampling distribution to reduce training variance. We further employ gradient clipping and patience-based early stopping to avoid unstable parameter updates. In the policy network, entropy regularization and Dropout are used to smooth policy learning and mitigate overfitting.

4.2. Performance Evaluation Metrics

To comprehensively and objectively evaluate the performance of the proposed StAR method, this study adopts four commonly used evaluation metrics in recommendation systems: Normalized Discounted Cumulative Gain (NDCG), Recall, Hit Rate (HR), and Precision, to evaluate recommendation accuracy and effectiveness from multiple perspectives. The definitions of each metric are as follows:

NDCG@K = \frac{DCG@K}{IDCG@K}

(15)

Recall@K = \frac{| R (u) \cap T (u) |}{| T (u) |}

(16)

Precision@K = \frac{| R (u) \cap T (u) |}{| R (u) |}

(17)

HR@K = \{\begin{matrix} 1, & if any relevant item is in the Top-K list \\ 0, & otherwise \end{matrix}

(18)

where

R (u)

denotes the set of recommended items for user u, and

T (u)

represents the set of ground truth items for user u. These metrics are used to evaluate the quality of the recommendations, with NDCG@K providing a normalized measure of ranking quality, Recall@K and Precision@K assessing the overlap with ground truth, and HR@K indicating the presence of relevant items in the Top-K list. Consistent with PGPR [10] and UPGPR [11], we evaluate all models at the standard cutoff K = 10, which is widely adopted in knowledge-graph-based recommendation tasks.

4.3. Results and Analysis

To verify the effectiveness of the proposed method, we compare StAR with five recommenders—Popularity-based Recommendation (Pop), Collaborative Filtering over Knowledge Graphs (CFKG), KGAT [24], PGPR [10], and UPGPR [11]—on the COCO and MoocCube datasets.

The experimental results are shown in Table 4, the best results on COCO and MoocCube are marked in bold. On both datasets, StAR outperforms all baseline models across all evaluation metrics, demonstrating superior recommendation performance and generalization capability in complex educational knowledge graph scenarios.

Compared with popularity- and CF-based methods (Pop, CFKG), StAR exploits graph structure and user preference more effectively, yielding more accurate and diverse recommendations. Relative to the GNN-based KGAT, StAR performs RL path reasoning and produces explicit recommendation paths, improving interpretability. Compared with RL-based PGPR and UPGPR, StAR enhances state representation, applies dual-layer action pruning, and adopts a structure-aware reward, together improving exploration efficiency and path coherence. Overall, the results corroborate StAR’s advantages in accuracy, interpretability, and reasoning efficiency.

Furthermore, to provide a more comprehensive evaluation of the statistical robustness and sensitivity of the proposed model, we conducted additional Top-K analyses on the MoocCube dataset. Figure 4 presents the performance curves of NDCG@K and Recall@K for

K \in {5, 10, 20, 50}

; the shaded regions denote the 95% confidence intervals (CIs) estimated through user-level bootstrap resampling, illustrating the model’s robustness. As shown in the figure, StAR demonstrates stable and consistent performance across different K values. As K increases, both NDCG@K and Recall@K rise steadily for all models, reflecting the expected improvement with broader recommendation lists. Notably, StAR consistently achieves the highest values across all K settings, maintaining clear performance margins and exhibiting strong robustness and generalization under varying cutoffs. The most significant improvements occur when K increases from 5 to 10, after which the gains gradually plateau for

K \geq 20

, suggesting that the model reaches a stable performance regime at deeper recommendation depths. Therefore, the main results are reported at

K = 10

, which provides a balanced and representative evaluation point consistent with prior studies.

To ensure readability and maintain a clear presentation, detailed statistical variations (e.g., standard deviations and confidence intervals) are reported only for the main comparison results in Table 4 and the Top-K robustness analysis in Figure 4, while other tables summarize user-level mean performance for conciseness and consistency with prior studies.

4.4. Ablation Experiment

To further validate the contribution of each module to overall performance, we conducted systematic ablation studies on the COCO and MoocCube datasets. Specifically, we removed the MHA mechanism (

- MHA

), the Policy-layer Action Pruning strategy (−P_A_P), and the Structure-aware Reward function (

- Reward

). Additionally, we designed an experiment approximating a “no-pruning” scenario (−All_P) to assess the impact on model performance. The results are presented in Table 5.

Effectiveness of MHA
To evaluate MHA’s effect on state modeling, we conduct an ablation ( $- MHA$ ) by replacing MHA with simple feature concatenation. As shown in Table 5, removing the MHA leads to noticeable declines in all evaluation metrics on both the COCO and MoocCube datasets, indicating that MHA plays a key role in capturing both local and global dependencies between path nodes and in emphasizing critical information. Compared with the concatenation approach, MHA effectively suppresses noise and enhances the precision of state representation, thereby improving the action discrimination capability of the policy network. In summary, MHA significantly strengthens the capacity for state representation and is one of the key modules that support the improvement of path reasoning quality and recommendation performance in this method.
We independently examine the effectiveness of the MHA module for state representation by replacing the original MLP encoder in the baseline PGPR framework with MHA, while keeping all other training configurations unchanged, in order to assess its standalone contribution to model performance. As shown in Table 6, the MHA-enhanced variant achieves consistent gains across all four metrics on both COCO and MoocCube. For example, NDCG improves from 9.13 to 9.74 on COCO and from 18.31 to 19.11 on MoocCube, accompanied by similar improvements in Recall, HR, and Precision. These results indicate that MHA captures longer-range semantic dependencies along multi-hop paths and yields more expressive state representations, thereby enabling more accurate policy decisions during graph reasoning.
Effectiveness of Dual-Layer Action Pruning
To systematically evaluate the contribution of the dual-layer action pruning strategy to model performance, we design a series of ablation experiments, where the policy-layer Top-K pruning is removed (−P_A_P), retaining only the environment-layer pruning, in order to isolate and assess the impact of the policy layer in enhancing action discrimination while keeping action space manageable, and to avoid the training instability caused by an excessively large action space.
As shown in Table 5, when the policy-layer pruning is removed, the model exhibits significant drops across all evaluation metrics on both the COCO and MoocCube datasets, and the training convergence also becomes noticeably slower. These results indicate that the policy-layer pruning enhances the discriminative capability of action selection after coarse filtering, effectively eliminating low-value actions, thereby improving path reasoning efficiency and recommendation accuracy.
To further verify the necessity of the environment-layer pruning, we conduct a near “no-pruning” experiment (−All_P), where both max_acts and K are set to 1000, to preserve as many candidate actions as possible. The experimental results on the MoocCube dataset are shown in Figure 5. The model’s training time increases significantly, convergence slows down, and recommendation performance further deteriorates, indicating that environment-layer pruning plays a vital role in early-stage action filtering. Taken together, these experiments demonstrate that the dual-layer pruning strategy, through its “coarse-to-fine” collaborative filtering mechanism, effectively balances the trade-off between action space reduction and recommendation quality, and serves as a key design component for ensuring model efficiency and performance.
We independently examine the effectiveness of the dual-layer action-pruning (DAP) module by reintroducing both pruning layers into the baseline PGPR framework, forming the complete DAP configuration to assess its standalone contribution to the overall model performance. As shown in Table 7, the enhanced variant (+StAR/DAP) consistently matches or outperforms the baseline PGPR on Top-10 NDCG, Recall, HR, and Precision across both COCO and MoocCube. By pruning actions at both the policy-logit level and the environment search stage, DAP suppresses low-value or semantically irrelevant transitions while preserving high-quality candidates. This structured reduction of the action space facilitates more focused exploration and leads to improved reasoning efficiency and recommendation accuracy.
Effectiveness of the Structure-Aware Reward
To validate the impact of the structure-aware knowledge-based reward function on recommendation performance, we design a corresponding ablation experiment ( $- Reward$ ). Specifically, the reward is computed solely based on the simple similarity between the final node and the target node, removing the joint consideration of factors such as node centrality, path relevance, final-node similarity, and path novelty.
As shown in Table 5, removing the structure-aware reward function results in a decline in all evaluation metrics on both the COCO and MoocCube datasets, indicating that the reward function plays a critical role in guiding the agent to explore effective paths. Compared to traditional endpoint similarity-based rewards, the proposed structure-aware mechanism provides finer-grained and semantically rich feedback throughout the path generation process, effectively enhancing the logical coherence and personalization of the paths. The structure-aware reward function not only mitigates the sparse reward problem in reinforcement learning-based path reasoning but also provides essential support for enhancing both the interpretability and performance of the recommendations, making it a vital component contributing to the superior performance of the proposed method.
We further isolate the contribution of the structure-aware reward (SR) by replacing the baseline cosine reward in the baseline PGPR with SR, while keeping all other training and evaluation settings identical. As shown in Table 8, the enhanced variant (+StAR/SR) consistently matches or outperforms the baseline PGPR across Top-10 NDCG, Recall, HR, and Precision on both COCO and MoocCube datasets. Compared with the sparse terminal rewards in PGPR, SR introduces denser and more informative feedback by integrating path relevance, node centrality, final-node similarity, and path novelty. This design enables the agent to pursue structurally consistent reasoning paths and facilitates more stable and effective policy learning.
To quantitatively verify that the proposed structure-aware reward effectively alleviates the sparse-reward issue, we define the non-zero reward ratio per episode ( $n z r_{e p}$ ) as follows:

$n z r_{e p} = \frac{1}{N} \sum_{i = 1}^{N} I (\sum_{t = 1}^{T_{i}} I (r_{i, t} > 0) > 0),$

(19)

where N denotes the total number of episodes, $T_{i}$ is the length of the i-th episode, $r_{i, t}$ is the reward at step t, and $I (\cdot)$ is the indicator function. This metric measures the proportion of reasoning trajectories that receive at least one non-zero reward signal, directly reflecting the density of the reward feedback.
As shown in Table 9, on the MoocCube dataset, the proposed StAR model exhibits noticeably denser and smoother reward feedback than PGPR and UPGPR. This finding suggests that the structure-aware reward effectively transforms the sparse terminal reward into continuous stepwise feedback, facilitating smoother credit propagation and more stable policy optimization.

Ablation results show that MHA, dual-layer action pruning, and structure-aware rewards each play a vital role in improving performance. Working together, they enable StAR to achieve outstanding performance in recommendation accuracy, path reasoning coherence, and interpretability.

4.5. Parameter Analysis

4.5.1. Impact of the Number of Attention Heads

To further investigate the role of MHA in optimizing state representation, we conduct experiments on the COCO and MoocCube datasets with different numbers of attention heads (

head \in {1, 2, 4, 6, 8}

), using

head = 1

(i.e., traditional single-head attention) as the performance baseline. The results are shown in Figure 6.

Results show that on COCO, NDCG and Recall peak at head = 4, with noticeable gains over single-head attention (head = 1). When the number of heads is 6 or more, performance begins to decline, suggesting that too many heads may introduce redundancy. On the MoocCube dataset, head = 4 also yields the best performance, with NDCG reaching 19.76% and Recall reaching 27.02%, significantly outperforming the single-head mechanism. This indicates that a proper number of heads captures rich semantics and improves modeling, while too many heads may hinder learning. Head = 4 achieves the best results on both datasets, making it the optimal configuration for the StAR.

4.5.2. Impact of Top-K Pruning in Policy Layer

In policy-layer action pruning, the parameter K is used to control the number of retained actions, that is, from all candidate actions under the current state, the Top-K actions with the highest scores from the policy network are selected as the executable action set. This mechanism effectively reduces the interference of low-quality actions, thereby improving the precision and efficiency of path selection. To analyze the impact of the Top-K pruning parameter K on model performance, we test varying K values on COCO and MoocCube. Specifically,

K \in {10, 25, 50, 75, 100, 125}

, K = 10 and 25 represent strong pruning, where only the most relevant actions are kept, improving selection precision at the cost of diversity. K = 50 and 75 are moderate pruning levels, offering a balance between space and diversity. K = 100 and 125 indicate weak pruning, retaining more candidates to encourage diverse path exploration. The experimental results are shown in Figure 7.

Experimental results show that the optimal Top-K value varies across different datasets: On COCO, K = 50 yields the best NDCG and Recall, suggesting moderate pruning balances precision and action space. On MoocCube, the best result occurs at K = 100, indicating that weak pruning retains diverse, valuable paths. This reflects that action pruning effectiveness depends on the dataset structure and user interaction patterns. COCO’s wide course variety and user base favor moderate pruning to narrow the search scope. MoocCube’s narrower course coverage benefits from less aggressive pruning. Hence, the Top-K value should be adapted to dataset properties, to balance exploration and discrimination, improving overall recommendation quality and efficiency.

4.5.3. Impact of Reward Function Weight

To pursue an educational goal that prioritizes accuracy while maintaining path coherence and interpretability, we adopt the prior preference

R_{f} > R_{r} > R_{c} > R_{n}

, where final-item similarity (

R_{f}

) dominates, followed by path relevance (

R_{r}

), structural centrality (

R_{c}

), and novelty (

R_{n}

). This principle reflects the idea of making the content accurate first, ensuring logical reasoning paths, and then allowing moderate exploration.

To examine the rationality of this ordering under the goal of “accuracy first, interpretability and path coherence second, and moderate novelty last, ” six control configurations (A0–A5, shown in Table 10) were designed. Each group adjusts only the reward weights

(w_{c}, w_{r}, w_{f}, w_{n})

, while all other training and evaluation settings remain identical to the main experiment.

As shown in Table 10, the StAR configuration A1 (

w_{c} = 0.2, w_{r} = 0.3, w_{f} = 0.4, w_{n} = 0.1

) achieves the best overall results, while the uniform-weight baseline A0 performs the worst, indicating that equal weighting fails to capture the hierarchical importance of reward components. The terminal-only configuration A2 attains slightly higher precision but still lags behind A1 overall, implying that relying solely on final-item similarity is not optimal for balanced recommendation. Removing the novelty term (A3) causes a minor decline, whereas excessive novelty (A5) leads to further degradation, showing that moderate novelty fosters effective exploration but excessive novelty undermines stability. The comparison between A1 and A4 also confirms that assigning greater weight to path relevance (

w_{r} > w_{c}

) improves reasoning coherence and interpretability. In summary, the weighting configuration

(0.2, 0.3, 0.4, 0.1)

proves to be the most appropriate for the StAR model, as it effectively satisfies the educational objective of achieving accurate, interpretable, and pedagogically coherent recommendations.

4.6. Interpretability Evaluation

Following prior studies [37,38], we conducted an Interpretability@10 experiment on the MoocCube dataset to evaluate the interpretability of the proposed method. For each user, we selected the Top-10 recommended courses. For each course, only the reasoning path with the highest probability was retained. From this path, we extracted the set of attribute entities

P_{u}

, while the ground-truth attribute set of the same course in the knowledge graph was denoted as

A_{u}

. The interpretability metrics were defined as

Precision = \frac{| P_{u} \cap A_{u} |}{| P_{u} |}, Recall = \frac{| P_{u} \cap A_{u} |}{| A_{u} |}, F 1 = \frac{2 P R}{P + R} .

(20)

As illustrated in Figure 8, the proposed StAR (Ours) consistently outperforms the baseline PGPR in all three metrics, achieving higher values of Precision, Recall, and F1, which indicates that the StAR method (Ours) has stronger interpretability.

4.7. Case Analysis

To intuitively demonstrate the interpretability and practical workflow of the proposed StAR model, this section selects a learner (Student A) from the MoocCube dataset as a case example. The goal is to recommend the next suitable course for this learner based on their historical learning profile.

As illustrated in Figure 9, this case reflects how a learner interacts with the model in a real educational platform. After logging into the system, the platform locates the corresponding student node in the educational knowledge graph and retrieves the linked entities representing the learner’s enrolled or interested courses. The StAR agent then performs multi-hop reasoning under the trained policy, constrained by the dual-layer action pruning mechanism, to explore candidate reasoning paths and aggregate them into personalized course recommendations. The results are visualized as interpretable paths—e.g., “student → enrolled course → concept → target course”—which intuitively reveal to learners the reasoning behind each recommendation.

In the given path, student user A previously enrolled in Python Programming Basic, which contains the concept of “Lambda Expressions.” During path generation, the StAR integrates four reward signals: path relevance, node centrality, final-node similarity, and path novelty, guiding the model to focus on relevant and representative course nodes, ensuring strong alignment between recommendation and learner state, while maintaining exploration and avoiding redundant recommendations. Ultimately, the model recommends an advanced course, Advanced Python Features, which also contains “Lambda Expressions.” The resulting recommendation path is structurally clear and semantically coherent, not only revealing the model’s reasoning process but also enhancing the interpretability of the result and the learner’s trust.

5. Conclusions

This paper proposes StAR, an explainable learning resource recommendation method that integrates structure-aware knowledge perception with MHA, aiming to address key challenges such as insufficient state representation, large action space complexity, and sparse reward signals. Specifically, StAR enhances state representations through MHA, employs a dual-layer action pruning strategy to reduce the action space, and introduces a novel structure-aware reward function to guide efficient and interpretable path reasoning. Experimental results demonstrate that StAR consistently outperforms existing baseline methods across multiple evaluation metrics, validating its effectiveness and superiority. Future work will explore the incorporation of dynamic user interaction data, automated hyperparameter optimization strategies, and the potential for cross-domain applications to further advance the development of AI-augmented learning in smart cities.

Author Contributions

Conceptualization, T.B. and H.Z.; Methodology, T.B. and F.Z.; Visualization, T.B.; Writing—original draft, T.B.; Data curation, T.B.; Validation, T.B. and F.Z.; Project administration, H.Z.; Supervision, H.Z. and F.Z.; Writing—review & editing, H.Z. and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under grant number 61976118 and the Postgraduate Research & Practice Innovation Program of Jiangsu Province under project number KYCX25_2276.

Data Availability Statement

The MOOCCube dataset is available at http://moocdata.cn/data/MOOCCube (accessed on 10 October 2024); The COCO dataset can be obtained by contacting the authors of COCO: Semantic-Enriched Collection of Online Courses at Scale with Experimental Use Cases via email at mirko.marras@unica.it.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Kruhlov, V.; Dvorak, J. Social Inclusivity in the Smart City Governance: Overcoming the Digital Divide. Sustainability 2025, 17, 5735. [Google Scholar] [CrossRef]
Jiang, S.; Song, H.; Lu, Y.; Zhang, Z. News Recommendation Method Based on Candidate-Aware Long-and Short-Term Preference Modeling. Appl. Sci. 2024, 15, 300. [Google Scholar] [CrossRef]
Raj, N.S.; Renumol, V. An improved adaptive learning path recommendation model driven by real-time learning analytics. J. Comput. Educ. 2024, 11, 121–148. [Google Scholar] [CrossRef]
Chen, G.; Chen, P.; Wang, Y.; Zhu, N. Research on the development of an effective mechanism of using public online education resource platform: TOE model combined with FS-QCA. Interact. Learn. Environ. 2024, 32, 6096–6120. [Google Scholar] [CrossRef]
Maimaitijiang, E.; Aihaiti, M.; Mamatjan, Y. An Explainable AI Framework for Online Diabetes Risk Prediction with a Personalized Chatbot Assistant. Electronics 2025, 14, 3738. [Google Scholar] [CrossRef]
Koren, Y.; Rendle, S.; Bell, R. Advances in collaborative filtering. In Recommender Systems Handbook; Springer: New York, NY, USA, 2021; pp. 91–142. [Google Scholar]
Son, J.; Kim, S.B. Content-based filtering for recommendation systems using multiattribute networks. Expert Syst. Appl. 2017, 89, 404–412. [Google Scholar] [CrossRef]
Quadrana, M.; Cremonesi, P.; Jannach, D. Sequence-aware recommender systems. ACM Comput. Surv. 2018, 51, 1–36. [Google Scholar] [CrossRef]
Peng, C.; Xia, F.; Naseriparsa, M.; Osborne, F. Knowledge graphs: Opportunities and challenges. Artif. Intell. Rev. 2023, 56, 13071–13102. [Google Scholar] [CrossRef]
Xian, Y.; Fu, Z.; Muthukrishnan, S.; De Melo, G.; Zhang, Y. Reinforcement knowledge graph reasoning for explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 285–294. [Google Scholar]
Frej, J.; Shah, N.; Knezevic, M.; Nazaretsky, T.; Käser, T. Finding paths for explainable MOOC recommendation: A learner perspective. In Proceedings of the 14th Learning Analytics and Knowledge Conference, Kyoto, Japan, 18–22 March 2024; pp. 426–437. [Google Scholar]
Afsar, M.M.; Crump, T.; Far, B. Reinforcement learning based recommender systems: A survey. ACM Comput. Surv. 2022, 55, 1–38. [Google Scholar] [CrossRef]
Liu, H.; Cai, K.; Li, P.; Qian, C.; Zhao, P.; Wu, X. REDRL: A review-enhanced Deep Reinforcement Learning model for interactive recommendation. Expert Syst. Appl. 2023, 213, 118926. [Google Scholar] [CrossRef]
Riedmiller, M.; Hafner, R.; Lampe, T.; Neunert, M.; Degrave, J.; Wiele, T.; Mnih, V.; Heess, N.; Springenberg, J.T. Learning by playing solving sparse reward tasks from scratch. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4344–4353. [Google Scholar]
Ma, A.; Yu, Y.; Shi, C.; Zhen, S.; Pang, L.; Chua, T.-S. PMHR: Path-Based Multi-Hop Reasoning Incorporating Rule-Enhanced Reinforcement Learning and KG Embeddings. Electronics 2024, 13, 4847. [Google Scholar] [CrossRef]
Li, Y.; Lu, L.; Li, X. A hybrid collaborative filtering method for multiple-interests and multiple-content recommendation in E-Commerce. Expert Syst. Appl. 2005, 28, 67–77. [Google Scholar] [CrossRef]
Deldjoo, Y.; Schedl, M.; Cremonesi, P.; Pasi, G. Recommender systems leveraging multimedia content. ACM Comput. Surv. 2020, 53, 1–38. [Google Scholar] [CrossRef]
Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 285–295. [Google Scholar]
Pazzani, M.J.; Billsus, D. Content-based recommendation systems. In The Adaptive Web: Methods and Strategies of Web Personalization; Springer: Berlin/Heidelberg, Germany, 2007; pp. 325–341. [Google Scholar]
Sinha, R.; Swearingen, K. The role of transparency in recommender systems. In Proceedings of the CHI’02 Extended Abstracts on Human Factors in Computing Systems, Minneapolis, MN, USA, 20–25 April 2002; pp. 830–831. [Google Scholar]
Rendle, S.; Freudenthaler, C.; Schmidt-Thieme, L. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 811–820. [Google Scholar]
Bordes, A.; Usunier, N.; Garcia-Durán, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’13), Lake Tahoe, NV, USA, 5–10 December 2013; Curran Associates Inc.: Red Hook, NY, USA, 2013; pp. 2787–2795. [Google Scholar]
Zhang, F.; Yuan, N.J.; Lian, D.; Xie, X.; Ma, W.-Y. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 353–362. [Google Scholar]
Wang, X.; He, X.; Cao, Y.; Liu, M.; Chua, T.-S. Kgat: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 950–958. [Google Scholar]
Wang, X.; Wang, D.; Xu, C.; He, X.; Cao, Y.; Chua, T.-S. Explainable reasoning over knowledge graphs for recommendation. Proc. AAAI Conf. Artif. Intell. 2019, 33, 5329–5336. [Google Scholar] [CrossRef]
Garcia, F.; Rachelson, E. Markov decision processes. In Markov Decision Processes in Artificial Intelligence; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2013; pp. 1–38. [Google Scholar]
Gao, H.; Ma, B. Online Learning Strategy Induction through Partially Observable Markov Decision Process-Based Cognitive Experience Model. Electronics 2024, 13, 3858. [Google Scholar] [CrossRef]
Zhu, A.; Ouyang, D.; Liang, S.; Shao, J. Step by step: A hierarchical framework for multi-hop knowledge graph reasoning with reinforcement learning. Knowl.-Based Syst. 2022, 248, 108843. [Google Scholar] [CrossRef]
Zheng, S.; Chen, W.; Wang, W.; Zhao, P.; Yin, H.; Zhao, L. Multi-hop knowledge graph reasoning in few-shot scenarios. IEEE Trans. Knowl. Data Eng. 2023, 36, 1713–1727. [Google Scholar] [CrossRef]
Tao, S.; Qiu, R.; Ping, Y.; Ma, H. Multi-modal knowledge-aware reinforcement learning network for explainable recommendation. Knowl.-Based Syst. 2021, 227, 107217. [Google Scholar] [CrossRef]
Su, J.; Huang, J.; Adams, S.; Chang, Q.; Beling, P.A. Deep multi-agent reinforcement learning for multi-level preventive maintenance in manufacturing systems. Expert Syst. Appl. 2022, 192, 116323. [Google Scholar] [CrossRef]
Lin, Y.; Liu, Y.; Lin, F.; Zou, L.; Wu, P.; Zeng, W.; Chen, H.; Miao, C. A survey on reinforcement learning for recommender systems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 13164–13184. [Google Scholar] [CrossRef] [PubMed]
Mu, R.; Marcolino, L.S.; Zhang, Y.; Zhang, T.; Huang, X.; Ruan, W. Reward certification for policy smoothed reinforcement learning. Proc. AAAI Conf. Artif. Intell. 2024, 38, 21429–21437. [Google Scholar] [CrossRef]
Yu, J.; Luo, G.; Xiao, T.; Zhong, Q.; Wang, Y.; Feng, W.; Luo, J.; Wang, C.; Hou, L.; Li, J. MOOCCube: A large-scale data repository for NLP applications in MOOCs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3135–3142. [Google Scholar]
Dessì, D.; Fenu, G.; Marras, M.; Reforgiato Recupero, D. Coco: Semantic-enriched collection of online courses at scale with experimental use cases. In Trends and Advances in Information Systems and Technologies, Proceedings of the World Conference on Information Systems and Technologies, Naples, Italy, 27–29 March 2018; Springer: Cham, Switzerland, 2018; pp. 1386–1396. [Google Scholar]
Fareri, S.; Melluso, N.; Chiarello, F.; Fantoni, G. SkillNER: Mining and mapping soft skills from any text. Expert Syst. Appl. 2021, 184, 115544. [Google Scholar] [CrossRef]
Tan, J.; Xu, S.; Ge, Y.; Li, Y.; Chen, X.; Zhang, Y. Counterfactual explainable recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Online, 1–5 November 2021; pp. 1784–1793. [Google Scholar]
Chen, C.; Zhang, M.; Liu, Y.; Ma, S. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 1583–1592. [Google Scholar]

Figure 1. Explainable Learning Resource Recommendation Pipeline.

Figure 2. State representation optimization via MHA. The original state

S_{t}

is composed of user, current node, and historical node embeddings, and is enhanced into

S_{t}^{'}

through scaled dot-product MHA.

Figure 2. State representation optimization via MHA. The original state

S_{t}

is composed of user, current node, and historical node embeddings, and is enhanced into

S_{t}^{'}

through scaled dot-product MHA.

Figure 3. Composition of the structure-aware reward function. The reward is formed by aggregating four components—node centrality, path relevance, final node similarity, and path novelty—along with a self-loop penalty to suppress redundant transitions.

Figure 4. Performancecurves of NDCG@K and Recall@K with 95% confidence intervals estimated by user-level bootstrap on MoocCube, where the red pentagram markers indicate the results at

K = 10

.

Figure 4. Performancecurves of NDCG@K and Recall@K with 95% confidence intervals estimated by user-level bootstrap on MoocCube, where the red pentagram markers indicate the results at

K = 10

.

Figure 5. Training reward under different action pruning strategies on MoocCube.

Figure 6. Impact of the number of attention heads on model performance. (a,b) show the NDCG and Recall results on the COCO dataset, while (c,d) present the corresponding results on the MoocCube dataset.

Figure 7. Impact of the Top-K pruning parameter in the policy layer. Subfigures (a,b) illustrate the performance of the model on the COCO and MoocCube datasets, respectively, under varying Top-K values in the policy-level pruning strategy. Both NDCG and Recall are reported.

Figure 8. Interpretability comparison between StAR (Ours) and PGPR on the MoocCube dataset.

Figure 9. Example of an interpretable course recommendation generated by StAR on the MoocCube dataset.

Table 1. Relation types and their semantic descriptions in the knowledge graph.

Relation Type	Semantic Description
Enrolled	The user is enrolled in a course (User-Course)
Teach	A teacher teaches a course (Teacher-Course)
Contain	A course contains certain concepts (Concept-Course)
Provide	A school provides a course (School-Course)
Belong_to	A course belongs to a category (Category-Course)

Table 2. Statistics of Educational Datasets: COCO and MoocCube.

Metric	COCO	MoocCube
Users	17,457	6507
Courses	20,926	687
Interactions	279,792	97,223
Avg. Interactions	16.1	14.9

Table 3. Training hyperparameters used in experiments.

Hyperparameters	Value
Optimizer	Adam
Learning Rate	0.001
Entropy Weight	0.01
Gamma	0.99
Batch Size	32
Epoch	50
Hidden	[512, 256]
Max Actions	250

Table 4. Performance comparison of different models on COCO and MoocCube datasets (mean ± standard deviation). The results are reported in percentage (%) and calculated based on the Top-10 predictions in the test set. The best results are highlighted in bold.

Dataset	Model	NDCG/%	Recall/%	HR/%	Precision/%
COCO	Pop	$3.57 \pm 0.0$	$7.36 \pm 0.0$	$12.37 \pm 0.0$	$1.33 \pm 0.0$
	CFKG	$8.31 \pm 0.0$	$13.66 \pm 0.1$	$13.68 \pm 0.2$	$2.21 \pm 0.1$
	KGAT	$8.95 \pm 0.2$	$13.21 \pm 0.4$	$23.02 \pm 0.7$	$2.52 \pm 0.2$
	PGPR	$9.13 \pm 0.2$	$14.12 \pm 0.2$	$24.11 \pm 0.4$	$2.64 \pm 0.0$
	UPGPR	$9.56 \pm 0.1$	$14.79 \pm 0.1$	$24.97 \pm 0.2$	$2.77 \pm 0.1$
	StAR	$10.92 \pm 0.2$	$16.66 \pm 0.2$	$27.60 \pm 0.3$	$3.07 \pm 0.0$
MoocCube	Pop	$7.53 \pm 0.0$	$12.51 \pm 0.0$	$21.99 \pm 0.1$	$2.34 \pm 0.0$
	CFKG	$10.56 \pm 0.4$	$20.89 \pm 0.9$	$34.61 \pm 1.2$	$3.83 \pm 0.2$
	KGAT	$12.84 \pm 0.6$	$25.49 \pm 1.2$	$41.26 \pm 1.4$	$4.72 \pm 0.3$
	PGPR	$18.31 \pm 0.1$	$24.17 \pm 0.1$	$40.08 \pm 0.3$	$4.70 \pm 0.1$
	UPGPR	$18.77 \pm 0.1$	$25.92 \pm 0.2$	$42.81 \pm 0.3$	$4.99 \pm 0.0$
	StAR	$19.74 \pm 0.3$	$27.00 \pm 0.4$	$44.15 \pm 0.6$	$5.19 \pm 0.1$

Table 5. Ablation study results on COCO and MoocCube datasets. The results are reported in percentage (%) and calculated based on the Top-10 predictions in the test set. The full StAR model is highlighted in bold.

Model	COCO				MoocCube
Model	NDCG	Recall	HR	Precision	NDCG	Recall	HR	Precision
StAR	10.92	16.66	27.60	3.07	19.74	27.00	44.15	5.19
$- MHA$	9.83	14.57	24.88	2.74	19.08	25.79	42.65	4.93
−P_A_P	9.87	14.77	25.03	2.76	19.16	26.02	43.35	5.03
−All_P	9.77	14.60	24.87	2.72	18.87	25.41	42.40	4.89
$- Reward$	10.16	15.31	25.83	2.87	19.49	26.55	43.89	5.16

Table 6. Performance of the baseline PGPR vs. PGPR with the MHA state encoder (+StAR/MHA) on COCO and MoocCube. The results are reported in percentage (%) and calculated based on the Top-10 predictions in the test set.

Model	COCO				MoocCube
Model	NDCG	Recall	HR	Precision	NDCG	Recall	HR	Precision
PGPR	9.13	14.12	24.11	2.64	18.31	24.17	40.08	4.70
+StAR/MHA	9.74	14.56	24.84	2.71	19.11	25.87	42.71	5.03

Table 7. Performance of the baseline PGPR vs. PGPR with the dual-layer action-pruning module (+StAR/DAP) on COCO and MoocCube. The results are reported in percentage (%) and calculated based on the Top-10 predictions in the test set.

Model	COCO				MoocCube
Model	NDCG	Recall	HR	Precision	NDCG	Recall	HR	Precision
PGPR	9.13	14.12	24.11	2.64	18.31	24.17	40.08	4.70
+StAR/DAP	9.33	13.90	23.86	2.63	18.88	25.60	42.20	4.98

Table 8. Performance of the baseline PGPR vs. PGPR with the structure-aware reward (+StAR/SR) on COCO and MoocCube. The results are reported in percentage (%) and calculated based on the Top-10 predictions in the test set.

Model	COCO				MoocCube
Model	NDCG	Recall	HR	Precision	NDCG	Recall	HR	Precision
PGPR	9.13	14.12	24.11	2.64	18.31	24.17	40.08	4.70
+StAR/SR	10.32	15.85	26.54	2.96	19.51	26.51	43.88	5.19

Table 9. Comparison of reward sparsity among PGPR, UPGPR, and StAR.

Model	Reward Type	${nzr}_{ep}$	Interpretation
UPGPR	Binary (terminal only)	0.225	Highly sparse
PGPR	Cosine (terminal continuous)	0.462	Moderately sparse
StAR (ours)	Structure-aware (stepwise dense)	0.998	Substantially dense

Table 10. Comparison of different reward weight configurations (A0–A5) on the MoocCube dataset. The results are reported in percentage (%) and calculated based on the Top-10 predictions in the test set.

Group	NDCG	Recall	HR	Precision	$(w_{c}, w_{r}, w_{f}, w_{n})$	Description
A0	18.78	25.56	42.46	4.97	$(0.25, 0.25, 0.25, 0.25)$	Uniform weights
A1	19.74	27.03	44.15	5.19	$(0.20, 0.30, 0.40, 0.10)$	StAR configuration
A2	19.37	26.50	43.72	5.07	$(0.00, 0.00, 1.00, 0.00)$	Final-similarity only
A3	19.38	26.28	43.11	5.12	$(0.25, 0.35, 0.40, 0.00)$	No novelty
A4	19.58	26.68	43.91	5.12	$(0.30, 0.20, 0.40, 0.10)$	Swap $w_{r} / w_{c}$
A5	19.11	25.87	42.35	4.98	$(0.15, 0.25, 0.40, 0.20)$	High novelty

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bu, T.; Zheng, H.; Zhao, F. A Structure-Aware and Attention-Enhanced Explainable Learning Resource Recommendation Approach for Smart Education Within Smart Cities. Electronics 2025, 14, 4561. https://doi.org/10.3390/electronics14234561

AMA Style

Bu T, Zheng H, Zhao F. A Structure-Aware and Attention-Enhanced Explainable Learning Resource Recommendation Approach for Smart Education Within Smart Cities. Electronics. 2025; 14(23):4561. https://doi.org/10.3390/electronics14234561

Chicago/Turabian Style

Bu, Tianxue, Hao Zheng, and Fen Zhao. 2025. "A Structure-Aware and Attention-Enhanced Explainable Learning Resource Recommendation Approach for Smart Education Within Smart Cities" Electronics 14, no. 23: 4561. https://doi.org/10.3390/electronics14234561

APA Style

Bu, T., Zheng, H., & Zhao, F. (2025). A Structure-Aware and Attention-Enhanced Explainable Learning Resource Recommendation Approach for Smart Education Within Smart Cities. Electronics, 14(23), 4561. https://doi.org/10.3390/electronics14234561

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Structure-Aware and Attention-Enhanced Explainable Learning Resource Recommendation Approach for Smart Education Within Smart Cities

Abstract

1. Introduction

2. Related Work

3. Methodology

3.1. Task Definition

3.2. State Representation Optimization Based on Multi-Head Attention

3.3. Dual-Layer Action Pruning Strategy

3.4. Design of Structure-Aware Reward Function

4. Experiments

4.1. Datasets and Experimental Settings

4.2. Performance Evaluation Metrics

4.3. Results and Analysis

4.4. Ablation Experiment

4.5. Parameter Analysis

4.5.1. Impact of the Number of Attention Heads

4.5.2. Impact of Top-K Pruning in Policy Layer

4.5.3. Impact of Reward Function Weight

4.6. Interpretability Evaluation

4.7. Case Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI