A Dueling DQN-Based Hyper-Heuristic Framework for Learning Path Optimization

Zhang, Yong-Wei; Zhu, Ming-Yang; Xia, Wen-Kai; Zhang, Xin-Yang; Liu, Jin-Di

doi:10.3390/bdcc10050153

by Yong-Wei Zhang, Ming-Yang Zhu *, Wen-Kai Xia, Xin-Yang Zhang and Jin-Di Liu

School of Automation, Jiangsu University of Science and Technology, Zhenjiang 212100, China

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(5), 153; https://doi.org/10.3390/bdcc10050153

Submission received: 22 March 2026 / Revised: 28 April 2026 / Accepted: 7 May 2026 / Published: 13 May 2026

(This article belongs to the Section Data Mining and Machine Learning)


Abstract

Learning path optimization is crucial in intelligent educational systems, with the core challenge of efficient multi-objective sequential decision-making under complex prerequisite constraints. To address the poor generalization of existing methods relying on fixed operator scheduling or handcrafted heuristics, this paper proposes a hyper-heuristic framework based on Dueling Deep Q-Network (Dueling DQN-HH), formulating operator selection as a sequential decision-making process for dynamic adaptive scheduling of low-level operators. The framework adopts priority-based encoding to unify learning path representation (decoupling the hyper-heuristic layer from the problem domain) and designs a composite reward mechanism integrating reward shaping, exploration incentives, and computational cost awareness to balance solution quality and efficiency. Additionally, it employs a dueling network architecture with prioritized experience replay to enhance policy learning stability. Experimental results show the proposed method outperforms representative baseline algorithms in solution quality, convergence stability, and computational efficiency. The framework demonstrates superior performance across multiple objectives, particularly in minimizing the total learning time (F_time), as validated on two heterogeneous datasets: MOOCCube (Computer Science) and PsyDataset (Psychology). Further ablation studies and operator evolution analyses verify its adaptive scheduling capability under different objectives and knowledge graph structures, demonstrating strong objective independence and cross-dataset generalization.

Keywords: learning path optimization; Dueling Deep Q-Network; hyper-heuristic; composite reward mechanism; cross-dataset generalization

1. Introduction

With the rapid growth in massive open online courses, the “information overload” problem [1] from heterogeneous educational resources has become a major obstacle to online learning efficiency [2]. To maximize efficiency under limited cognitive resources, personalized learning path optimization (LPO) has emerged: it is a complex sequential decision problem defined on a weighted directed acyclic graph (DAG) [3], and must satisfy strict topological precedence constraints and learners’ memory behaviors (integrating the Ebbinghaus forgetting mechanism [4] and cognitive load theory [5]). This renders LPO a strongly constrained NP-hard combinatorial optimization problem [6].

To solve LPO, the academic community has shifted to metaheuristic algorithms (e.g., genetic algorithms [7,8], ant colony optimization [9]) instead of computationally expensive exact solutions. However, per the “No Free Lunch” theorem [10], single low-level heuristic operators struggle with complex search spaces, leading to challenges in constraint handling (where random operators produce invalid solutions, wasting resources on corrections) and exploration–exploitation imbalance (causing premature convergence to local optima [6]) for DAG-based LPO. To address these issues, deep reinforcement learning (DRL)-based hyper-heuristic frameworks have become a research hotspot [11,12]. Beyond the educational domain, learning-driven hyper-heuristics have demonstrated success in various combinatorial optimization challenges, such as medical crew scheduling involving tri-level decision constraints [13] and industrial scheduling in distributed heterogeneous hybrid flow shops [14]. The core of these successes lies in the shift from static operator execution to adaptive search control. This trend is increasingly evident in diverse engineering optimizations where computational efficiency is paramount. For instance, in UAV path scheduling and trajectory optimization, meta-heuristic variants are specifically designed to balance search exploration with operational power consumption [15]. Similarly, in geotechnical engineering, integrating machine learning with meta-heuristic optimizations enables high-precision predictions while minimizing redundant parameter evaluations [16]. These cross-disciplinary advancements highlight that incorporating cost-awareness into strategy scheduling is essential for navigating high-dimensional search landscapes. By prioritizing operators with higher “benefit–cost ratios,” such frameworks can significantly mitigate the efficiency bottlenecks inherent in constrained optimization. 
Despite these broader successes, their application to LPO still faces two hurdles. First, sparse rewards make operator contributions indistinct in lengthy sequences; the resulting feedback delays hinder standard DQN from capturing differences in operator effectiveness and lead to value overestimation and suboptimal strategies [17,18]. Second, hard-coded masking strategies lack a universal encoding, which limits generalization across knowledge graphs [6]. LPO’s inherent cross-disciplinary diversity and structural complexity make it an ideal testbed for evaluating the adaptability of hyper-heuristic frameworks. Thus, this paper uses LPO as a test case to explore hyper-heuristic solutions.

However, our quantitative analysis reveals a significant “efficiency wall” in mainstream non-adaptive algorithms. For a knowledge graph with 287 knowledge points, even the powerful Genetic Algorithm-based Hyper-heuristic (GA-HH) exhibits an optimization efficiency (Eff.) that is 20.4% lower than that of our proposed Dueling DQN-HH. As shown in our ablation study, the lack of adaptive guidance leads to a 2.2% degradation in final solution quality for standard DQN compared to the full Dueling architecture, and traditional methods typically stagnate after only 25% of the evaluation budget. To tackle these problems, this paper proposes a Dueling DQN-based hyper-heuristic framework: it models operator selection as a long-term sparse-reward decision problem, decouples state values and operator advantages via the Dueling architecture to reduce value estimation instability, and designs a domain-isolated encoding–decoding mechanism to separate constraint processing from strategy learning for stable learning in unconstrained spaces. Additionally, a cost-aware reward mechanism and a diversified operator library support adaptive decision-making. We evaluate the framework on LPO in terms of learning stability, search efficiency, and generalization ability. The main innovations are as follows:

We propose a cost-aware Dueling DQN-based hyper-heuristic framework, which decouples state value from operator advantage for stable scheduling under sparse rewards.
We propose a domain isolation strategy for constrained combinatorial optimization, combining random-key encoding and greedy topological decoding to map constrained discrete problems to unconstrained continuous spaces, ensuring solution feasibility without expensive constraint processing or repair.
We design a cost-aware nonlinear reward mechanism that guides agents in balancing lightweight and heavyweight operators via a “benefit–cost ratio,” optimizing the allocation of computing resources.
We construct a diversified optimization environment with a variety of heterogeneous objective functions to verify the robustness and generalization ability of this hyper-heuristic framework in complex search environments.

2. Problem Description and Modeling

2.1. Prerequisite Knowledge and Symbol Definitions

A knowledge graph is defined as a DAG:

$G = (K, E)$

(1)

where $K = \{k_{1}, k_{2}, \dots, k_{n}\}$ denotes a set containing n knowledge points, and $E \subseteq K \times K$ denotes the set of prerequisite dependencies. If $(k_{i}, k_{j}) \in E$ , then $k_{i}$ is a prerequisite of knowledge point $k_{j}$ and must be learned before $k_{j}$ . The LPO problem defined on this DAG is NP-hard.

2.2. Mathematical Modeling

2.2.1. Decision Variable

Define the decision variable as a learning order of all knowledge points:

$S = \langle s_{1}, s_{2}, \dots, s_{n} \rangle$

(2)

where $s_{t} \in K$ denotes the specific knowledge point learned at position (or time) t, n is the total number of knowledge points in the dataset, and S is a permutation of the set $K = \{k_{1}, k_{2}, \dots, k_{n}\}$ , ensuring that each knowledge point is learned exactly once.

2.2.2. Constraints

To ensure the legitimacy of learning paths and the soundness of instructional logic, the following constraints must be satisfied:

Uniqueness Constraint: Each knowledge point must be learned exactly once, and exactly one knowledge point is learned per time step. Let the binary variable $x_{i, t}$ equal 1 if knowledge point $k_{i}$ is learned at position t, and 0 otherwise. Then

$\sum_{t = 1}^{n} x_{i, t} = 1, \forall i \in \{1, \dots, n\}$

(3)

$\sum_{i = 1}^{n} x_{i, t} = 1, \forall t \in \{1, \dots, n\}$

(4)

Topological Constraint: For any dependency $(k_{i}, k_{j}) \in E$ , the learning time of the prerequisite knowledge point $k_{i}$ must strictly precede that of $k_{j}$ . Using the decision variable $x_{i, t}$ , this constraint can be formalized as

$\sum_{t = 1}^{n} t \cdot x_{i, t} < \sum_{t = 1}^{n} t \cdot x_{j, t}, \forall (k_{i}, k_{j}) \in E$

(5)
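
Constraints (3)–(5) can be checked directly on a candidate sequence. The following is a minimal sketch, not the authors' implementation: the function name `is_valid_path` and the five-node example graph are assumptions introduced purely for illustration.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def is_valid_path(sequence: Sequence[Hashable],
                  nodes: Iterable[Hashable],
                  edges: Iterable[Tuple[Hashable, Hashable]]) -> bool:
    """Return True if `sequence` satisfies Equations (3)-(5)."""
    node_set = set(nodes)
    # Uniqueness (Eqs. 3-4): the sequence is a permutation of all knowledge points.
    if len(sequence) != len(node_set) or set(sequence) != node_set:
        return False
    # pos[k] is the 1-based position t at which knowledge point k is learned.
    pos = {k: t for t, k in enumerate(sequence, start=1)}
    # Topological constraint (Eq. 5): each prerequisite strictly precedes its successor.
    return all(pos[ki] < pos[kj] for ki, kj in edges)


# Hypothetical example graph (an assumption, not taken from the paper's datasets).
EXAMPLE_NODES = ["A", "B", "C", "D", "E"]
EXAMPLE_EDGES = [("A", "C"), ("B", "D"), ("C", "E"), ("D", "E")]

print(is_valid_path(["B", "D", "A", "C", "E"], EXAMPLE_NODES, EXAMPLE_EDGES))
print(is_valid_path(["C", "A", "B", "D", "E"], EXAMPLE_NODES, EXAMPLE_EDGES))
```

The second call places C before its prerequisite A, so it violates Equation (5).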

2.3. Objective Function

The goal of learning path optimization in this study is to determine the learning path sequence that optimizes specific teaching indicators. To comprehensively evaluate the performance and generalization ability of the proposed Dueling DQN-HH framework, we define five objective functions. Among these, $F_{time}$ serves as the primary optimization target for the main experiments, while the remaining four objective functions are employed to validate the algorithm’s adaptability across diverse teaching scenarios.

2.3.1. Primary Objective

This objective function simulates real cognitive processes, aiming to minimize review costs caused by forgetting through rational sequencing while ensuring learning efficiency. Based on cognitive load theory [5] and the Ebbinghaus forgetting curve [4], total learning time comprises two components: foundational cognitive time and review penalty time. Let $S = \langle s_{1}, s_{2}, \dots, s_{n} \rangle$ denote the sequence of knowledge points to be learned, where $s_{t}$ represents the t-th knowledge point to be learned. The primary objective function $F_{time}$ is defined as follows:

$\min F_{time} (S) = \sum_{t = 1}^{n} [C_{base} (s_{t}) + C_{review} (s_{t})]$

(6)

where $C_{base} (s_{t})$ represents the basic cognitive time, which reflects the inherent time required to master a knowledge point itself and is determined by the topological characteristics of the knowledge graph:

$C_{base} (s_{t}) = T_{0} + β_{1} \cdot d_{in} (s_{t}) + β_{2} \cdot d_{out} (s_{t})$

(7)

Here, $T_{0}$ is the reference time, $d_{in} (\cdot)$ and $d_{out} (\cdot)$ denote the in-degree and out-degree of the node, respectively, and $β_{1}$ and $β_{2}$ are the corresponding weight coefficients.

$C_{review}$ denotes the review penalty time. When a learner studies $s_{t}$ , if its prerequisite knowledge point $k_{p} \in Pre (s_{t})$ has already been learned and the interval since learning it is relatively long, additional review time is required. Here, $Pre (s_{t})$ denotes the set of knowledge points that must be mastered prior to studying $s_{t}$ :

$C_{review} (s_{t}) = \sum_{k_{p} \in Pre (s_{t})} α \cdot Δ (k_{p}, s_{t})$

(8)

where $Δ (k_{p}, s_{t}) = pos (s_{t}) - pos (k_{p})$ denotes the number of steps between two knowledge points in the sequence, and $α$ is the forgetting factor. This term compels the algorithm to favor placing related knowledge points in adjacent positions.
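
Equations (6)–(8) can be evaluated in a few lines. The sketch below is illustrative only: it assumes that $Pre (s_{t})$ contains the direct predecessors of $s_{t}$ in the graph, and the parameter values (`t0`, `beta1`, `beta2`, `alpha`) and the example edges are placeholders, not values reported in the paper.

```python
def f_time(sequence, edges, t0=1.0, beta1=0.1, beta2=0.1, alpha=0.05):
    """Total learning time of Eq. (6) for a valid sequence on a DAG."""
    # 1-based position of every knowledge point in the sequence.
    pos = {k: t for t, k in enumerate(sequence, start=1)}
    preds = {k: [] for k in sequence}   # assumed: Pre(s) = direct predecessors
    out_deg = {k: 0 for k in sequence}
    for ki, kj in edges:
        preds[kj].append(ki)
        out_deg[ki] += 1

    total = 0.0
    for s in sequence:
        # Eq. (7): base time from in-degree and out-degree.
        c_base = t0 + beta1 * len(preds[s]) + beta2 * out_deg[s]
        # Eq. (8): review penalty grows with the gap to each prerequisite.
        c_review = sum(alpha * (pos[s] - pos[p]) for p in preds[s])
        total += c_base + c_review
    return total


# Hypothetical example graph (an assumption for illustration).
EDGES = [("A", "C"), ("B", "D"), ("C", "E"), ("D", "E")]
print(f_time(["B", "D", "A", "C", "E"], EDGES))
print(f_time(["A", "B", "C", "D", "E"], EDGES))
```

Under this reading, the sum of $C_{base}$ depends only on node degrees and is therefore identical for every permutation; only the review term distinguishes one valid sequence from another.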

2.3.2. Supplementary Verification Objectives

To evaluate the generalization ability of the Dueling DQN-HH framework, four additional objective functions are introduced in this study. These functions cover different teaching perspectives (structural consistency, learning efficiency, and multi-objective trade-offs) and exhibit different mathematical characteristics, such as extreme values, averages, and weighted sums. They are used to evaluate the robustness of the framework in various optimization environments. These objective functions are not optimized jointly; each one is used to build an optimization environment with different characteristics so as to evaluate the adaptability of the proposed method in heterogeneous search spaces.

Minimization of Maximum Prerequisite Distance ( $F_{dist}$ ): This objective aims to prevent severe cognitive gaps by shortening the span between a knowledge point and all its prerequisites. It is defined as:

$\min F_{dist} (S) = \max_{t \in \{1, \dots, n\}} (\max_{k_{p} \in Pre (s_{t})} Δ (k_{p}, s_{t}))$

(9)

Characterized by extreme value optimization, this function poses a significant challenge to the algorithm’s ability to escape local optima.

Maximization of Knowledge Density ( $F_{density}$ ): Focused on learning efficiency, this objective aims to concentrate foundational concepts with high support capacity in the early stages of the path to build a solid cognitive base for subsequent learning. Its expression is:

$\max F_{density} (S) = \frac{1}{\lfloor n / 2 \rfloor} \sum_{t = 1}^{\lfloor n / 2 \rfloor} d_{out} (s_{t})$

(10)

Maximization of Early Gain ( $F_{gain}$ ): This objective emphasizes motivation maintenance by introducing an exponential decay factor $λ$ (set to $λ = 0.1$ in this study), encouraging the algorithm to schedule high-value (high out-degree) knowledge units as early as possible. Its mathematical form is:

$\max F_{gain} (S) = \sum_{t = 1}^{n} e^{- λ (t - 1)} \cdot d_{out} (s_{t})$

(11)

Compared to $F_{density}$ , this function provides a smoother search space with global weighting.

Balanced Objective ( $F_{bal}$ ): Embodying the concept of multi-objective trade-offs, this objective utilizes a weighting factor $ω_{bal}$ to balance the total learning time $F_{time}$ and the overall distribution of cognitive load. It is defined as:

$\min F_{bal} (S) = ω_{bal} \cdot \frac{F_{time} (S)}{T_{\max}} + (1 - ω_{bal}) \cdot \frac{Var (C_{base})}{V_{\max}}$

(12)

where $Var (C_{base})$ denotes the variance of the basic cognitive time of each knowledge point in the sequence, serving as a metric for the stability of learning intensity. $T_{\max}$ and $V_{\max}$ are the normalization coefficients for the total learning time and cognitive intensity variance, respectively. Specifically, $T_{\max}$ represents the theoretical upper bound of the total learning time, typically estimated by the sum of the maximum possible learning times of all knowledge points. $V_{\max}$ denotes the maximum achievable variance of basic cognitive time within the search space, which serves as a scale factor to ensure both objectives are mapped to the interval $[0, 1]$ to avoid any single objective dominating the optimization process.
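
As an illustration of Equations (9)–(11), the sketch below computes $F_{dist}$ , $F_{density}$ , and $F_{gain}$ for a given sequence. It again assumes that $Pre (s_{t})$ holds direct predecessors; the function names and the example graph are hypothetical. $F_{bal}$ is omitted because $T_{\max}$ and $V_{\max}$ are not given numerically in this section.

```python
import math


def _out_degree(sequence, edges):
    """Out-degree of every knowledge point in the graph."""
    out_deg = {k: 0 for k in sequence}
    for ki, _ in edges:
        out_deg[ki] += 1
    return out_deg


def f_dist(sequence, edges):
    """Eq. (9): largest positional gap between a node and any prerequisite."""
    pos = {k: t for t, k in enumerate(sequence, start=1)}
    return max((pos[kj] - pos[ki] for ki, kj in edges), default=0)


def f_density(sequence, edges):
    """Eq. (10): mean out-degree over the first floor(n/2) positions."""
    half = len(sequence) // 2
    if half == 0:
        raise ValueError("sequence must contain at least two knowledge points")
    out_deg = _out_degree(sequence, edges)
    return sum(out_deg[s] for s in sequence[:half]) / half


def f_gain(sequence, edges, lam=0.1):
    """Eq. (11): out-degree discounted exponentially by position."""
    out_deg = _out_degree(sequence, edges)
    return sum(math.exp(-lam * (t - 1)) * out_deg[s]
               for t, s in enumerate(sequence, start=1))


# Hypothetical example graph (an assumption for illustration).
EDGES = [("A", "C"), ("B", "D"), ("C", "E"), ("D", "E")]
SEQ = ["B", "D", "A", "C", "E"]
print(f_dist(SEQ, EDGES), f_density(SEQ, EDGES), f_gain(SEQ, EDGES))
```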

3. Methodology

3.1. General Framework

The proposed Dueling DQN-HH framework builds an adaptive optimization process based on feedback, and it relies on constant interaction between the agent and the learning path search. At each decision step, the agent observes the current state and selects a suitable low-level operator to change the current solution. This update produces reward signals, and these signals are used to train the network over time, which allows the search to move through complex solution spaces. The overall system structure is shown in Figure 1. The control of low-level operators is treated as a sequential decision problem, so the scheduling strategy can be learned during training. The operator choice does not depend on fixed rules or manual design, and it is formed gradually through long-term reward feedback.

3.2. Domain Isolation Policy

The domain isolation strategy, which combines priority-based encoding and greedy decoding, serves as the foundation for the framework’s cross-domain adaptability. By decoupling the high-level policy learning from specific prerequisite constraints, the Dueling DQN-HH can be applied to other constrained optimization problems simply by redefining the decoding logic and objective functions, without requiring internal architectural changes.

3.2.1. Encoding Strategy

To separate the high-level control strategy of the hyper-heuristic framework from domain-specific characteristics, this work adopts a priority-based representation derived from the random-key mechanism [19,20]. In this encoding scheme, candidate solutions are mapped to continuous-valued vectors, with each component associated with a specific knowledge point. The assigned numerical value indicates the relative precedence of the corresponding knowledge point when constructing the learning sequence.

For a knowledge graph $G = (K, E)$ containing N knowledge nodes, each solution is encoded as an N-dimensional real vector $x \in \mathbb{R}^{N}$ :

$x = [x_{1}, x_{2}, \dots, x_{N}], \quad x_{i} \in [0, 1]$

(13)

where the weight $x_{i}$ represents the relative priority score of the i-th knowledge point $k_{i}$ . The adopted encoding scheme offers clear flexibility because it does not place fixed limits on the search space. This design allows low-level operators to apply crossover, mutation, or gradient-based changes directly in a continuous space without creating invalid solutions. All constraint checks are handled during decoding, and this choice makes operator design much simpler. When solutions are expressed as continuous priority values, the search space stays smooth, which helps the method work well with gradient-based optimization and reinforcement learning methods.

3.2.2. Decoding Mechanism

The decoder’s task is to establish a mapping function $D : \mathbb{R}^{N} \to S$ , transforming the priority vector $x$ into a valid learning sequence $S = \langle s_{1}, s_{2}, \dots, s_{N} \rangle$ that satisfies the DAG constraints. This study proposes a priority-based greedy topological sorting algorithm [21], with the following specific steps:

Initialization: Set $S = \emptyset$ as the empty sequence. Compute the in-degree $d_{in} (v)$ for all nodes $v \in K$ in the graph. Initialize the candidate set C to include all nodes currently without prerequisites:

$C = \{v \in K \mid d_{in} (v) = 0\}$

(14)

Sequence Construction: Perform N iterations. At each step k ( $k = 1, \dots, N$ ):
- Priority-based greedy selection: Select the node $v^{*}$ with the highest priority score from the candidate set C and add it to the sequence:

  $v^{*} = \arg\max_{v \in C} x_{v}$

  (15)
- Sequence Update and Constraint Removal: Add $v^{*}$ to the final learning path sequence S and logically remove it from the graph. Then, traverse all successor nodes $u \in Succ (v^{*})$ and decrement their in-degree by 1.

Candidate Set Update: When the in-degree of a successor node drops to 0, all prerequisite relationships associated with that node have been resolved, allowing it to enter the selectable pool. The candidate set is then modified according to the following state transition rule:

$C_{k + 1} = (C_{k} \setminus \{v^{*}\}) \cup \{u \in Succ (v^{*}) \mid d_{in} (u) = 0\}$

(16)

where $C_{k}$ denotes the candidate set for the current step, $v^{*}$ represents the selected node to be removed, and $Succ (v^{*})$ denotes the set of direct successor nodes of $v^{*}$ .

Figure 2 shows an example of the proposed method, illustrating how the priority score is used during sequence building while all topological rules are still met. Even when the algorithm is forced to pick elements with lower priority, the decoding process keeps the final sequence valid.

The detailed logic of this mapping process is visualized in Figure 3. The flowchart demonstrates how the priority scores guide the incremental construction of the sequence while the candidate set C strictly enforces topological feasibility. A critical feature of this mechanism is its ability to maintain 100% validity even when high-priority nodes are temporarily bypassed due to prerequisite constraints.

Specifically, as shown in the numerical progression of Figure 2, if a priority vector is set as $[0.2, 0.8, 0.6, 0.4, 0.3]$ for nodes $\{A, B, C, D, E\}$ , the decoder first selects B (0.8) as it is a root node. Subsequently, although C (0.6) has a higher priority than D (0.4), D is selected first because C’s prerequisite (A) has not been met. This ensures the output sequence $S = \langle B, D, A, C, E \rangle$ remains 100% topologically valid regardless of the high-level agent’s priority assignments. Ties in priority scores are resolved by node indices.
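
The decoding steps of Equations (14)–(16) can be sketched with a priority queue as the candidate set. This is a minimal illustration rather than the authors' code. Because Figure 2 is not reproduced here, the edge list is an assumption chosen to be consistent with the worked example (A and B are roots, A precedes C, B precedes D); the rule that ties go to the smaller node index is likewise assumed.

```python
import heapq


def decode(priorities, edges):
    """Map a priority vector to a topologically valid learning sequence.

    priorities: dict mapping node -> priority score in [0, 1]
    edges: iterable of (prerequisite, successor) pairs forming a DAG
    """
    nodes = sorted(priorities)                      # fixed index order for tie-breaking
    index = {v: i for i, v in enumerate(nodes)}
    in_deg = {v: 0 for v in nodes}
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        in_deg[v] += 1

    # Eq. (14): candidates are nodes without prerequisites.
    # heapq is a min-heap, so priorities are negated; ties fall to the smaller index.
    heap = [(-priorities[v], index[v], v) for v in nodes if in_deg[v] == 0]
    heapq.heapify(heap)

    sequence = []
    while heap:
        _, _, best = heapq.heappop(heap)            # Eq. (15): highest-priority candidate
        sequence.append(best)
        for u in succ[best]:                        # Eq. (16): release resolved successors
            in_deg[u] -= 1
            if in_deg[u] == 0:
                heapq.heappush(heap, (-priorities[u], index[u], u))

    if len(sequence) != len(nodes):
        raise ValueError("prerequisite graph contains a cycle")
    return sequence


# Assumed edges, consistent with the worked example in the text.
EDGES = [("A", "C"), ("B", "D"), ("C", "E"), ("D", "E")]
PRIORITIES = {"A": 0.2, "B": 0.8, "C": 0.6, "D": 0.4, "E": 0.3}
print(decode(PRIORITIES, EDGES))  # ['B', 'D', 'A', 'C', 'E']
```

Because only nodes with in-degree 0 ever enter the heap, any real-valued priority vector decodes to a feasible sequence, which is the property the encoding relies on.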

3.3. Low-Level Heuristic Operator Library

A core library of ten general heuristic operators is built to balance exploration and exploitation. It includes commonly used methods, such as mutation, crossover, simulated annealing [22], and variable neighborhood search [23], to ensure the method works well across different problem cases. The operators differ in time cost, which lets the algorithm trade off solution quality against running time during the search. These operators are not tied to specific problem domains: they act directly on real-valued priority vectors and require no problem-specific knowledge. The set does not try to cover every possible heuristic. Instead, it includes diverse search behaviors (with different neighborhood sizes, acceptance rules, and time costs) so that the high-level controller can learn effective strategies. As grouped in Figure 4 and detailed in Table 1, the operators are split into two types [24]: perturbation-based operators (LLH1–LLH5) for global exploration and local search operators (LLH6–LLH10) for local refinement.

Perturbation-based operators boost global exploration by making big changes to the solution structure. They rely on mutation and recombination. LLH1 swaps priorities randomly, while LLH3 swaps elements with similar values. LLH2 [25] uses adaptive noise—adjusted based on past results—to keep the search moving forward while avoiding poor local solutions. Crossover operators (LLH4–LLH5) combine information from parent solutions: LLH4 keeps high-quality elements to speed up improvement, and LLH5 uses multi-point crossover to expand the search space.

Local search operators refine current solutions through repeated small improvements. LLH6 [26] and LLH8 use greedy rules to speed up progress, while LLH7 uses the Metropolis rule to escape local traps. LLH9 adjusts neighborhood size dynamically, and LLH10 [27] cycles through perturbation and refinement loops to drive continuous solution improvement.

The selection of these ten operators is designed to ensure structural diversity in the search process. By integrating perturbation-based operators for broad exploration and local search operators for fine-grained exploitation, the hyper-heuristic can adaptively switch search strategies based on the current landscape, effectively mitigating the risk of premature convergence in complex DAG spaces.

3.4. High-Level Strategy Design: Dueling DQN

This study uses an improved Dueling Deep Q-Network as the high-level control for the hyper-heuristic method. The operator selection space is discrete, and the learning process must handle delayed feedback during long decision sequences, so a value-based Dueling DQN fits this task better than policy-based methods. The model does not change the solution directly, and it selects a suitable heuristic operator from the operator library based on the current search state.

3.4.1. State Space

During the iterative search, the agent describes the current search condition using a dynamic state feature vector. While this study employs a 68-dimensional vector to accommodate the 10 selected low-level operators, the state representation is fundamentally modular. For a heuristic library of size L, the total dimensionality D is defined by $D = 38 + 3L$ (for L = 10, D = 38 + 30 = 68). This ensures that the framework can be adapted to different operator sets without structural redesign. The baseline solution feature reflects the current convergence level through objective values and diversity measures. The relation distance feature measures how far the sequence is from topological rules by using the Kendall tau distance [28] and counts of rule violations. The operator performance feature provides short-term memory by tracking the recent success rates and average gains of each operator. The global progress feature uses normalized steps and stagnation counts, and it helps the agent balance exploration and exploitation. The detailed definition and dimensional distribution of the 68-dimensional state feature vector are summarized in Table 2. A more comprehensive description of each feature dimension is provided in Appendix A.

3.4.2. Action Space

To endow the intelligent agent with the capability to flexibly schedule operators, let $a_{i}$ represent all operations required by $LLH_{i}$ . The action space A is defined as the union of the perturbation class and local search class action sets. Let the perturbation class operator set be $O_{perturb} = \{LLH_{1}, \dots, LLH_{5}\}$ , and the local search class operator set be $O_{local} = \{LLH_{6}, \dots, LLH_{10}\}$ . The action space A is formulated as

$A = O_{perturb} \cup O_{local} = \{a_{1}, a_{2}, \dots, a_{10}\}$

(17)

When an agent selects action $a_{t}$ , it directly triggers the corresponding operator to execute. This mechanism enables agents to adaptively generate arbitrary sequences of operators in the form $\langle op_{perturb}, op_{local}, op_{local}, \dots \rangle$ , effectively breaking through the structural constraints of traditional fixed “global–local” operator combinations. Furthermore, based on computational complexity, each action is assigned a specific strategic role and normalized computational cost, as shown in Table 3. Cost values reflect differences in temporal complexity: lightweight perturbation operators carry minimal cost for rapid experimentation, while heavyweight global search operators incur high overhead. The cost parameters are manually predefined and fine-tuned during experimentation to optimize the algorithm’s performance. The algorithm employs a cost penalty mechanism that restricts the invocation of heavyweight operators to situations in which expected returns are sufficiently high.

3.4.3. Reward Function

Traditional reward functions typically rely solely on the improvement in the objective function ( $Δ f$ ), which often leads agents to over-depend on computationally intensive heavyweight operators or become trapped in local optima during search plateaus due to insufficient feedback. To address this, this study designs a reward function mechanism centered on cost awareness. By incorporating nonlinear normalized returns and a multi-level exploration mechanism, the agent is guided to seek an optimal balance between “return–cost” and “exploration–exploitation.” The total reward $R_{total}$ is composed of the following five weighted components:

$R_{total} = R_{main} + R_{shaping} + R_{exploration} + R_{behavior} + R_{penalty}$

(18)

The core mechanism is the cost-aware main reward $R_{main}$ , designed to address the issue of vastly differing computational costs among operators. This paper proposes a cost penalty mechanism based on a nonlinear normalized improvement rate. Unlike simple linear interpolation, this mechanism is more sensitive to minor improvements while smoothing out substantial ones. Let $f_{old}$ and $f_{new}$ denote the objective function values before and after executing an operator, respectively. $Δ f$ is defined as

$Δ f = f_{old} - f_{new}$

(19)

The primary reward $R_{main}$ is defined as a piecewise function to handle different optimization outcomes:

$R_{main} = \begin{cases} 50 \cdot \sqrt{\frac{Δ f}{| f_{old} |}} - λ \cdot C_{op} (a_{i}), & \text{if } Δ f > 0; \\ - 0.5 \cdot λ \cdot C_{op} (a_{i}), & \text{if } Δ f = 0; \\ - 20 \cdot \frac{| Δ f |}{| f_{old} |}, & \text{if } Δ f < 0. \end{cases}$

(20)

where $C_{op} (a_{i})$ denotes the normalized computational cost of operator $a_{i}$ , reflecting the intrinsic variance in temporal complexity across different heuristics. The term $λ$ serves as the cost sensitivity factor, which modulates the agent’s responsiveness to the consumption of computational resources.

The mathematical rationale for this penalty term is to transform the reward from a single-objective improvement measure into a “resource-efficiency” metric. This transition is inspired by the efficiency-centric evaluation frameworks utilized in other complex optimization tasks, such as the power-balanced trajectory planning in UAVs [15] and the redundancy-constrained parameter searches in engineering modeling [16]. Under this “benefit-to-cost” assessment logic, when $Δ f > 0$ , the square root term smooths the normalized improvement to ensure lightweight operators remain competitive with heavyweight ones, optimizing computational cost-effectiveness. For non-improving cases, the reward distinguishes between resource waste ( $Δ f = 0$ ) and solution deterioration ( $Δ f < 0$ ). In the latter, the penalty is proportional only to the degree of degradation. Notably, cost penalties are not superimposed here to prevent network collapse caused by double negative feedback.
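
The piecewise definition in Equation (20) translates directly into code. The sketch below is illustrative: the default `lam=1.0` is a placeholder because the value of the cost sensitivity factor is not stated in this section, and the function assumes $f_{old} \neq 0$ .

```python
import math


def main_reward(f_old, f_new, cost, lam=1.0):
    """Cost-aware main reward of Eq. (20) for a minimization objective.

    cost: normalized computational cost C_op of the chosen operator
    lam:  cost sensitivity factor (placeholder default, not from the paper)
    """
    delta = f_old - f_new                       # Eq. (19)
    if delta > 0:
        # Improvement: square-root-scaled gain minus the operator's cost.
        return 50.0 * math.sqrt(delta / abs(f_old)) - lam * cost
    if delta == 0:
        # No change: half the cost is charged as wasted computation.
        return -0.5 * lam * cost
    # Deterioration: penalty proportional to the relative degradation only.
    return -20.0 * abs(delta) / abs(f_old)


# A 1% improvement versus a 4% improvement with the same operator cost.
print(main_reward(100.0, 99.0, cost=0.2))
print(main_reward(100.0, 96.0, cost=0.2))
```

Before the cost term, a 1% improvement yields 50·√0.01 = 5 and a 4% improvement yields 50·√0.04 = 10, so quadrupling the improvement only doubles the gain. This is the sense in which small improvements are emphasized and large ones are smoothed.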

In addition to the main reward, the reward design includes a shaping reward $R_{shaping}$ [29] to reduce the problem of sparse feedback in reinforcement learning. This shaping reward contains two parts, a distance-based reward $R_{dist}$ and a global proximity reward $R_{prox}$ , which help guide the learning process during the search.

$R_{shaping} = ω_{shaping} \cdot Δ Dist + \max (0, \frac{50 - f_{new} - f_{best}}{10})$

(21)

where $Δ Dist$ denotes the reduction in the average forward dependency distance, and $f_{best}$ represents the historical best objective value achieved along the current search trajectory. This term provides the agent with gradient guidance when the objective function value does not change significantly.

To keep population diversity and to find effective operators, the reward design includes an exploration reward based on operator usage and a behavior reward based on success rate. The exploration reward $R_{exploration}$ is set according to the usage rate $u (a_{i})$ of each operator within a sliding window, and this value changes during the search.

$R_{exploration} = \begin{cases} + 5.0, & \text{if } u (a_{i}) < 3\% \\ + 3.0, & \text{if } 3\% \leq u (a_{i}) < 5\% \\ + 1.0, & \text{if } 5\% \leq u (a_{i}) < 10\% \\ + 0.0, & \text{if } 10\% \leq u (a_{i}) \leq 30\% \\ - 3.0, & \text{if } u (a_{i}) > 30\% \end{cases}$

(22)
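
The tiers of Equation (22) can be paired with a simple sliding-window usage counter. The sketch below is an assumption-laden illustration: the class name `UsageTracker`, the window length, and the choice to compute usage over past actions only are not specified in the text.

```python
from collections import deque


class UsageTracker:
    """Tracks how often each operator was chosen within a sliding window."""

    def __init__(self, window=100):             # window length is a placeholder
        self.history = deque(maxlen=window)

    def record(self, action):
        self.history.append(action)

    def usage(self, action):
        """Fraction of the window occupied by `action` (0.0 if the window is empty)."""
        if not self.history:
            return 0.0
        return sum(1 for a in self.history if a == action) / len(self.history)


def exploration_reward(usage):
    """Tiered exploration reward of Eq. (22); `usage` is a fraction in [0, 1]."""
    if usage < 0.03:
        return 5.0
    if usage < 0.05:
        return 3.0
    if usage < 0.10:
        return 1.0
    if usage <= 0.30:
        return 0.0
    return -3.0


tracker = UsageTracker(window=10)
for action in [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]:
    tracker.record(action)
print(exploration_reward(tracker.usage(0)))     # operator 0 fills 40% of the window
print(exploration_reward(tracker.usage(2)))     # operator 2 has not been used
```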

The system also forms a closed-loop feedback mechanism through behavioral rewards

R_{behavior}

and dynamic penalties

R_{penalty}

. Specifically, the agent rewards operators exhibiting high success rates and significant historical improvements based on real-time performance tracking. Conversely, it imposes dynamic penalties on operators that fail to improve the solution or trigger the rollback mechanism due to catastrophic degradation, thereby preventing training instability [30].

R_{behavior}

is used to reward the long-term stability of operators. It is defined as:

R_{behavior} = Φ (S_{r}) + Ψ (\bar{Δ f}) .

(23)

where

S_{r}

represents the operator success rate.

Φ (S_{r}) = \{\begin{matrix} + 4.0, & if S_{r} > 0.7 \\ + 2.0, & if 0.5 < S_{r} \leq 0.7 \\ - 3.0, & if S_{r} < 0.3 and N_{a} > 20 \\ 0, & otherwise \end{matrix}

(24)

N_{a}

represents the cumulative call count of the operator.

\bar{Δ f}

denotes the historical average improvement.

Ψ (\bar{Δ f}) = \{\begin{matrix} ω_{h} \cdot μ_{Δ f}, & if \bar{Δ f} > λ_{h} \cdot μ_{Δ f} \\ ω_{m} \cdot μ_{Δ f}, & if λ_{m} \cdot μ_{Δ f} < \bar{Δ f} \leq λ_{h} \cdot μ_{Δ f} \\ 0, & otherwise \end{matrix}

(25)

where

μ_{Δ f}

is the expected improvement, i.e., the historical average gain across all operators, which ties the reward benchmark to the search difficulty;

λ_{h}

and

λ_{m}

denote relative thresholds defining performance tiers to identify superior operators; and

ω_{h}

and

ω_{m}

represent reward scaling factors that modulate the magnitude of this behavioral component within the total reward

R_{total}

.
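
Equations (23)–(25) can be read as two independent lookups that are summed. The sketch below follows the published thresholds for Φ; the default values of lam_h, lam_m, w_h and w_m are placeholders chosen only for illustration, since the text does not state them, and the sketch assumes lam_m < lam_h.

```python
def phi(success_rate: float, n_calls: int) -> float:
    """Success-rate term of Equation (24)."""
    if success_rate > 0.7:
        return 4.0
    if success_rate > 0.5:
        return 2.0
    if success_rate < 0.3 and n_calls > 20:
        return -3.0
    return 0.0


def psi(avg_gain: float, mean_gain: float,
        lam_h: float = 1.5, lam_m: float = 1.0,
        w_h: float = 2.0, w_m: float = 1.0) -> float:
    """Improvement term of Equation (25).

    avg_gain is the operator's historical average improvement and
    mean_gain is the average gain across all operators (mu).
    The lam_* and w_* defaults are illustrative placeholders.
    """
    if avg_gain > lam_h * mean_gain:
        return w_h * mean_gain
    if avg_gain > lam_m * mean_gain:
        return w_m * mean_gain
    return 0.0


def behavior_reward(success_rate: float, n_calls: int,
                    avg_gain: float, mean_gain: float) -> float:
    """Equation (23): sum of the success-rate and improvement terms."""
    return phi(success_rate, n_calls) + psi(avg_gain, mean_gain)
```

Because Ψ is scaled by the population-wide mean gain, the bonus shrinks automatically late in the search, when all operators produce small improvements.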

R_{penalty}

is used to constrain inefficient operators in real-time and includes the blacklist mechanism and rollback penalty. It is defined as:

R_{penalty} = \min (R_{blacklist}, R_{rollback}) .

(26)

R_{blacklist}

and

R_{rollback}

are defined to address long-term inefficiency and instantaneous deterioration, respectively.

R_{blacklist} = \{\begin{matrix} - 5.0, & if S_{r} (a_{i}) < θ_{fail} and N_{a} > N_{\min} \\ 0, & otherwise \end{matrix}

(27)

where

S_{r} (a_{i})

denotes the recent success rate of operator

a_{i}

, and

θ_{fail}

is the trigger threshold (e.g.,

5 %

), while the minimum call count N_{\min} ensures that the penalty is triggered only after sufficient statistical samples have accumulated.

R_{rollback} = \{\begin{matrix} - 3.0, & if f_{new} > f_{old} \cdot (1 + δ) \\ 0, & otherwise \end{matrix}

(28)

where

δ

denotes the tolerance deviation.
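
A compact sketch of Equations (26)–(28) is given below. It assumes a minimization objective with positive values. The 5% failure threshold follows the example in the text, whereas n_min = 20 and delta = 0.05 are illustrative placeholders that the paper does not specify.

```python
def blacklist_penalty(recent_success_rate: float, n_calls: int,
                      theta_fail: float = 0.05, n_min: int = 20) -> float:
    """Equation (27): penalize an operator that keeps failing after enough calls."""
    if recent_success_rate < theta_fail and n_calls > n_min:
        return -5.0
    return 0.0


def rollback_penalty(f_new: float, f_old: float, delta: float = 0.05) -> float:
    """Equation (28): penalize a move that degrades the objective beyond tolerance."""
    if f_new > f_old * (1.0 + delta):
        return -3.0
    return 0.0


def penalty(recent_success_rate: float, n_calls: int,
            f_new: float, f_old: float) -> float:
    """Equation (26): the more negative of the two penalties applies."""
    return min(blacklist_penalty(recent_success_rate, n_calls),
               rollback_penalty(f_new, f_old))
```

Taking the minimum rather than the sum means that an operator that is both blacklisted and rolled back is charged -5.0, not -8.0, so the penalty stays bounded.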

The internal logic of the composite reward mechanism is visualized in Figure 5. It illustrates how raw inputs—such as performance gains (

Δ f

), operator costs (

C_{o p}

), and historical statistics—are processed by five parallel modules and aggregated into the final reward

R_{t o t a l}

. This structure ensures the agent receives dense, multi-dimensional feedback to guide the optimization.

3.4.4. Network Architecture

The Dueling DQN framework in this study improves learning efficiency in high-dimensional state spaces where reward signals are sparse by separating state value learning from action advantage learning. To reduce training instability and limit value fluctuation [31,32], the framework uses two deep neural networks with the same structure, including a policy network

θ

that is updated during training and a target network

θ^{-}

that is updated at fixed intervals. Both networks follow the dueling design, and each network maps the normalized state vector

g_{t} \in R^{N_{i n}}

to Q-values over the action set A. Figure 6 shows the detailed structure of the network.

During the feature extraction phase, the network first maps the raw states to nonlinear high-order feature representations through a shared multilayer perceptron. Let

f_{share} (\cdot; θ_{s})

denote the mapping function of the shared layer, parameterized by

θ_{s}

.

θ_{s} = {W_{1}, b_{1}, W_{2}, b_{2}}

represents the set of all trainable parameters (weights and biases) in the shared layers. This process combines the ReLU activation function with dropout regularization [33] to enhance the model’s generalization capability. The mathematical expression is:

h_{t} = ReLU (W_{2} \cdot Dropout (ReLU (W_{1} g_{t} + b_{1})) + b_{2})

(29)

where

h_{t}

is the resulting abstract feature vector.

Subsequently, to overcome the limitation of traditional DQN in distinguishing between state value and action advantage, the state

g_{t}

is first processed by the shared layer to extract the abstract feature vector

h_{t}

, which is then fed into the dual-stream branching architecture of the Dueling framework. The value stream employs the transformation

ϕ_{v}

to estimate the scalar state value

V (g_{t}; θ_{v})

, while the advantage stream uses the transformation

ϕ_{a}

to estimate the advantage vector

A (g_{t}, a; θ_{a})

. The calculation formulas for both are as follows:

V (g_{t}; θ_{v}) = ϕ_{v} (h_{t}), A (g_{t}, a; θ_{a}) = ϕ_{a} (h_{t})

(30)

Ultimately, to ensure the unique identifiability of V and A, the network employs a mean-centered constraint at the output to aggregate the two streams. The final action value function

Q (g_{t}, a; θ)

is defined as:

Q (g_{t}, a; θ) = V (g_{t}; θ_{v}) + (A (g_{t}, a; θ_{a}) - \frac{1}{| A |} \sum_{a^{'} \in A} A (g_{t}, a^{'}; θ_{a}))

(31)

where

θ = {θ_{s}, θ_{v}, θ_{a}}

represents the complete set of trainable parameters, and

| A |

denotes the size of the action space. The variable

a^{'} \in A

is a placeholder used to iterate over all possible operators in the action space during the calculation of the average advantage. By subtracting the mean of the advantage values, this aggregation formula forces

V (g_{t})

to directly approximate the actual value of the state. This enables the network to converge rapidly even in states where the choice of action makes little difference to the outcome.
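
The effect of the mean-centered aggregation in Equation (31) can be checked with a few lines of plain Python. This sketch covers only the aggregation step, using hand-picked numbers in place of the outputs of the two network streams. The function name is an assumption of this example.

```python
def dueling_q_values(state_value: float, advantages: list[float]) -> list[float]:
    """Combine V and A into Q with the mean-centered rule of Equation (31)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]


q = dueling_q_values(2.0, [1.0, 0.0, -1.0])          # [3.0, 2.0, 1.0]

# Shifting every advantage by the same constant leaves Q unchanged,
# which is what makes V and A uniquely identifiable.
q_shifted = dueling_q_values(2.0, [11.0, 10.0, 9.0])  # [3.0, 2.0, 1.0]
```

Note that the mean of the resulting Q-values equals the state value (2.0 here), so the value stream is forced to carry the state-level information while the advantage stream only ranks the operators.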

3.4.5. Training Strategy

This experiment employs a Dueling DQN algorithm integrated with Prioritized Experience Replay (PER) to optimize network parameters

θ

. By minimizing temporal difference errors, the algorithm leverages the structural advantages of the dueling architecture to mitigate value fluctuation risks and improve sample efficiency. The target value

y_{t}

is formulated based on the standard DQN mechanism:

y_{t} = r_{t} + γ \max_{a^{'}} Q (g_{t + 1}, a^{'}; θ)

(32)

where

r_{t}

denotes the immediate reward obtained by the agent after executing action

a_{t}

in state

g_{t}

, and

γ

is the discount factor. The loss function

L (θ)

consists of a weighted mean squared error and an entropy regularization term

H

. The entropy coefficient

λ

is introduced to encourage exploration and prevent premature convergence:

L (θ) = \frac{1}{B} \sum_{i = 1}^{B} w_{i} \cdot {(y_{i} - Q (g_{i}, a_{i}; θ))}^{2} - λ H (π (\cdot ∣ g_{i}))

(33)

where B is the mini-batch size, and

π (\cdot ∣ g_{i})

is the action probability distribution derived from the Q-values. In the experience replay module, the probability

P (i)

of sampling a transition i is proportional to its prediction error

| δ_{i} |

, where

δ_{i} = y_{i} - Q (g_{i}, a_{i}; θ)

represents the TD-error. Transitions with larger errors receive higher replay priority:

P (i) = \frac{{(| δ_{i} | + ϵ)}^{α}}{\sum_{k} {(| δ_{k} | + ϵ)}^{α}}

(34)

where

ϵ

is a small positive constant ensuring all samples have a non-zero probability of being selected and

α

controls the strength of prioritization. To correct distribution biases introduced by non-uniform sampling, an importance sampling weight

w_{i} = {(N \cdot P (i))}^{- β}

is incorporated during gradient updates, where N is the current buffer size and

β

is the bias correction coefficient. The complete Dueling DQN-HH training and parameter update process is summarized in Algorithm 1.

Algorithm 1 Dueling DQN-HH training and parameter update process

Require: Max episodes M, Batch size B, $α, β$
1: Initialize $θ$, Buffer D
2: for episode $e = 1$ to M do
3:   Initialize state $s_{t}$
4:   for step $= 1$ to T do
5:     Select $a_{t}$ based on $Q (s_{t}, \cdot; θ)$
6:     Execute $op_{a_{t}}$, observe $r_{t}, s_{t + 1}$
7:     Store $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in D
8:     if $| D | \geq B$ then
9:       Sample Batch by $P (i)$
10:      Compute $w_{i} = {(N \cdot P (i))}^{- β}$
11:      Calculate $y_{t}$
12:      Update $θ$
13:    end if
14:    $s_{t} \leftarrow s_{t + 1}$
15:  end for
16: end for
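
Steps 9 and 10 of Algorithm 1 (prioritized sampling by Equation (34) and the importance-sampling weights) can be sketched with the standard library alone. The values alpha = 0.6, beta = 0.4 and eps = 1e-6 are illustrative placeholders, not settings reported in the paper, and sampling with replacement is an assumption of this sketch.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities of Equation (34) from TD errors."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights w_i = (N * P(i))^(-beta), N = buffer size."""
    n = len(probs)
    return [(n * p) ** (-beta) for p in probs]


def sample_batch(probs, batch_size, rng=random):
    """Draw transition indices (with replacement) according to P(i)."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

With uniform priorities every weight equals 1, so the update reduces to ordinary replay. Many implementations additionally divide the weights by their maximum for numerical stability; that step is not described in the text and is therefore left out here.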

4. Experiments and Results Analysis

4.1. Dataset

This study uses the MOOCCube dataset as a benchmark to construct a concept dependency graph in the computer science field. We carried out a rigorous preprocessing procedure on the original data to ensure the logical validity of learning path planning. We first extracted the core concept nodes of the field. Then, we filtered out low-confidence edges, keeping only the “predecessor–successor” dependencies annotated by experts. Finally, we used the weakly connected component algorithm to extract the maximum connected subgraph and verified the network through topological sorting to ensure it strictly meets the directed acyclic graph constraint.
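
The last two preprocessing steps (largest weakly connected component, then a topological-sort check) can be sketched as follows. This is a generic standard-library illustration under stated assumptions, not the authors' pipeline: edges are (prerequisite, successor) pairs, concepts without any edge are ignored, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def largest_weak_component(edges):
    """Keep only the edges inside the largest weakly connected component."""
    undirected = defaultdict(set)
    for u, v in edges:
        undirected[u].add(v)
        undirected[v].add(u)
    seen, best = set(), set()
    for start in undirected:
        if start in seen:
            continue
        # Breadth-first search that ignores edge direction.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in undirected[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return [(u, v) for u, v in edges if u in best and v in best]


def topological_order(edges):
    """Kahn's algorithm; returns None if the graph contains a cycle."""
    successors, indegree = defaultdict(list), defaultdict(int)
    nodes = set()
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order if len(order) == len(nodes) else None
```

A graph passes the check when topological_order returns a full ordering; a None result signals a cycle that would make prerequisite-respecting paths impossible.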

The final constructed experimental graph

G = (K, E)

contains

N = 287

knowledge points and

M = 725

prerequisite dependency edges. As shown in Figure 7, the network is highly sparse and strictly satisfies the topological constraints of a directed acyclic graph. This structure leads to a severe sparse-reward problem in the reinforcement learning environment. The hierarchical distribution analysis in Figure 8 also reveals significant deep-level dependencies, so the algorithm must solve a complex sequential decision problem in a huge combinatorial space while satisfying the prerequisite constraints.

4.2. Experimental Setup and Benchmark Algorithm

All algorithms are implemented in Python 3.11, with the deep reinforcement learning module built on PyTorch 2.9.1. Each experiment is run independently 10 times to reduce the influence of random factors, and all runs share a unified termination condition: a maximum of 4000 function evaluations. For the proposed Dueling DQN-HH, parameters are tuned via preliminary experiments: replay buffer size = 2000, batch size = 64, and the Adam optimizer with a learning rate of 0.001. All experiments are conducted on a Lenovo laptop with an Intel i5 processor and an NVIDIA GeForce MX 350 graphics card.

To comprehensively and fairly evaluate the proposed method, various representative metaheuristic and swarm intelligence algorithms are selected as benchmarks. These include stochastic hyperheuristic (SRHH), genetic algorithm hyperheuristic (GA-HH), iterative local search (ILS-HH), variable neighborhood search (VNS-HH), memetic algorithm [34] (MA-HH), particle swarm optimization (PSO), and ant colony optimization (ACO). The ACO algorithm follows the implementation in reference [35].

PSO and ACO adhere to their original search mechanisms from the respective literature. All other benchmark hyperheuristics are implemented on this paper’s unified low-level operator library, and their typical search logic is reproduced by restricting the callable operator subsets. For example, SRHH randomly selects operators from the full library; GA-HH only uses crossover and mutation; ILS-HH and VNS-HH employ perturbation and neighborhood transformation operators, respectively; and MA-HH combines crossover, mutation, and hill-climbing to balance global and local search.

All algorithms share identical function evaluation budgets and termination conditions. This setting eliminates interference from implementation differences and resource allocation gaps, ensuring results truly reflect the decision-making and learning capability differences of high-level search strategies.

4.3. Performance Analysis of Minimizing Learning Time

This paper takes minimizing total learning time as the main objective, comparing the proposed Dueling DQN-HH framework with representative benchmark algorithms to systematically evaluate its optimization capability under strong prerequisite constraints.

As shown in Figure 9, Dueling DQN-HH maintained a continuous, stable performance improvement throughout the search process. In contrast, GA-HH, ILS-HH, VNS-HH and other methods generally entered a performance plateau after 1000 function evaluations, with little room for further improvement. This indicates that in search spaces with complex DAG topological constraints, algorithms relying on fixed or rule-driven operator scheduling are prone to premature convergence. Dueling DQN-HH, however, retains long-term exploration capabilities via its dynamic operator selection mechanism—and achieved the lowest

F_{time}

among all compared methods. Statistical results in Table 4 confirm its superior mean and median values. Its solution distribution is also concentrated, with a coefficient of variation of only 1.1%. The Wilcoxon rank-sum test shows statistically significant differences between Dueling DQN-HH and all benchmarks (

p < 0.05

), and Cohen’s d absolute values all exceed 0.8—proving the practical significance of this advantage.
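
The two descriptive statistics quoted above can be reproduced with Python's statistics module. The sketch assumes the pooled-standard-deviation form of Cohen's d, which the text does not specify, and the run values in the usage note are hypothetical numbers, not results from Table 4.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def cohens_d(sample_a, sample_b):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a, var_b = stdev(sample_a) ** 2, stdev(sample_b) ** 2
    pooled = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / pooled
```

As a usage example with made-up data, calling cohens_d([100.0, 101.0, 99.0, 100.0], [110.0, 112.0, 108.0, 110.0]) yields a large negative value, meaning the first method's objective is lower by far more than the 0.8 threshold for a large effect. The rank-sum test itself is available as scipy.stats.ranksums if SciPy is installed.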

Beyond solution quality, Dueling DQN-HH also demonstrates excellent optimization efficiency: under the same function evaluation budget, it achieves the highest efficiency per unit evaluation (Eff. = 0.3928 h/eval), enabling continuous improvement with fewer invalid evaluations. The efficiency metric (Eff) is defined as the average reduction in total learning time per function evaluation. A higher Eff value indicates that the algorithm can identify more significant optimizations with fewer computational resources. This advantage stems from its cost-aware operator scheduling mechanism: it prioritizes lightweight perturbation operators for rapid early exploration, introducing high-cost operators only when necessary to balance solution quality and computational consumption.

To further reveal the distributional evolution of planning paths within the knowledge topology, this study employs a tree layout to visualize the optimal path in a stage-wise manner (Figure 10). Structural analysis shows that Dueling DQN-HH exhibits a significant semantically coherent planning structure over a graph with 287 knowledge nodes, where the path evolution can be divided into three progressive stages: foundational construction, core integration, and global expansion.

In the foundational construction stage (Steps 1–95), the algorithm prioritizes low in-degree foundational meta-knowledge nodes to rapidly establish a stable underlying structure, such as reset circuits, sorting algorithms, and numeral system conversion. During the core integration stage (Steps 96–190), the path converges toward the system layer with the densest dependencies, frequently covering key concepts such as virtual memory, interrupt systems, and concurrency control to achieve deep software–hardware associations. Finally, in the global expansion stage (Steps 191–287), the path expands toward application-level nodes, including distributed systems and network protocols, completing the transition from standalone system principles to modern Internet architectures.

Overall, the evolution pattern from shallow to deep and from local to global is consistent with the cognitive construction process in computer science, validating the planning effectiveness of the proposed framework and revealing the agent’s capability to organize and reason over large-scale heterogeneous knowledge topologies.

Combined with the operator evolution process in Figure 11, the algorithm tends to call reorganization operators like biased crossover and multi-point crossover rather than local fine-tuning operators under this optimization objective. This indicates that global structural reorganization plays a more important role in minimizing learning time. The underlying mechanism of this operator scheduling behavior and its adaptability across different optimization objectives will be further analyzed in Section 4.6.

4.4. Ablation Experiment

4.4.1. Network Architecture Ablation Analysis

This section uses ablation experiments to examine how key network components affect the overall performance of Dueling DQN-HH. It compares four network settings: the full Dueling DQN-HH model, a version with the Double DQN structure, a variant without prioritized experience replay (−PER), and a basic Pure DQN-HH that removes the dueling structure.

As illustrated in Figure 12, the full Dueling DQN-HH learns faster and trains more stably. Pure DQN-HH learns much more slowly and shows larger fluctuations in its final results, indicating that a single Q-network cannot reliably handle operator scheduling under sparse rewards and long decision sequences. Adding Double DQN reduces overestimation and aids learning, but this variant still underperforms the full model.

Quantitative results in Table 5 further confirm these observations. The full model achieves the best performance in both

F_{time}

and result stability. Removing PER leads to a similar convergence trend but a worse final performance ceiling and standard deviation. This proves PER enhances the utilization efficiency of key information samples in complex search environments, improving policy learning stability.

To sum up, the Dueling architecture has a core role: it separates state value from action advantage and improves the stability of value estimation. Double DQN and the PER mechanism work well together; they curb estimation bias and speed up convergence. The combined effect of these three parts forms the key basis for Dueling DQN-HH’s steady performance edge in complex learning path optimization tasks.

4.4.2. Analysis of the Effectiveness and Efficiency of Reward Mechanisms

This section compares the complete model with variants (missing different reward functions) under the same parameters to evaluate how each reward component affects search performance and operator scheduling. Unlike the main experiment—which focuses on overall optimality—these ablation experiments highlight performance gaps from missing mechanisms to show the reward design’s real impact on decision-making.

Experimental results confirm that reward shaping and the exploration mechanism jointly guarantee optimization accuracy. As shown in Figure 13a, removing reward shaping sharply slows the model’s early improvement speed—this mechanism provides frequent positive feedback when the objective function changes little, easing slow convergence. Removing the exploration mechanism also drops the algorithm’s average improvement rate to 7.4, the lowest among variants (Table 6). This proves it helps escape local optima and maintain search diversity.

The cost-aware mechanism boosts computational efficiency; its impact on the final solution quality is limited. It curbs overuse of high-cost, low-return operators, enabling a more reasonable operator mix: adding this mechanism reduces the algorithm’s wall time from 849.4 s to 598.8 s, a nearly 30% efficiency gain. This proves integrating operator computation costs into rewards guides the agent to balance solution quality and resource use.

In short, the three reward modules are complementary: reward shaping accelerates early convergence, the exploration mechanism sets the search performance lower bound, and the cost-aware mechanism boosts efficiency. Together, they let Dueling DQN-HH balance quality, stability, and resource efficiency in complex optimization environments.

4.5. Supplementary Verification

This section further validates the generalization performance of the Dueling DQN-HH framework by changing the optimization objective and dataset.

4.5.1. Performance Verification of Different Objective Functions

To verify the framework’s adaptability across teaching scenarios, four representative objective functions are selected for comparative experiments—Distance, Balanced, Density, and Early Gain, as defined in Section 2.3.2. Benchmark algorithms for comparison include SRHH, GA-HH, ILS-HH, VNS-HH, MA-HH, PSO, and SEACO.

As shown in Figure 14, Dueling DQN-HH exhibits strong convergence and stable performance advantages across all objective functions. For the minimization tasks (Distance and Balanced, Figure 14A,B), the method rapidly reduces the objective value in the early search stage and maintains it at a low level throughout optimization, demonstrating excellent convergence efficiency and local search capability. For the maximization tasks (Density and Early Gain, Figure 14C,D), it quickly reaches a high objective value and retains its leading edge in later iterations. This confirms that its reward mechanism effectively guides the model to explore complex solution spaces continuously.

Box plot analysis in Figure 15 further clarifies the performance distribution and stability differences between algorithms. Dueling DQN-HH achieves optimal or near-optimal statistical performance across all four objectives: it yields higher median values for Density and Early Gain and lower cost values for Distance and Balanced. Meanwhile, it shows a smaller interquartile range and fewer outliers, indicating high stability and reproducibility under random initial conditions. In contrast, SEACO displays obvious performance fluctuations on some objectives, and PSO is prone to local optima.

These results confirm that Dueling DQN-HH is not tailored to a single objective function. It maintains consistent performance advantages across objectives with different mathematical features and optimization directions, proving its high-level operator scheduling strategy has good objective-independent adaptability.

4.5.2. Cross-Dataset Generalization Validation

To evaluate the architectural re-applicability of the Dueling DQN-HH framework, a new agent was initialized and trained from scratch on the psychology knowledge dataset (PsyDataset), rather than directly transferring the policy learned from the computer science (CS) dataset. This PsyDataset contains 431 nodes and is introduced here for comparative evaluation to further verify the generalization ability of the proposed Dueling DQN-HH framework in heterogeneous knowledge graph environments. Compared to the computer science dataset, PsyDataset exhibits a more densely interconnected structure. As shown in Figure 16, this dataset has a highly modular topology, with knowledge dependency patterns quite different from those in the computer science field—it thus provides a representative test scenario for checking the algorithm’s cross-domain adaptability.

Experimental results in Figure 17 show that Dueling DQN-HH still maintains high global search efficiency and stable optimization performance in this heterogeneous environment. Specifically, the convergence curve in Figure 17a shows that the method improves rapidly in the early search stages and finally stabilizes the total learning time at around 11,200 h. Statistical results in Figure 17b,c confirm its high robustness across multiple independent runs, with an average improvement rate over 35% that clearly outperforms the baseline algorithms. Beyond overall performance, changes in operator scheduling behavior further highlight the model’s adaptive features. As shown in Figure 17d, the algorithm calls operators like Multi-point Crossover and Heuristic Mutation more frequently on PsyDataset than some global reorganization operators, while also increasing the use frequency of local search operators. This scheduling adjustment indicates that Dueling DQN-HH can dynamically adjust operator combinations according to the topological characteristics of knowledge graphs, maintaining effective search efficiency while avoiding damage to modular dependencies.

The proposed Dueling DQN-HH framework does not rely on structural assumptions unique to a single dataset. It adaptively adjusts its operator scheduling strategy based on the topological characteristics of different knowledge graphs, enabling stable and effective optimization performance in cross-domain learning path optimization tasks.

4.6. Operator Evolution and Strategy Analysis

This section systematically analyzes operator scheduling behavior from the perspective of dynamic search process evolution to explore how Dueling DQN-HH’s strategy forms under different optimization objectives and datasets. To avoid single evaluation fluctuations interfering with trend judgment, a stage division criterion is defined based on the sliding window smooth improvement rate. This criterion splits the search process into exploration and exploitation periods—characterizing the algorithm’s shift from global search to local refinement—and is grounded in two methods: the relative improvement judgment principle for numerical optimization [36] and performance stagnation detection for evolutionary computation [37]. It quantifies the objective function’s smooth improvement rate to judge the transition from global exploration to local refinement.

Let

f_{t}

denote the best objective value found up to the t-th evaluation. Random noise in heuristic search means single evaluation fluctuations cannot reflect true convergence trends. Thus, a sliding window of size

W = 50

is introduced, and the smoothing improvement rate

Δ_{t}

is defined as the arithmetic mean of relative changes within the window:

Δ_{t} = \frac{1}{W} \sum_{i = t - W + 1}^{t} \frac{| f_{i} - f_{i - 1} |}{| f_{i - 1} |}

(35)

This indicator is used to judge convergent steady states, with the starting moment

τ

satisfying two conditions:

Δ_{t} < ϵ

and this low level maintained for P subsequent steps. Based on this standard, the search process has two stages: when

t < τ

, the algorithm is in the exploration phase, with drastic improvement rate fluctuations and a focus on broad-area optimization, and when

t \geq τ

, it enters the exploitation phase, performing refined optimization within converged local regions.
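
The stage-division criterion can be sketched directly from Equation (35) and the two conditions on τ. The window size W = 50 follows the text; the threshold eps = 1e-4 and the persistence length patience = 100 (the P in the text) are placeholders, since their values are not given, and the sketch assumes the best objective value is never zero.

```python
def smoothed_improvement_rate(best_values, t, window=50):
    """Equation (35): mean relative change of the best objective over W steps."""
    start = t - window + 1
    if start < 1:
        raise ValueError("need at least `window` earlier evaluations")
    return sum(
        abs(best_values[i] - best_values[i - 1]) / abs(best_values[i - 1])
        for i in range(start, t + 1)
    ) / window


def transition_point(best_values, window=50, eps=1e-4, patience=100):
    """First t with rate < eps that stays below eps for `patience` more steps.

    Returns None when no such point exists in the trace.
    """
    below = 0
    for t in range(window, len(best_values)):
        if smoothed_improvement_rate(best_values, t, window) < eps:
            below += 1
            if below > patience:
                return t - patience
        else:
            below = 0
    return None
```

Evaluations before the returned index belong to the exploration phase and those from it onward to the exploitation phase. A trace that keeps improving steadily returns None, meaning that the run never left exploration under these thresholds.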

As shown in Figure 18, stage transition timing varies sharply across optimization objectives. Fast-converging tasks like Balanced trigger the exploitation phase earlier, while complex search space tasks like Time and Density extend the exploration phase. This difference proves Dueling DQN-HH adaptively adjusts exploration–exploitation time allocation based on objective function search terrain characteristics, instead of adopting a fixed search rhythm.

Figure 19 presents a heatmap of operator usage during the exploitation phase across objectives and datasets. Overall, the BiasedCross and MultiCross operators maintain high call rates, forming the algorithm’s basic global search mechanism. However, operator combination patterns vary significantly with objectives and dataset characteristics. On the CS dataset, operator preferences correlate clearly with optimization objectives: the Time task raises local search operator frequency for fine-grained solution compression, while the Density task increases BiasedCross usage, highlighting the importance of global structural restructuring. On the PsyDataset, with its strong modular structure and long-chain dependencies, the algorithm sharply reduces biased operator calls and increases usage of local search operators like HillClimb and LocalHC, avoiding modular structure disruption.

This differentiated operator evolution behavior shows Dueling DQN-HH does not learn fixed operator preference patterns. Instead, it dynamically adjusts scheduling strategies based on search feedback, balancing global exploration and local exploitation adaptively. This characteristic provides crucial support for the framework, enabling stable performance across different optimization objectives and heterogeneous knowledge graph environments.

5. Conclusions and Outlook

This paper proposes a hyperheuristic operator scheduling framework called Dueling DQN-HH. It is based on deep reinforcement learning and aims to solve learning path optimization problems—ones with complex preconditions and diverse optimization objectives. The framework learns operator selection strategies to dynamically and adaptively control the search process. This way, it avoids the reliance on fixed rules or human experience that traditional hyperheuristic methods have. Multiple experimental results show that the proposed method performs better than several representative comparative algorithms in three aspects: solution quality, convergence stability, and computational efficiency. These results come from tests on various optimization objectives and heterogeneous knowledge graph datasets, proving the method’s effectiveness and robustness in complex learning path optimization tasks.

The optimization of learning time (

F_{t i m e}

) has substantial practical significance for both learners and educators. The proposed framework enhances the overall learning experience by reducing redundant review cycles and generating a logically coherent progression of knowledge. By enabling more efficient navigation within complex knowledge graphs, the system helps sustain learner motivation and alleviates cognitive overload. As a result, students can advance more smoothly without expending effort on disorganized or suboptimal content sequences. The time savings further allow learners to shift their focus from rote memorization to higher-order cognitive activities, such as creative problem-solving and in-depth exploration, fostering a deeper mastery of the subject matter.

Further ablation experiments and operator evolution analysis show that Dueling DQN-HH does not form a fixed operator preference pattern. Instead, it adjusts the balance between exploration and exploitation adaptively according to the characteristics of the objective function and the knowledge graph topology. This structure-aware operator scheduling capability is a key factor for the framework’s objective independence and cross-dataset generalization performance. Despite these strengths, the interpretability of the operator scheduling mechanism remains limited, as the high-level decision logic within the neural network acts as a “black box.” Therefore, future research will aim to deconstruct the network’s decision-making process to uncover the specific rationale behind operator selections under diverse problem states, thereby enhancing the transparency and reliability of the hyper-heuristic framework.

Future work will focus on three key directions: multi-objective optimization modeling, adaptive expansion of the operator space, and online learning mechanisms integrated with adaptive educational platforms. By leveraging real-time learner data, such as quiz scores and engagement levels, as dynamic inputs, the framework can function as a responsive recommendation engine. This would allow the agent to refine its strategy in real time to address learning plateaus or knowledge gaps, transitioning from static path optimization to a truly personalized, live learning environment. We will also explore the framework’s potential in other path optimization problems with complex constraints and long-sequence decision characteristics. Given its robust handling of long-sequence constraints and its structure-aware decision logic, the Dueling DQN-HH framework holds significant potential for cross-domain applications, including logistics optimization, industrial workflow scheduling, and large-scale curriculum design, where complex precedence relationships must be managed efficiently.

Author Contributions

Conceptualization, Y.-W.Z.; Methodology, Y.-W.Z. and M.-Y.Z.; Software, W.-K.X.; Validation, M.-Y.Z., W.-K.X. and J.-D.L.; Investigation, W.-K.X., X.-Y.Z. and J.-D.L.; Writing—original draft, M.-Y.Z.; Writing—review and editing, Y.-W.Z.; Visualization, M.-Y.Z. and X.-Y.Z.; Supervision, Y.-W.Z.; Project administration, Y.-W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Description of the State Feature Vector

Table A1. Detailed description of the 68-dimensional state feature vector.

Feature Category	Dim.	Specific Feature Items	Feature Description
Basic Features	25	Encoding vector statistical features (10 dimensions)	Mean, standard deviation, minimum value, maximum value, median, 25th/75th percentiles, skewness, kurtosis, non-zero ratio.
		Objective value-related features (5 dimensions)	Current objective value, normalized objective value, moving average of objective value, objective value change rate, gap from the optimal objective value.
		Historical improvement features (5 dimensions)	Recent improvement amount, cumulative improvement amount, improvement frequency, maximum single improvement, consecutive improvement count.
		Operator usage features (3 dimensions)	Recent operator usage frequency, optimal operator index, operator type proportion.
		Solution quality features (2 dimensions)	Solution feasibility score, solution stability score.
Distance-related Features	10	Order distance feature (1 dimension)	Normalized Kendall tau distance from the ordered sequence.
		Prerequisite relationship distance statistics (8 dimensions)	Average/median/standard deviation/minimum/maximum prerequisite distance, 25th/75th percentiles, proportion of long distances (>50).
		Sequence continuity feature (1 dimension)	Normalized average gap of consecutive concept indices (higher values indicate better continuity).
Operator Performance	30	Operator usage rate (10 dimensions)	Usage proportion of each of the 10 operators.
		Operator success rate (10 dimensions)	Improvement success ratio of each of the 10 operators.
		Operator average improvement amount (10 dimensions)	Average objective value improvement amplitude of each of the 10 operators.
Global Progress	3	Step progress (1 dimension)	Current step/maximum steps (normalized to 0–1).
		Stagnation count (1 dimension)	Normalized value of consecutive non-improvement steps (0–1, capped at 100 steps).
		Solution quality gap (1 dimension)	Quality gap between the current solution and the global optimal solution (normalized).

Note: The total of 68 dimensions is calculated for L = 10 operators (25 + 10 + 3 + 3L). This feature engineering approach extends to any operator library size L, because the operator-specific metrics (Usage, Success Rate, Avg. Improvement) scale linearly with L.
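The dimension count in the note can be checked with a short sketch. Only the 25 + 10 + 3 + 3L breakdown comes from Table A1; the names `state_dim` and `operator_block`, the `(operator_id, improvement)` record layout, and the convention that a positive improvement means a better objective value are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

BASIC_DIM = 25      # basic features (Table A1)
DISTANCE_DIM = 10   # distance-related features (Table A1)
PROGRESS_DIM = 3    # global progress features (Table A1)


def state_dim(num_operators: int) -> int:
    """Total state length: 25 + 10 + 3 + 3L for an operator library of size L."""
    return BASIC_DIM + DISTANCE_DIM + PROGRESS_DIM + 3 * num_operators


def operator_block(history: List[Tuple[int, float]], num_operators: int) -> List[float]:
    """Build the 3L operator-performance features from (operator_id, improvement) records.

    Assumed convention: improvement > 0 means the objective value got better.
    """
    uses = [0] * num_operators
    wins = [0] * num_operators
    gain = [0.0] * num_operators
    for op, improvement in history:
        uses[op] += 1
        if improvement > 0:
            wins[op] += 1
            gain[op] += improvement
    total = max(len(history), 1)  # avoid division by zero on an empty history
    usage = [u / total for u in uses]
    success = [w / u if u else 0.0 for w, u in zip(wins, uses)]
    avg_gain = [g / u if u else 0.0 for g, u in zip(gain, uses)]
    return usage + success + avg_gain
```

With L = 10, `state_dim(10)` returns 68, and adding two operators would lengthen the vector by six entries without touching the other feature groups.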

References

Gligorea, I.; Cioca, M.; Oancea, R.; Gorski, A.T.; Gorski, H.; Tudorache, P. Adaptive learning using artificial intelligence in e-learning: A literature review. Educ. Sci. 2023, 13, 1216.
Dutta, S.; Ranjan, S.; Mishra, S.; Sharma, V.; Hewage, P.; Iwendi, C. Enhancing educational adaptability: A review and analysis of AI-driven adaptive learning platforms. In Proceedings of the 2024 4th International Conference on Innovative Practices in Technology and Management (ICIPTM); IEEE: Piscataway, NJ, USA, 2024.
Lebis, A.; Humeau, J.; Fleury, A.; Lucas, F.; Vermeulen, M. Fully individualized curriculum with decaying knowledge, a new hard problem: Investigation and recommendations. Int. J. Artif. Intell. Educ. 2024, 34, 1102–1137.
Tulving, E. Ebbinghaus’s memory: What did he learn and remember? J. Exp. Psychol. Learn. Mem. Cogn. 1985, 11, 485.
Sweller, J. Cognitive load during problem solving: Effects on learning. Cogn. Sci. 1988, 12, 257–285.
Mazyavkina, N.; Sviridov, S.; Ivanov, S.; Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. Comput. Oper. Res. 2021, 134, 105400.
Elshani, L. Personalized Learning Path Generation Based on Genetic Algorithms. Bachelor’s Thesis, University for Business and Technology, Jeddah, Saudi Arabia, 2021.
Benmesbah, O.; Lamia, M.; Hafidi, M. An enhanced genetic algorithm for solving learning path adaptation problem. Educ. Inf. Technol. 2021, 26, 5237–5268.
Imamah; Yuhana, U.L.; Djunaidy, A.; Purnomo, M.H. Development of dynamic personalized learning paths based on knowledge preferences and the ant colony algorithm. IEEE Access 2024, 12, 144193–144207.
Ho, Y.C.; Pepyne, D.L. Simple explanation of the no-free-lunch theorem and its implications. J. Optim. Theory Appl. 2002, 115, 549–570.
Burke, E.K.; Gendreau, M.; Hyde, M.; Kendall, G.; Ochoa, G.; Özcan, E.; Qu, R. Hyper-heuristics: A survey of the state of the art. J. Oper. Res. Soc. 2013, 64, 1695–1724.
Wang, F.; He, Q.; Li, S. Solving combinatorial optimization problems with deep neural network: A survey. Tsinghua Sci. Technol. 2024, 29, 1266–1282.
Abbas, A.K.; Yassen, E.T. Machine Learning-Driven Tri-Level Hyper-Heuristic Selection with Adaptive Move Acceptance for Composing Medical Crew Scheduling Problem. IEEE Access 2026, 14, 37206–37232.
Xu, K.; Shen, L.; Liu, L. Enhancing column generation by reinforcement learning-based hyper-heuristic for vehicle routing and scheduling problems. Comput. Ind. Eng. 2025, 206, 111138.
Lafta, M.H. Path scheduling and target trajectory optimization in UAVs based on dragonfly and firefly algorithm. Adv. Eng. Intell. Syst. 2022, 1, 66–80.
Davies, L.; Jánošík, D. Enhanced prediction of California bearing ratio (CBR) values in geotechnical engineering using decision tree algorithm and meta-heuristic optimizations. J. Artif. Intell. Syst. Model. 2024, 2, 29–44.
Mohi Ud Din, N.; Assad, A.; Ul Sabha, S.; Rasool, M. Optimizing deep reinforcement learning in data-scarce domains: A cross-domain evaluation of double DQN and dueling DQN. Int. J. Syst. Assur. Eng. Manag. 2024, 1–12.
Gök, M. Dynamic path planning via Dueling Double Deep Q-Network (D3QN) with prioritized experience replay. Appl. Soft Comput. 2024, 158, 111503.
Bean, J.C. Genetic algorithms and random keys for sequencing and optimization. ORSA J. Comput. 1994, 6, 154–160.
Mendes, J.J.; Gonçalves, J.F.; Resende, M.G. A random key based genetic algorithm for the resource constrained project scheduling problem. Comput. Oper. Res. 2009, 36, 92–109.
Hartmann, S. A self-adapting genetic algorithm for project scheduling under resource constraints. Nav. Res. Logist. (NRL) 2002, 49, 433–448.
Hamad, D.R.; Rashid, T.A. LPBSA: Enhancing optimization efficiency through learner performance-based behavior and simulated annealing. arXiv 2024, arXiv:2501.14759.
Li, X.; Chen, N.; Ma, H.; Nie, F.; Wang, X. A parallel genetic algorithm with variable neighborhood search for the vehicle routing problem in forest fire-fighting. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14359–14375.
Bilgin, B.; Özcan, E.; Korkmaz, E.E. An experimental study on hyper-heuristics and exam timetabling. In Proceedings of the International Conference on the Practice and Theory of Automated Timetabling; Springer: Berlin/Heidelberg, Germany, 2006; pp. 394–412.
Sheng, X.; Lan, K.; Jiang, X.; Yang, J. Adaptive curriculum sequencing and education management system via group-theoretic particle swarm optimization. Systems 2023, 11, 34.
Babu, T.; Nair, R.R.; Sumanth, S.; TP, I.; KS, C. Hybrid Approach: Combining Hill Climbing and Genetic Algorithms for Traveling Salesman Problem. In Proceedings of the 2024 International Conference on IT Innovation and Knowledge Discovery (ITIKD); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5.
Shao, W.; Shao, Z.; Pi, D. Lot sizing and scheduling problem in distributed heterogeneous hybrid flow shop and learning-driven iterated local search algorithm. IEEE Trans. Autom. Sci. Eng. 2023, 21, 6483–6497.
Kendall, M.G. A new measure of rank correlation. Biometrika 1938, 30, 81–93.
Zhang, Z.Q.; Wu, Z.M.; Qian, B.; Hu, R. A reward-shaping dueling distributed multi-agent deep reinforcement learning framework for dynamic flexible job shop scheduling with random job arrivals. Expert Syst. Appl. 2025, 297, 128951.
Tian, Y.; Li, X.; Ma, H.; Zhang, X.; Tan, K.C.; Jin, Y. Deep reinforcement learning based adaptive operator selection for evolutionary multi-objective optimization. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 7, 1051–1064.
Chang, J.; Yu, D.; Hu, Y.; He, W.; Yu, H. Deep reinforcement learning for dynamic flexible job shop scheduling with random job arrival. Processes 2022, 10, 760.
Xu, H.; Zheng, J.; Huang, L.; Tao, J.; Zhang, C. Solving Dynamic Multi-Objective Flexible Job Shop Scheduling Problems Using a Dual-Level Integrated Deep Q-Network Approach. Processes 2025, 13, 386.
Hiraoka, T.; Imagawa, T.; Hashimoto, T.; Onishi, T.; Tsuruoka, Y. Dropout q-functions for doubly efficient reinforcement learning. arXiv 2021, arXiv:2110.02034.
Huang, Q.; Lu, L.; Wu, X.; Jiang, F.; Wang, X.; Wang, X. A Memetic Walrus Algorithm with Expert-guided Strategy for Adaptive Curriculum Sequencing. arXiv 2025, arXiv:2506.13092.
Li, S.; Chen, H.; Liu, X.; Li, J.; Peng, K.; Wang, Z. Online personalized learning path recommendation based on saltatory evolution ant colony optimization algorithm. Mathematics 2023, 11, 2792.
Ghoreishi, S.N.; Clausen, A.; Jørgensen, B.N. Termination Criteria in Evolutionary Algorithms: A Survey. In Proceedings of the IJCCI, New York, NY, USA, 1–3 November 2017; pp. 373–384.
Dennis, J.E., Jr.; Schnabel, R.B. Numerical Methods for Unconstrained Optimization and Nonlinear Equations; SIAM: Philadelphia, PA, USA, 1996.

Figure 1. Dueling DQN Hyper-Heuristic architecture.

Figure 2. Schematic of encoding and decoding process.

Figure 3. Flowchart of the priority-based greedy decoding.

Figure 4. Classification structure of low-level operators.

Figure 5. Internal logic and structure of the composite reward mechanism.

Figure 6. Network architecture of the proposed Dueling DQN.

Figure 7. CS knowledge point prerequisite diagram.

Figure 8. Knowledge graph hierarchical distribution analysis.

Figure 9. F_time minimization algorithm comparison.

Figure 10. The three-stage evolutionary progression of the optimal learning path.

Figure 11. Learning time minimization operator evolution.

Figure 12. Network architecture ablation convergence.

Figure 13. Ablation of reward-exploration mechanisms.

Figure 14. Algorithm convergence under four objectives.

Figure 15. Algorithm metric distribution comparison.

Figure 16. Topological structure of PsyDataset.

Figure 17. PsyDataset performance comparison.

Figure 18. Operator evolution diagram.

Figure 19. Heatmap of operator usage.

Table 1. The designed low-level heuristic operators.

Category	ID	Operator Name	Mechanism Summary
Perturbation	LLH1	Swap Perturbation	Randomly swaps k-pairs of indices
	LLH2	Adaptive Gaussian Mutation	Adjusts noise $σ$ based on success history
	LLH3	Smart Swap Mutation	Swaps indices with similar priority values
	LLH4	Biased Crossover	Inherits from the better parent with prob. 0.7
	LLH5	Multi-Point Crossover	Combines segments from two parents using m cuts
Local Search	LLH6	Hill Climbing	Greedy acceptance of random perturbations
	LLH7	Simulated Annealing	Probabilistic acceptance to escape local optima
	LLH8	Fast Local Hill Climbing	Rapid greedy refinement with small perturbations
	LLH9	Variable Neighborhood Search	Systematically increases neighborhood size $σ$
	LLH10	Iterated Local Search	Cycles of perturbation and local search
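Two of the perturbation operators in Table 1 can be sketched on a priority vector as follows. This is a minimal illustration, not the authors' code: the function names, the use of a seeded `random.Random`, and the reading of LLH4 as a per-gene choice (each position taken from the better parent with probability 0.7) are assumptions.

```python
import random
from typing import List


def swap_perturbation(priorities: List[float], k: int, rng: random.Random) -> List[float]:
    """LLH1-style move: swap k randomly chosen pairs of positions."""
    child = list(priorities)  # copy so the parent solution is left untouched
    for _ in range(k):
        i, j = rng.sample(range(len(child)), 2)  # two distinct indices
        child[i], child[j] = child[j], child[i]
    return child


def biased_crossover(
    better: List[float],
    worse: List[float],
    rng: random.Random,
    p_better: float = 0.7,
) -> List[float]:
    """LLH4-style move: each gene comes from the better parent with probability p_better."""
    return [b if rng.random() < p_better else w for b, w in zip(better, worse)]
```

Because both moves act only on the priority values, they never produce an infeasible encoding by themselves; any prerequisite handling would be left to the decoding step.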

Table 2. Definition of the 68-dimensional state feature vector.

Feature Index x	Feature Group	Specific Metrics (Summary)	Dimension
0–9	Solution State	Encoding Statistics, Objective Value Status	10
10–24	Search History	Optimization Trajectory, Action Entropy	15
25–34	Topology State	Prerequisite Distance Stats, Sequence Continuity	10
35–64	Operator History	Success Rate, Avg. Improvement, Recent Reward	30
65–67	Global Progress	Step Ratio, Stagnation Count, Optimality Gap	3
Total			68

Table 3. LLH’s cost-strategic role summary.

Linked Operator	Algorithm Type	Strategic Role	Cost
LLH1, LLH2, LLH3	Mutation	Light Perturbation	0.1
LLH4, LLH5	Crossover	Information Exchange	0.2
LLH6, LLH8	Hill Climbing	Standard Exploitation	0.5
LLH7, LLH9	SA/VNS	Deep Intensification	1.5
LLH10	ILS	Heavy Exploration	2.0
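The cost column of Table 3 maps naturally onto a lookup table. The sketch below copies the costs from Table 3; the helper names and, in particular, the linear penalty in `cost_aware_reward` with its `cost_weight` parameter are hypothetical and only illustrate how a cost term could enter a reward, not the composite reward actually used in the paper.

```python
from typing import Dict, Iterable

# Per-operator cost values copied from Table 3.
OPERATOR_COST: Dict[str, float] = {
    "LLH1": 0.1, "LLH2": 0.1, "LLH3": 0.1,   # mutation, light perturbation
    "LLH4": 0.2, "LLH5": 0.2,                # crossover, information exchange
    "LLH6": 0.5, "LLH8": 0.5,                # hill climbing, standard exploitation
    "LLH7": 1.5, "LLH9": 1.5,                # SA/VNS, deep intensification
    "LLH10": 2.0,                            # ILS, heavy exploration
}


def total_cost(sequence: Iterable[str]) -> float:
    """Sum the Table 3 cost over a sequence of applied operators."""
    return sum(OPERATOR_COST[op] for op in sequence)


def cost_aware_reward(improvement: float, operator: str, cost_weight: float = 0.01) -> float:
    """Hypothetical linear penalty: improvement minus cost_weight times operator cost."""
    return improvement - cost_weight * OPERATOR_COST[operator]
```

Under this assumed form, an expensive operator such as LLH10 has to deliver a larger improvement than a light perturbation to earn the same reward.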

Table 4. Summary of statistical metrics (average over 10 runs).

Algorithm	Mean	Std	Median	Eff.	p-Value	Cohen’s d	CV (%)
SRHH	4903.4	42.0	4904.5	0.3128	3.30 × 10⁻⁴	−2.42	0.9
GA-HH	4892.9	56.2	4892.8	0.3127	1.01 × 10⁻³	−1.917	1.1
ILS-HH	5035.2	63.7	5020.6	0.3085	1.83 × 10⁻⁴	−4.236	1.3
VNS-HH	5048.3	65.1	5041.9	0.2635	1.83 × 10⁻⁴	−4.404	1.3
MA-HH	5024.5	89.9	5013.5	0.2354	1.83 × 10⁻⁴	−3.207	1.8
PSO	5271.4	55.9	5255.5	0.1221	1.83 × 10⁻⁴	−8.945	1.1
SEACO	5816.4	51.9	5822.4	0.0768	1.83 × 10⁻⁴	−19.803	0.9
Dueling DQN-HH	4789.2	51.9	4789.3	0.3928	—	—	1.1
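The CV and Cohen's d columns of Table 4 can be recomputed from the reported means and standard deviations. The sketch below assumes that d is taken as (proposed mean minus baseline mean) divided by the equal-sample pooled standard deviation, which is a common convention and appears to reproduce the SRHH and GA-HH rows to within rounding; the source section does not state the formula, so treat this as an assumption.

```python
import math


def coefficient_of_variation(mean: float, std: float) -> float:
    """CV in percent: 100 * std / mean."""
    return 100.0 * std / mean


def cohens_d(mean_a: float, std_a: float, mean_b: float, std_b: float) -> float:
    """Cohen's d with a pooled standard deviation, assuming equal group sizes."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled


# (mean, std) pairs copied from Table 4.
proposed = (4789.2, 51.9)   # Dueling DQN-HH
srhh = (4903.4, 42.0)
ga_hh = (4892.9, 56.2)

d_srhh = cohens_d(*proposed, *srhh)      # compare with the reported -2.42
d_ga_hh = cohens_d(*proposed, *ga_hh)    # compare with the reported -1.917
cv_proposed = coefficient_of_variation(*proposed)  # compare with the reported 1.1
```

A negative d here means the proposed method reaches a lower (better) total learning time than the baseline.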

Table 5. Dueling DQN-HH ablation comparison (best converged values).

Variant	Final Time (h)	Std
Pure DQN-HH (Baseline)	4856.28	21.04
+Double DQN	4802.85	67.89
+PER (Pure)	4796.88	8.36
−PER (Dueling Only)	4756.18	22.18
Dueling DQN-HH	4751.45	26.06

Table 6. Reward mechanism ablation results.

Variant	Final Time (h)	Improvement (%)	Wall Time (s)
Full Model	$4751.45 \pm 26.06$	8	598.8
w/o Cost-Aware	$4756.18 \pm 22.18$	6	849.4
w/o Shaping	$4802.85 \pm 67.89$	7.8	742.9
w/o Exploration	$4856.28 \pm 21.04$	7.4	803.8
