This study uses an improved Dueling Deep Q-Network (Dueling DQN) as the high-level controller of the hyper-heuristic method. Because the operator selection space is discrete and the learning process must handle delayed feedback over long decision sequences, a value-based Dueling DQN suits this task better than policy-based methods. The model does not modify the solution directly; instead, it selects a suitable heuristic operator from the operator library based on the current search state.
3.4.2. Action Space
To endow the intelligent agent with the capability to flexibly schedule operators, let
represent all operations required by the
. The action space
A is defined as the union of the perturbation class and local search class action sets. Let the perturbation class operator set be
, and the local search class operator set be
. The action space
A is formulated as
When an agent selects action
, it directly triggers the corresponding operator to execute. This mechanism enables agents to adaptively generate arbitrary sequences of operators in the form
, effectively breaking through the structural constraints of traditional fixed “global–local” operator combinations. Furthermore, based on computational complexity, each action is assigned a specific strategic role and normalized computational cost, as shown in
Table 3. Cost values reflect differences in time complexity: lightweight perturbation operators carry minimal cost so that they can be tried rapidly, while heavyweight global search operators incur high overhead. The cost parameters are predefined manually and fine-tuned during experimentation. The algorithm uses a cost penalty mechanism so that heavyweight operators are invoked only when their expected return is sufficiently high.
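The action-space construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Operator` class, the three toy operators, and their cost values are all hypothetical placeholders, and the solution is assumed to be a simple list of integers.

```python
from dataclasses import dataclass
from typing import Callable, List

Solution = List[int]  # assumed solution encoding, for illustration only


@dataclass(frozen=True)
class Operator:
    name: str
    kind: str    # "perturbation" or "local_search"
    cost: float  # normalized computational cost in [0, 1] (hypothetical values)
    apply: Callable[[Solution], Solution]


def swap_first_two(sol: Solution) -> Solution:
    out = list(sol)
    if len(out) >= 2:
        out[0], out[1] = out[1], out[0]
    return out


def reverse_all(sol: Solution) -> Solution:
    return list(reversed(sol))


def sort_ascending(sol: Solution) -> Solution:
    return sorted(sol)


# Two operator classes; the action space is their union.
PERTURBATION = [
    Operator("swap", "perturbation", 0.1, swap_first_two),
    Operator("reverse", "perturbation", 0.2, reverse_all),
]
LOCAL_SEARCH = [
    Operator("sort", "local_search", 0.9, sort_ascending),
]
ACTION_SPACE = PERTURBATION + LOCAL_SEARCH


def run_sequence(solution: Solution, action_indices: List[int]) -> Solution:
    """Apply operators in whatever order the agent chose them."""
    for idx in action_indices:
        solution = ACTION_SPACE[idx].apply(solution)
    return solution
```

Because every operator is addressed by a single index into one flat list, the agent is free to emit any interleaving of perturbation and local search steps, for example `run_sequence([3, 1, 2], [0, 2, 1])`.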
3.4.3. Reward Function
Traditional reward functions typically rely solely on the improvement in the objective function (
), which often leads agents to over-depend on computationally intensive heavyweight operators or become trapped in local optima during search plateaus due to insufficient feedback. To address this, this study designs a reward function mechanism centered on cost awareness. By incorporating nonlinear normalized returns and a multi-level exploration mechanism, the agent is guided to seek an optimal balance between “return–cost” and “exploration–exploitation.” The total reward
is composed of the following five weighted components:
The core mechanism is the cost-aware main reward
, designed to address the issue of vastly differing computational costs among operators. This paper proposes a cost penalty mechanism based on a nonlinear normalized improvement rate. Unlike simple linear interpolation, this mechanism is more sensitive to minor improvements while smoothing out substantial ones. Let
and
denote the objective function values before and after executing an operator, respectively.
is defined as
The primary reward
is defined as a piecewise function to handle different optimization outcomes:
where
denotes the normalized computational cost of operator
, reflecting the intrinsic variance in temporal complexity across different heuristics. The term
serves as the cost sensitivity factor, which modulates the agent’s responsiveness to the consumption of computational resources.
The mathematical rationale for this penalty term is to transform the reward from a single-objective improvement measure into a “resource-efficiency” metric. This transition is inspired by the efficiency-centric evaluation frameworks utilized in other complex optimization tasks, such as the power-balanced trajectory planning in UAVs [
15] and the redundancy-constrained parameter searches in engineering modeling [
16]. By adopting this “benefit-to-cost” assessment logic, when
, the square root term smooths the normalized improvement to ensure lightweight operators remain competitive with heavyweight ones, optimizing computational cost-effectiveness. For non-improving cases, the reward distinguishes between resource waste (
) and solution deterioration (
). In the latter, the penalty is proportional only to the degree of degradation. Notably, cost penalties are not superimposed here to prevent network collapse caused by double negative feedback.
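The three cases of the cost-aware main reward can be illustrated with a short sketch. The exact formula is not reproduced here, so the code below rests on explicit assumptions: a minimization objective, an improvement rate normalized by the previous objective value, a square-root reward minus a linear cost term for improvements, and a cost-scaled penalty for wasted calls. The names `cost_sensitivity` and `waste_penalty` and their default values are hypothetical.

```python
import math


def main_reward(f_before: float, f_after: float, cost: float,
                cost_sensitivity: float = 0.5,
                waste_penalty: float = 0.1,
                eps: float = 1e-12) -> float:
    """Piecewise cost-aware reward (illustrative form, minimization assumed)."""
    # Assumed normalization: relative change with respect to the previous value.
    improvement = (f_before - f_after) / max(abs(f_before), eps)
    if improvement > 0:
        # Square root boosts small gains and flattens large ones;
        # the operator's cost is then subtracted.
        return math.sqrt(improvement) - cost_sensitivity * cost
    if improvement == 0:
        # Resource waste: no change in the objective, penalty scales with cost.
        return -waste_penalty * cost
    # Deterioration: penalty depends only on the degree of degradation,
    # with no additional cost term.
    return improvement
```

Under these assumptions, a 4% improvement earns `sqrt(0.04) = 0.2` before cost, so an operator with cost 0.2 keeps a reward of about 0.1, while an operator with cost 0.8 ends up negative for the same gain.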
In addition to the main reward, the reward design includes a shaping reward
[
29] to reduce the problem of sparse feedback in reinforcement learning. This shaping reward has two parts: a distance-based reward
and a global proximity reward
, which help guide the learning process during the search.
where
denotes the reduction in the average forward dependency distance, and
represents the historical best objective value achieved along the current search trajectory. This term provides the agent with gradient guidance when the objective function value does not change significantly.
To maintain population diversity and to identify effective operators, the reward design includes an exploration reward based on operator usage and a behavior reward based on success rate. The exploration reward
is set according to the usage rate
of each operator within a sliding window, and this value changes during the search.
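A sliding-window usage tracker of this kind might look like the sketch below. The window length and the linear bonus `scale * (1 - usage_rate)` are assumptions made for illustration; the functional form used in this study may differ.

```python
from collections import deque


class UsageWindow:
    """Tracks how often each operator was chosen over the last `window` steps."""

    def __init__(self, window: int = 50):
        # A bounded deque drops the oldest action automatically.
        self.history = deque(maxlen=window)

    def record(self, action: int) -> None:
        self.history.append(action)

    def usage_rate(self, action: int) -> float:
        if not self.history:
            return 0.0
        hits = sum(1 for a in self.history if a == action)
        return hits / len(self.history)

    def exploration_reward(self, action: int, scale: float = 1.0) -> float:
        # Assumed form: rarely used operators receive a larger bonus.
        return scale * (1.0 - self.usage_rate(action))
```

Since the window slides, an operator that was overused early on regains its exploration bonus once its past calls fall out of the window.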
The system also forms a closed-loop feedback mechanism through behavioral rewards
and dynamic penalties
. Specifically, the agent rewards operators exhibiting high success rates and significant historical improvements based on real-time performance tracking. Conversely, it imposes dynamic penalties on operators that fail to improve the solution or trigger the rollback mechanism due to catastrophic degradation, thereby preventing training instability [
30].
is used to reward the long-term stability of operators. It is defined as:
where
represents the operator success rate.
represents the cumulative call count of the operator.
denotes the historical average improvement.
where
is the expected improvement, i.e., the historical average gain across all operators, which ties the reward benchmark to the search difficulty;
and
denote relative thresholds defining performance tiers to identify superior operators; and
and
represent reward scaling factors that modulate the magnitude of this behavioral component within the total reward
.
is used to constrain inefficient operators in real-time and includes the blacklist mechanism and rollback penalty. It is defined as:
and
are defined to address long-term inefficiency and instantaneous deterioration, respectively.
where
denotes the recent success rate of operator
, and
is the trigger threshold (e.g.,
), ensuring the penalty is triggered only after sufficient statistical samples are accumulated.
where
denotes the tolerance deviation.
The internal logic of the composite reward mechanism is visualized in
Figure 5. It illustrates how raw inputs—such as performance gains (
), operator costs (
), and historical statistics—are processed by five parallel modules and aggregated into the final reward
. This structure ensures the agent receives dense, multi-dimensional feedback to guide the optimization.
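The aggregation of the five modules into one scalar can be sketched as a weighted sum. The component keys and the weight values in the usage note are hypothetical; only the five-component weighted structure is taken from the text.

```python
from typing import Dict

# Hypothetical keys for the five reward modules described in this section.
COMPONENTS = ("main", "shaping", "exploration", "behavior", "penalty")


def total_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the five reward components."""
    missing = [k for k in COMPONENTS if k not in components or k not in weights]
    if missing:
        raise KeyError(f"missing reward components or weights: {missing}")
    # The penalty component is expected to be non-positive, so it lowers the sum.
    return sum(weights[k] * components[k] for k in COMPONENTS)
```

For instance, with components `{"main": 0.1, "shaping": 0.02, "exploration": 0.5, "behavior": 0.2, "penalty": -0.3}` and weights `{"main": 1.0, "shaping": 0.5, "exploration": 0.1, "behavior": 0.3, "penalty": 1.0}`, the penalty outweighs the positive terms and the total is about -0.08.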
3.4.4. Network Architecture
The Dueling DQN framework in this study improves learning efficiency in high-dimensional state spaces where reward signals are sparse by separating state value learning from action advantage learning. To reduce training instability and limit value fluctuation [
31,
32], the framework uses two deep neural networks with the same structure, including a policy network
that is updated during training and a target network
that is updated at fixed intervals. Both networks follow the dueling design, and each network maps the normalized state vector
to Q-values over the action set
A.
Figure 6 shows the detailed structure of the network.
During the feature extraction phase, the network first maps the raw states to nonlinear high-order feature representations through a shared multilayer perceptron. Let
denote the mapping function of the shared layer, parameterized by
.
represents the set of all trainable parameters (weights and biases) in the shared layers. This process combines the ReLU activation function with dropout regularization [
33] to enhance the model’s generalization capability. The mathematical expression is:
where
is the resulting abstract feature vector.
Subsequently, to overcome the limitation of traditional DQN in distinguishing between state value and action advantage, the state
is first processed by the shared layer to extract the abstract feature vector
, which is then fed into the dual-stream branching architecture of the Dueling framework. The value stream employs the transformation
to estimate the scalar state value
, while the advantage stream uses the transformation
to estimate the advantage vector
. The calculation formulas for both are as follows:
Finally, to ensure the unique identifiability of
V and
A, the network employs a mean-centered constraint at the output to aggregate the two streams. The final action value function
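$Q(s, a; \theta)$ is defined as:
$$Q(s, a; \theta) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A(s, a') \right)$$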
where
represents the complete set of trainable parameters, and
denotes the size of the action space. The variable
is a placeholder used to iterate over all possible operators in the action space during the calculation of the average advantage. By subtracting the mean of the advantage values, this aggregation formula forces
to directly approximate the actual value of the state. This helps the network converge quickly even in states where the choice of action makes little difference to the outcome.
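The mean-centered aggregation can be checked with a few lines of plain Python. This sketch takes the two stream outputs as given numbers rather than network outputs, and the function name is chosen here for illustration.

```python
from typing import List


def dueling_q_values(state_value: float, advantages: List[float]) -> List[float]:
    """Combine a scalar state value and per-action advantages into Q-values.

    Subtracting the mean advantage removes the degree of freedom that would
    otherwise let a constant shift move between the two streams.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]
```

With `state_value = 2.0`, the advantage vectors `[1.0, 0.0, -1.0]` and `[11.0, 10.0, 9.0]` both produce `[3.0, 2.0, 1.0]`: a constant offset in the advantage stream has no effect, and the mean of the Q-values equals the state value.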
3.4.5. Training Strategy
This experiment employs a Dueling DQN algorithm integrated with Prioritized Experience Replay (PER) to optimize network parameters
. By minimizing temporal difference errors, the algorithm leverages the structural advantages of the dueling architecture to mitigate value fluctuation risks and improve sample efficiency. The target value
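$y_t$ is formulated based on the standard DQN mechanism:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-})$$
where $r_t$ denotes the immediate reward obtained by the agent after executing action $a_t$ in state $s_t$, $\theta^{-}$ denotes the parameters of the target network, and $\gamma$ is the discount factor. The loss function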
consists of a weighted mean squared error and an entropy regularization term
. The entropy coefficient
is introduced to encourage exploration and prevent premature convergence:
where
B is the mini-batch size, and
is the action probability distribution derived from the
Q-values. In the experience replay module, the probability $P(i)$ of sampling a transition $i$ is proportional to its priority $p_i = |\delta_i| + \epsilon$, where $\delta_i$ represents the TD-error. Transitions with larger errors receive higher replay priority:
$$P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}}$$
where $\epsilon$ is a small positive constant ensuring all samples have a non-zero probability of being selected and $\alpha$ controls the strength of prioritization. To correct the distribution bias introduced by non-uniform sampling, an importance sampling weight $w_i = (N \cdot P(i))^{-\beta}$ is incorporated during gradient updates, where $N$ is the current buffer size and $\beta$ is the bias correction coefficient. The complete Dueling DQN-HH training and parameter update process is summarized in Algorithm 1.
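The quantities used in one training update can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: it follows the standard prioritized replay formulation, the hyperparameter defaults are placeholders, the `done` flag for terminal transitions is an added convenience, and the entropy regularization term of the loss is omitted because its exact form is not reproduced here.

```python
import random
from typing import List, Sequence


def td_target(reward: float, next_q_target: Sequence[float],
              gamma: float = 0.99, done: bool = False) -> float:
    """Standard DQN target: reward plus discounted max target-network Q-value."""
    return reward if done else reward + gamma * max(next_q_target)


def sampling_probabilities(td_errors: Sequence[float],
                           alpha: float = 0.6, eps: float = 1e-3) -> List[float]:
    """P(i) proportional to (|delta_i| + eps) ** alpha."""
    scaled = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(scaled)
    return [p / total for p in scaled]


def importance_weights(probs: Sequence[float], beta: float = 0.4) -> List[float]:
    """w_i = (N * P(i)) ** (-beta), with N the current buffer size."""
    n = len(probs)
    return [(n * p) ** (-beta) for p in probs]


def sample_batch(probs: Sequence[float], batch_size: int,
                 rng: random.Random) -> List[int]:
    """Draw transition indices with replacement according to P(i)."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


def weighted_mse(td_errors: Sequence[float], weights: Sequence[float]) -> float:
    """Importance-weighted mean squared TD-error over a mini-batch."""
    return sum(w * d * d for w, d in zip(weights, td_errors)) / len(td_errors)
```

Setting `alpha = 0` recovers uniform sampling, and setting `beta = 0` turns off the bias correction, which makes both parameters easy to check in isolation.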
| Algorithm 1 Dueling DQN-HH training and parameter update process |
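- Require: Max episodes $M$, batch size $B$, steps per episode $T$
- 1: Initialize policy network $\theta$, target network $\theta^{-} \leftarrow \theta$, buffer $D$
- 2: for episode $= 1$ to $M$ do
- 3: Initialize state $s_1$
- 4: for step $t = 1$ to $T$ do
- 5: Select $a_t$ based on $Q(s_t, \cdot; \theta)$
- 6: Execute $a_t$, observe $r_t$, $s_{t+1}$
- 7: Store $(s_t, a_t, r_t, s_{t+1})$ in $D$
- 8: if $|D| \geq B$ then
- 9: Sample batch by $P(i)$
- 10: Compute $y$
- 11: Calculate $L(\theta)$
- 12: Update $\theta$
- 13: end if
- 14:
- 15: end for
- 16: end for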
|