This section first presents the workflow of DS-IHBO, followed by a detailed explanation of the individual modules that comprise DS-IHBO.
3.2. Module I: Removing Irrelevant Features
Building upon the SU measure, the interdependence between any two features can be quantified using Equation (5). However, solely considering feature–class relationships may overlook inter-feature redundancies. Inspired by the max-relevance and min-redundancy (mRMR) criterion [45], a relevance-redundancy index is proposed to assess feature importance. The computation involves three steps: (1) selecting the target feature; (2) establishing the subset of remaining features; and (3) calculating the index of the target feature via Equation (6).
where the normalizing term denotes the number of features in the remaining subset. This produces a vector encoding each feature's class relevance while incorporating redundancy considerations. Min-max normalization is then applied to map the relevance-redundancy index into the range [0, 1], as shown in Equation (7).
For a dataset S with D features, the relevance-redundancy index of each feature is first calculated. Features whose normalized indices exceed a specified threshold are identified as strongly relevant features and aggregated into a candidate subset. This threshold is crucial in determining the strongly relevant features. Traditionally, it is set to a very small constant; however, in some datasets all feature relevance values are relatively high. Taking the Yale_64 dataset as an instance, where typical relevance values approximate 0.5, the threshold is instead established mathematically, following Song et al. [25], in Equation (8).
where the bracket operator represents the floor (rounding-down) function. Two constraints are designed in Equation (8) to balance “avoiding missing effective features” against “removing irrelevant features”:
Constraint 1: the first term is 0.1 times the maximum value of the normalized relevance-redundancy index (which ranges from 0 to 1) in the dataset. Multiplying by 0.1 ensures that the threshold does not exceed 10% of the most relevant feature’s index, avoiding an excessively high threshold (e.g., in datasets where most indices are low, a fixed threshold such as 0.2 might filter out all valid features). For example, if the maximum normalized index is 0.5, this term limits the threshold to at most 0.05, preventing over-filtering.
Constraint 2: the second term uses a “recommended number of candidate features” derived from the dataset dimension D (a common empirical rule in high-dimensional FS to balance computational efficiency and information retention) and takes the normalized index of the feature ranked at this recommended number (with features sorted in descending order of index). This constraint aligns the threshold with the dataset’s dimension and avoids excessively low thresholds (e.g., in a high-dimensional dataset whose recommended number is 144, the threshold is at least the index of the 144th-ranked feature, preventing the retention of too many redundant features).
By taking the minimum of the two constraints, the threshold automatically adapts to the dataset’s relevance level and dimension, ensuring it is neither too high (missing valid features) nor too low (retaining noise).
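As an illustration, the relevance-redundancy computation of Module I can be sketched as follows. Since Equations (5)–(7) are not reproduced in this excerpt, the sketch assumes an mRMR-style definition — each feature's SU with the class minus its mean pairwise SU with the remaining features, followed by min-max normalization — so `relevance_redundancy` and the exact combination rule are illustrative assumptions rather than the paper's formulas.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy of a discrete sequence."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(a, b):
    """SU(a, b) = 2 * I(a; b) / (H(a) + H(b)) for discrete features."""
    ha, hb = entropy(a), entropy(b)
    hab = entropy(list(zip(a, b)))          # joint entropy H(a, b)
    mi = ha + hb - hab                      # mutual information I(a; b)
    return 0.0 if ha + hb == 0 else 2 * mi / (ha + hb)

def relevance_redundancy(features, labels):
    """Hypothetical index: SU with the class minus mean SU with the
    remaining features (mRMR-style), then min-max normalized."""
    d = len(features)
    raw = []
    for i in range(d):
        rel = symmetric_uncertainty(features[i], labels)
        red = sum(symmetric_uncertainty(features[i], features[j])
                  for j in range(d) if j != i) / (d - 1)
        raw.append(rel - red)
    lo, hi = min(raw), max(raw)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in raw]
```

Features with a normalized index above the adaptive threshold of Equation (8) would then form the strongly relevant subset.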
3.3. Module II: The Improved HBO Algorithm
The steps of the IHBO algorithm are as follows:
(i) Population initialization with asymmetric feature stratification: Traditional random initialization often exhibits strong distributional symmetry, leading to populations concentrated in limited regions of the search space and consequently reducing population diversity. To break this symmetry and introduce controlled asymmetry, a novel initialization mechanism based on the relevance-redundancy index is proposed, as detailed in Algorithm 1.
Specifically, features are first divided into three groups reflecting asymmetric relevance levels: the top 10% as high-relevance, the next 10–50% as mid-relevance, and the remaining 50% as low-relevance. This stratified division intentionally introduces structural asymmetry to prioritize features with high discriminative power. For each individual, the number of features to be initialized is randomly drawn from an interval bounded above by the total feature dimension D. All high-relevance features are symmetrically included to ensure essential information retention, while mid- and low-relevance features are asymmetrically sampled: a share of the remaining slots is filled from the mid-relevance group and the rest from the low-relevance group, with the mixing ratio drawn from a uniform distribution. This asymmetric sampling strategy effectively balances diversity and relevance, reduces redundancy, and promotes symmetric compactness within the selected feature subsets. The process repeats until a population of the specified size is constructed.
Algorithm 1 Population initialization strategy
1: Input: N: population size; D: feature dimension
2: Output: P: initial population
3: for each feature i do
4:     Calculate the relevance-redundancy index(i)
5: end for
6: Sort features according to index(i)
7: High-relevance: top 10%; mid-relevance: top 10–50%; low-relevance: bottom 50%
8: for each individual do
9:     Randomly generate the number of features to select
10:    Select all high-relevance features
11:    Randomly generate the mid/low mixing ratio
12:    Randomly select mid-relevance features and low-relevance features
13:    for each feature i do
14:        if i is selected then
15:            Set the corresponding bit of the individual to 1
16:        end if
17:    end for
18: end for
19: return initial population P
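The initialization steps above can be sketched as follows. The exact interval for the subset size, the uniform mixing ratio `beta`, and the helper names are assumptions, since Algorithm 1 does not fully specify the sampling details in this excerpt.

```python
import random

def initialize_population(index, pop_size, seed=None):
    """Sketch of Algorithm 1 (assumed details): stratify features by a
    precomputed relevance-redundancy index, always keep the top 10%,
    and mix mid-/low-relevance features by a random ratio."""
    rng = random.Random(seed)
    d = len(index)
    order = sorted(range(d), key=lambda i: index[i], reverse=True)
    high = order[: max(1, d // 10)]              # top 10%
    mid = order[max(1, d // 10): d // 2]         # next 10–50%
    low = order[d // 2:]                         # bottom 50%
    population = []
    for _ in range(pop_size):
        n_sel = rng.randint(len(high), d)        # assumed interval [|high|, D]
        beta = rng.random()                      # assumed uniform mixing ratio
        n_rest = n_sel - len(high)
        n_mid = min(len(mid), round(beta * n_rest))
        n_low = min(len(low), n_rest - n_mid)
        chosen = set(high) | set(rng.sample(mid, n_mid)) | set(rng.sample(low, n_low))
        population.append([1 if i in chosen else 0 for i in range(d)])
    return population
```

Because every individual keeps the full high-relevance stratum, the population starts from promising regions while the randomized mid/low mixing preserves diversity.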
(ii) An update strategy for the maintainer line: In HBO, the maintainer line represents a group of individuals with optimal fitness values. It is crossed with the sterile line to generate improved sterile individuals with superior traits, thereby guiding the direction of population updates. However, the original HBO does not establish an effective strategy for updating the maintainer line itself. To address this limitation, this paper improves the maintainer line’s update through a dynamic selection strategy over adaptive differential operators. Given that differential operators have multiple variants, each tailored to a distinct type of search task, the objective is to integrate the advantages of these diverse variants. Accordingly, three improved differential operator variants are proposed: a global operator for the global search phase, a transitional operator for the shift from global to local search, and a local operator for the local search phase, as shown in Equations (9)–(11).
where the scaling coefficients are random values in the range 0 to 1, and the operands are individuals randomly selected from the population, chosen so that all indices are mutually distinct. The first difference term creates a perturbation vector from the difference between two randomly chosen individuals, encouraging the search to span wide areas of the solution space. The second term adds another layer of random perturbation relative to the current individual’s position, further enhancing diversity and preventing premature convergence. The strength of this operator lies in its ability to discover new and potentially promising regions, which is critical in the early stages.
where the guiding term takes the value of the j-th dimension of the best solution in the current generation. The transitional operator incorporates both a random perturbation term for exploration and a term that guides the search toward the best solution for exploitation, providing a smooth transition from a global, explorative search to a local, exploitative one.
where F is the scaling factor that controls the transition of the population from global search to local search. The search is anchored to the best solution found so far, and the subsequent perturbation terms are relatively small variations derived from differences between random individuals. This structure concentrates the search in the promising region around the current best solution, allowing fine-tuning and convergence toward a refined optimum, which is essential in the later stages of the algorithm.
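The three operator roles can be sketched as below, assuming classic differential-evolution forms (a DE/rand-style global operator, a current-to-best transitional operator, and a best-anchored local operator); the exact Equations (9)–(11) may differ, so the coefficients and term structure here are illustrative only.

```python
import random

def de_global(pop, i, rng):
    """Exploration-oriented variant (assumed DE/rand-style): perturb by
    differences of randomly chosen, mutually distinct individuals."""
    r1, r2, r3 = rng.sample([k for k in range(len(pop)) if k != i], 3)
    c1, c2 = rng.random(), rng.random()
    return [pop[i][j]
            + c1 * (pop[r1][j] - pop[r2][j])          # wide-ranging difference
            + c2 * (pop[r3][j] - pop[i][j])           # perturbation around self
            for j in range(len(pop[i]))]

def de_transitional(pop, i, best, rng):
    """Blend of random perturbation (exploration) and attraction to the
    generation-best solution (exploitation)."""
    r1, r2 = rng.sample([k for k in range(len(pop)) if k != i], 2)
    c = rng.random()
    return [pop[i][j]
            + c * (pop[r1][j] - pop[r2][j])           # residual exploration
            + (1 - c) * (best[j] - pop[i][j])         # pull toward the best
            for j in range(len(pop[i]))]

def de_local(pop, i, best, rng, F=0.3):
    """Exploitation-oriented variant anchored at the best-so-far solution."""
    r1, r2 = rng.sample([k for k in range(len(pop)) if k != i], 2)
    return [best[j] + F * (pop[r1][j] - pop[r2][j])   # small local variation
            for j in range(len(pop[i]))]
```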
To let the algorithm use different selection probabilities for the differential strategies at different optimization stages, the probability selection scheme is defined as follows. In the early stage of optimization, a higher probability is assigned to the exploratory (global) strategy; in the middle stage, the probability gradually shifts to the balancing (transitional) strategy; and in the later stage, a higher probability is assigned to the exploitative (local) strategy to refine the solutions. This paper introduces an adaptive probability generation mechanism that dynamically produces the selection probabilities of the three differential strategies and applies roulette-wheel selection to determine which operator is used, as shown in Equations (12)–(17).
where the boundary correction parameters prevent the values before normalization from being too small or too large, the normalized probability gives the chance that the i-th differential operator is selected, and t and T denote the current and maximum iteration numbers, respectively.
Figure 3 demonstrates the trend of adaptive adjustment for the selection probabilities of the three differential strategies as the number of iterations increases.
As shown in Figure 3, in the early stage of evolution the algorithm focuses on exploring the global search space to find promising regions, so the probability of selecting the global exploration strategy is highest. As evolution progresses, the optimization process gradually transitions from global to local search; to smooth this transition, the algorithm gradually decreases the selection probability of the global exploration strategy while increasing that of the balancing strategy. In the later stage of evolution, the selection shifts to the local exploitation strategy to intensify the search in the vicinity of the optimal solution.
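The adaptive probability schedule and roulette-wheel step can be sketched as below. Equations (12)–(17) are not reproduced here, so the exponential weight curves, the clipping bounds `p_min`/`p_max`, and the schedule constants are hypothetical stand-ins that merely reproduce the early/middle/late trend described above.

```python
import math
import random

def strategy_probs(t, T, p_min=0.05, p_max=0.9):
    """Hypothetical schedule: the global operator's weight decays with t/T,
    the local operator's weight grows, and the transitional operator peaks
    mid-run; weights are clipped to [p_min, p_max] before normalization."""
    x = t / T
    w = [math.exp(-4 * x),                 # global: high early, decays
         math.exp(-8 * (x - 0.5) ** 2),    # transitional: peaks at mid-run
         math.exp(4 * (x - 1))]            # local: high late
    w = [min(max(v, p_min), p_max) for v in w]
    s = sum(w)
    return [v / s for v in w]

def roulette_wheel(probs, rng):
    """Select an operator index in proportion to its probability."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```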
(iii) Modifications to the hybridization and selfing phases: The objective of the hybridization phase is to update the sterile line. However, the original update strategy does not fully distinguish between the exploration emphasis of the early stage and the exploitation emphasis of the later stage. Therefore, this paper introduces a t-distribution mutation disturbance strategy to improve the hybridization phase. The t-distribution, also known as Student’s t-distribution, is characterized by its degrees of freedom and mutation scale. The degrees of freedom control the shape of the distribution, mainly the thickness of its tails, while the mutation scale determines the width of the distribution. Based on these characteristics, this paper improves the hybridization phase as shown in Equations (18)–(21).
The random number r of the original hybridization stage is replaced by random sampling from the t-distribution, so each dimension of a newly generated sterile individual is disturbed by a t-distributed random variable. The Gamma function appears in the t-distribution’s density. The maximum and minimum values of the mutation scale control the range of the generated random numbers: the larger the mutation scale, the wider the range, and vice versa. A growth-rate parameter controls the curvature of the scale’s change curve.
Figure 4 shows the changes in the shape of the
t-distribution under different degrees of freedom and mutation scales. When the degree of freedom is one, the
t-distribution becomes a Cauchy distribution with heavy tails. As the degree of freedom increases, the heavy tails lessen, gradually approaching a normal distribution. A larger mutation scale results in a wider
t-distribution curve, indicating more dispersed data points, while a smaller mutation scale narrows the distribution, concentrating the data points. Therefore, this paper leverages the characteristics of the
t-distribution by combining different degrees of freedom and mutation scales to control its shape, thereby managing the global and local update strategies of individuals. Specifically, in the initial iteration stages, smaller degrees of freedom and larger mutation scales make the
t-distribution similar to the Cauchy distribution with dispersed data, generating larger perturbations that favor global search. As iterations progress, increasing degrees of freedom and decreasing mutation scales cause the
t-distribution to approach a standard normal distribution with smaller variance, resulting in smaller perturbations and a preference for local search.
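This behavior can be sketched with stdlib-only t-distributed perturbations whose shape follows the described schedule; the exponential form of `mutation_schedule`, the growth rate `k`, and the bounds are assumptions standing in for Equations (18)–(21).

```python
import math
import random

def t_sample(df, scale, rng):
    """Draw from a scaled Student's t-distribution via the classic
    normal / chi-square construction (stdlib only)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)      # chi-square with df d.o.f.
    return scale * z / math.sqrt(chi2 / df)

def mutation_schedule(t, T, df_min=1.0, df_max=30.0,
                      s_min=0.1, s_max=1.0, k=5.0):
    """Assumed schedule: degrees of freedom grow (Cauchy-like -> normal-like)
    and the mutation scale shrinks as iterations progress; k is a
    hypothetical growth rate controlling the curvature."""
    x = t / T
    frac = (1 - math.exp(-k * x)) / (1 - math.exp(-k))
    df = df_min + (df_max - df_min) * frac
    scale = s_max - (s_max - s_min) * frac
    return df, scale
```

Early on (df near 1, large scale) the heavy-tailed samples produce large, exploratory perturbations; late in the run (large df, small scale) the perturbations concentrate near zero, favoring local refinement.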
The selfing phase focuses on updating the restorer line. When a restorer individual reaches the maximum number of selfings, it is considered to have fallen into a local optimum. At this point, the renewal operation is triggered for the individual, which plays a decisive role in the transition between selfing and renewal. Instead of setting a constant value in the original HBO algorithm, this paper introduces an enhanced selfing upper bound, as defined in Equation (22).
where the lower and upper parameters represent the minimum and maximum bounds for the selfing count, respectively. In early iterations the algorithm is in the global search phase and most individuals can perform effective updates within a few cycles, so the selfing upper bound is set to a large value; if an individual reaches this bound early, it has likely fallen into a local optimum, which triggers the renewal operation. As optimization progresses to later iterations, the likelihood of an individual falling into a local optimum increases, so setting the bound to a smaller value in later stages helps an individual escape local optima more quickly.
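One plausible form of this bound can be sketched as follows; since Equation (22) is not reproduced in this excerpt, the linear decrease from the maximum to the minimum bound (and the default values of `s_min`/`s_max`) is an assumption.

```python
def selfing_upper_bound(t, T, s_min=5, s_max=20):
    """Hypothetical form of Equation (22): the selfing upper bound
    decreases linearly from s_max (early, tolerant) to s_min (late,
    quick to trigger renewal of stagnated individuals)."""
    return round(s_max - (s_max - s_min) * t / T)
```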
Collectively, these three improvements are specifically designed to address the core challenges of high-dimensional FS. The asymmetric population initialization addresses the massive search space and feature redundancy: a relevance-redundancy index guides initial searches toward promising feature combinations, effectively pruning unpromising directions from the outset. The adaptive update strategy for the maintainer line delivers the required dynamic balance between exploration and exploitation. Broad exploration in the early stage is critical for navigating the vast feature space, while focused exploitation in the later stage refines the optimal feature subset. Modifications to the hybridization and selfing phases act as mechanisms to preserve population diversity and avoid local optima. The adaptive shape of the t-distribution enables large perturbations for global exploration when required, and the dynamic selfing count ensures the renewal of stagnated individuals, preventing premature convergence.
To analyze the impact of the proposed improved strategy on algorithm time complexity, a comprehensive comparison between HBO and IHBO is conducted at each stage. With population size
N, problem dimension
D, and maximum iteration number
T assumed, the results of the comparative analysis are shown in
Table 3.
Based on the analysis, IHBO’s time complexity in the population initialization stage is approximately D times that of HBO. However, since population initialization occurs only once, its impact on overall time complexity is relatively minor. In the population update stage, IHBO adds only the maintainer line update process compared with HBO, which increases time complexity by approximately 1/3 but significantly enhances the utilization rate of superior individuals and population diversity. Regarding population sorting and best solution update, both algorithms exhibit identical time complexity. Overall, IHBO marginally increases computational overhead only during population initialization and maintainer line update stages. When HBO’s population size is set to approximately 1.5 times that of IHBO, comparable total time complexity can be achieved.
3.4. Module III: Dynamic Surrogate-Assisted Mechanism
(i)
Generation of surrogate units by density-adaptive symmetric sampling: To effectively handle imbalanced data distributions while maintaining representative symmetry between majority and minority classes, an adaptive K-Nearest Neighbors (KNN) algorithm is introduced. This method dynamically adjusts the number of neighbors
K based on local density to discriminate between class-central and class-boundary samples. For a given sample
, the local density
is computed as the average distance to its
k nearest neighbors, as formalized in Equation (23).
Based on the local density, the number of neighbors for each sample is determined by a threshold, typically the median of all local densities, as given in Equation (24). Subsequently, the algorithm determines whether the nearest neighbors of a given sample all belong to the same class; if so, the sample is classified as central, otherwise as boundary, as illustrated in Figure 5.
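The central/boundary split can be sketched as follows; the concrete k values and the direction of the density-based adjustment are assumptions, as Equations (23) and (24) fix only the density definition and the median threshold.

```python
import math
from statistics import median

def neighbors(X, i, k):
    """Indices of the k nearest neighbors of sample i (Euclidean)."""
    dists = sorted((math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [j for _, j in dists[:k]]

def classify_samples(X, y, k_base=3, k_small=2):
    """Sketch of the adaptive-KNN split: each sample's local density is its
    mean distance to k_base nearest neighbors (Equation (23)); the median
    density is the threshold (Equation (24)); a sample is 'central' only
    if all of its adaptive-k neighbors share its class."""
    dens = [sum(math.dist(X[i], X[j]) for j in neighbors(X, i, k_base)) / k_base
            for i in range(len(X))]
    thr = median(dens)
    labels = []
    for i in range(len(X)):
        k_i = k_base if dens[i] <= thr else k_small   # assumed adjustment rule
        same = all(y[j] == y[i] for j in neighbors(X, i, k_i))
        labels.append("central" if same else "boundary")
    return labels
```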
In imbalanced datasets, the average number of samples per class is first calculated. Classes with more than this average are defined as majority classes, while those at or below it are identified as minority classes. The adaptive KNN algorithm is then applied to distinguish central samples from boundary samples.
Since majority classes contain substantially more samples than minority ones, using all majority class samples for training would significantly increase computational cost. To mitigate redundancy, all boundary samples are retained in their entirety, as they contain critical information for classification models, while 40% of central samples are randomly sampled, given that they are typically redundant.
Although minority classes have fewer samples, they are crucial for the model. To prevent their influence from being overshadowed by the majority classes, the average number of samples across all majority classes is first calculated, and the SMOTE [46] oversampling technique is then applied to expand each minority class to this average. All boundary samples are retained, and 40% of the central samples are randomly selected. Representative samples from both majority and minority classes are combined to form the final representative sample set, from which multiple surrogate units are constructed.
The total number of samples in each dataset is denoted as N. To balance accuracy and computational cost, six surrogate units with different sample sizes are constructed as follows:
- Step 1.
Calculate the target sample sizes for the six surrogate units based on preset proportions of the original dataset size N: 0.75N, 0.65N, 0.55N, 0.45N, 0.35N, and 0.25N;
- Step 2.
Based on the representative sample set, calculate the sample size for each category and determine the proportion of each category: category sample size/total sample size. Set this as the target category proportion to ensure that subsequent surrogate unit construction selects samples from each category according to the same proportion;
- Step 3.
Apply an agglomerative clustering algorithm to partition the representative sample set into six clusters;
- Step 4.
Each surrogate unit selects a certain number of target samples evenly from the six clusters. For example, in the first surrogate unit, the algorithm needs to select 0.75N/6 samples from each cluster;
- Step 5.
Traverse the six clusters. Within each cluster, select samples for each category based on the required number of category samples (target category proportion × target sample number from the previous step). If the cluster contains no samples of a particular category, skip that category. If the number of samples for a certain category within the cluster is less than or equal to the required category sample number, select all samples; otherwise, calculate the centroid of that category’s samples in the current cluster and prioritize selecting the required number of samples that are farthest from the centroid (i.e., most representative);
- Step 6.
After completing sample selection for each cluster, check whether the proportion of each category in the surrogate unit conforms to the target category proportion. For categories that do not conform, perform global random supplementation from the remaining unselected samples until reaching the target category proportion (i.e., achieving the target sample number for each surrogate unit).
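Steps 1–6 can be sketched as below in simplified form: the farthest-from-centroid selection of Step 5 is replaced by random sampling for brevity, and all helper names and defaults are illustrative.

```python
import random
from collections import Counter

def build_surrogate_unit(samples, labels, clusters, fraction, rng):
    """Simplified sketch of Steps 1–6: draw samples evenly from each
    cluster while preserving class proportions (Steps 2–5), then top up
    under-represented classes globally (Step 6)."""
    n_target = round(fraction * len(samples))
    class_prop = {c: n / len(samples) for c, n in Counter(labels).items()}
    chosen = set()
    cluster_ids = set(clusters)
    per_cluster = n_target // len(cluster_ids)        # even split (Step 4)
    for c in cluster_ids:
        members = [i for i in range(len(samples)) if clusters[i] == c]
        for cls, prop in class_prop.items():
            need = round(prop * per_cluster)          # target class share
            pool = [i for i in members if labels[i] == cls]
            if not pool:
                continue                              # cluster lacks this class
            take = pool if len(pool) <= need else rng.sample(pool, need)
            chosen.update(take)
    # Step 6: global top-up toward the target class proportions.
    for cls, prop in class_prop.items():
        need = round(prop * n_target) - sum(1 for i in chosen if labels[i] == cls)
        rest = [i for i in range(len(samples))
                if i not in chosen and labels[i] == cls]
        if need > 0 and rest:
            chosen.update(rng.sample(rest, min(need, len(rest))))
    return sorted(chosen)
```

Calling this with fractions 0.75, 0.65, …, 0.25 would yield the six surrogate units of Step 1.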
(ii) Dynamic symmetry-preserving surrogate assistance: The surrogate unit selection process begins with population initialization, where the original dataset S serves as the evaluation benchmark. The global optimal individual, possessing the optimal ground-truth fitness, is first identified. Each surrogate unit then predicts a fitness value for this individual, and by comparing the prediction errors against the ground truth, the most accurate surrogate unit (the one with minimal deviation) is selected for feature subset evaluation in subsequent iterations.
During the initial surrogate-assisted iterations, feature subsets are evaluated using the selected surrogate unit, which is periodically updated to stay aligned with the true dataset distribution. However, overly frequent updates may impair the population’s ability to adapt to the evolving search environment, so the surrogate unit is refreshed only when the global optimum individual shows no improvement in its real fitness over a set number of consecutive iterations. The surrogate selection mechanism follows the initialization approach described above, with the current global optimum replacing the previous one as the reference solution. Upon completing the surrogate phase, candidate solutions are evaluated against the original dataset. Among the candidate settings investigated for the number of surrogate-assisted iterations (0, 25, 50, 75, and 100), 75 proved optimal. The complete dynamic surrogate-assisted IHBO method is summarized in Algorithm 2.
Algorithm 2 Dynamic surrogate combining with IHBO
1: Input: surrogate units
2: Output: a feature subset
3: Initialize the IHBO population; find the global best using the original dataset
4: Select the most appropriate surrogate unit based on the global best
5: while the maximum iteration number is not reached do
6:     if the surrogate phase is active then
7:         Maintainer-update()
8:         Hybridization()
9:         Selfing()
10:        Evaluate individuals using the surrogate unit
11:        Update the global best; evaluate it using the original dataset
12:        if the global best’s real fitness has not improved for the allowed number of iterations then
13:            Select the most appropriate surrogate unit based on the current global best
14:        end if
15:    else
16:        Maintainer-update()
17:        Hybridization()
18:        Selfing()
19:        Evaluate individuals using the original dataset
20:        Update the global best
21:    end if
22:    t ← t + 1
23: end while
24: return the global best as the final feature subset
The computational complexity of the proposed algorithm mainly includes the following aspects: (1) the complexity of filtering out irrelevant features in Module I, which is governed by the number of original features D and the number of original samples S; (2) Module II and Module III are executed in parallel, where the complexity of IHBO is determined by the number of iterations T, the population size N, and the reduced feature count, and adopting the dynamic surrogate mechanism further lowers the per-iteration evaluation cost during the surrogate-assisted iterations. In the above, D denotes the number of features in the original data, S the number of samples in the original data, T the number of iterations, and N the population size; the first stage reduces the feature count, and a subset of the T iterations is surrogate-assisted.