1. Introduction
The circadian rhythm regulates biological processes in 24 h cycles, and its alteration is associated with sleep disorders and can be related to pathologies such as cancer [1]. The circadian clock plays a fundamental role in the regulation of sleep, and its stability is increasingly recognized as a determining factor in many biological processes that are essential for maintaining good health. Photopharmacological manipulation of a core clock protein, mammalian Cryptochrome 1 (CRY1), is an effective strategy for regulating the circadian clock, since CRY1 influences the circadian period. Supported by data analysis and advanced computational modelling techniques, this strategy has become an essential tool for detecting and characterizing compounds with an impact on circadian dynamics [2]. Recent studies have applied high-throughput screening and data analysis to identify potential drugs, taking advantage of advanced modelling and machine learning techniques [3].
Structural approaches in drug discovery optimize efficiency and reduce costs, while the incorporation of screening methods allows the elimination of inappropriate molecules, such as toxic or inactive ones. After identifying 171 molecules that target functional domains of the CRY1 protein, using structure-based drug design methods, and experimentally determining that 115 of these molecules were nontoxic, Gul et al. [4] performed a machine learning study to classify molecules by identifying features that make them toxic. They also addressed the classification of the same molecules based on their effect on CRY1. While both problems are considered challenging, only the second has been further investigated in other studies [5].
Learning to predict toxicity from these 171 molecules is complex: the dataset has many features and few examples, so it is ill-posed for traditional statistical methods, which struggle with insufficient data. In this context, overfitting becomes a significant risk, as models can easily memorize the training data instead of generalizing; we therefore treat it as an overfitting-prone problem. Tackling these handicaps constitutes an interesting challenge in machine learning [6], and it motivates us to apply improved automated feature selection while remaining fully aware of its limitations.
In this paper, we reproduce the machine learning experimentation from [4], revealing its weaknesses and evaluating the use of a more advanced feature selection process based on genetic algorithm meta-heuristic search to achieve improved and more trustworthy results.
The main contributions of this paper are as follows:
Demonstration of the need for validation: We make evident the importance of incorporating validation after training and feature selection to avoid overfitting. In addition, we provide reliable estimates of the best performance achieved with this dataset.
Genetic algorithm for automated feature selection: We propose a genetic algorithm (GA)-based framework for automated feature selection, which achieves results comparable to or better than manual Recursive Feature Elimination (RFE), reducing human effort and bias.
Test variance as a generalization criterion: We propose the use of variance across cross-validation folds as an additional objective during feature selection, promoting models that not only perform well but also generalize better across different data splits.
2. Genetic Algorithm for Feature Selection
Evolutionary algorithms, inspired by natural evolution, are designed to optimize solutions. Among them, genetic algorithms (GAs) are particularly well suited for feature selection (FS). These algorithms start with a population of individuals, each representing a possible solution. Through processes of selection, crossover, and mutation, successive generations evolve towards better solutions, until the number of generations specified by parameter N is reached (Figure 1). The following subsections describe the main aspects of the algorithm used in this work.
2.1. Evolution
The framework of the algorithm is based on the evolution of a single population of solutions over multiple generations.
Once the population has been initialized, each evolutionary cycle begins with the application of crossover and mutation operators to generate new solutions. Subsequently, the individuals in the population are evaluated and selected, based on their performance in the target task, to form the next generation with the objective of progressively improving the quality of the solutions.
Figure 1 illustrates the main steps of the algorithm.
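To make the cycle concrete, the sketch below outlines the generational loop in plain Python. All names (the initialization, evaluation, and operator functions) are placeholders passed in as arguments; this is a generic sketch of the flow in Figure 1, not the actual implementation used in this work.

```python
def evolve(init_population, evaluate, crossover, mutate, select, n_generations):
    """Generic GA cycle: initialize, then repeat variation -> evaluation -> selection."""
    population = init_population()
    fitness = [evaluate(ind) for ind in population]
    for _ in range(n_generations):
        # Crossover and mutation generate new candidate solutions.
        offspring = [mutate(child) for child in crossover(population)]
        offspring_fitness = [evaluate(ind) for ind in offspring]
        # Evaluation and selection over parents and offspring form the next generation.
        population, fitness = select(population + offspring,
                                     fitness + offspring_fitness)
    return population, fitness
```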
2.2. Encoding
The encoding of solutions determines how potential solutions to the problem are represented. Each solution is encoded as a genotype, which serves as an abstract representation of the solution. To evaluate its quality, the genotype is decoded into its corresponding phenotype, representing the actual interpretation within the problem domain. The effectiveness of the encoding scheme directly influences the algorithm’s ability to explore and optimize solutions efficiently.
In this work, the chosen encoding scheme is binary. Each individual in the population represents a possible solution to the problem, which is described as a vector of length n. Each value in the vector corresponds to a feature of the problem, with a 1 indicating that the feature is selected and a 0 otherwise.
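As an illustration, consider a hypothetical problem with only 10 features instead of the full descriptor set; decoding a genotype simply means keeping the columns whose bit is set to 1.

```python
import numpy as np

# Genotype: one bit per feature; 1 means the feature is selected.
genotype = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])

# Phenotype: the indices of the selected features ...
selected_idx = np.flatnonzero(genotype)     # -> array([0, 3, 4, 8])

# ... and the corresponding reduced feature matrix.
X = np.random.rand(171, 10)                 # placeholder data, not the real descriptors
X_subset = X[:, selected_idx]               # shape (171, 4)
```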
2.3. Initialization
In an evolutionary algorithm, the initial population is generated randomly, assigning to each individual an initial configuration that represents a possible solution to the problem. However, the nature of the problem significantly influences the optimal configuration of individuals. Depending on the dimensionality of the dataset and the complexity of the problem, the expected number of selected features may vary considerably.
GAs typically initialize individuals with a uniform probability of inclusion or exclusion of each feature, resulting in an expected selection of 50% of the features. However, in high-dimensional datasets, this strategy generates very large initial solutions, which are prone to overfitting and significantly slow down the evaluation of individuals. Since the GA will naturally adjust the number of selected features based on their impact on performance, a more efficient strategy is to initialize with a smaller number of features when working with high-dimensional spaces [7].
To strike a balance between simplicity and generality, the initialization strategy used in our algorithm is based on an adjustable inclusion likelihood provided as a parameter [8].
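A minimal DEAP-based sketch of this initialization is shown below; the inclusion probability P_INCLUDE is an illustrative placeholder, not the parameter setting used in the experiments.

```python
import random
from deap import base, creator, tools

N_FEATURES = 1203   # dimensionality of the Toxicity dataset
P_INCLUDE = 0.05    # illustrative inclusion likelihood (adjustable parameter)

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
# Each gene is 1 with probability P_INCLUDE, so an initial individual selects
# roughly P_INCLUDE * N_FEATURES features instead of ~50% of them.
toolbox.register("attr_bit", lambda: int(random.random() < P_INCLUDE))
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_bit, n=N_FEATURES)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

population = toolbox.population(n=50)
print(sum(population[0]))   # number of features selected by the first individual
```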
2.4. Crossover Operator
The crossover operation combines encoded individuals to explore potentially better solutions. In this work, two-point crossover, a method widely used in GAs, is employed to generate the next generation within the population. This mechanism exchanges a segment of the genotypic vector between two parent individuals, bounded by two randomly selected points.
Figure 2 illustrates this process. The probability of selecting segments of different sizes and their location within the genotype is uniform, ensuring a balanced exploration of the search space without introducing biases in the optimization.
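As a small illustration, DEAP provides this operator directly as tools.cxTwoPoint; the toy parents below are hypothetical 8-bit genotypes.

```python
from deap import tools

parent1 = [0, 1, 1, 0, 0, 1, 0, 1]
parent2 = [1, 0, 0, 1, 1, 0, 1, 0]

# cxTwoPoint picks two random cut points and swaps the segment between them.
child1, child2 = tools.cxTwoPoint(parent1[:], parent2[:])
print(child1)
print(child2)
```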
2.5. Mutation Operator
The mutation operation on encoded individuals introduces random modifications in the population to preserve genetic diversity. This mechanism is crucial for preventing premature convergence, allowing the algorithm to explore a broader search space and reducing the risk of getting trapped in local optima. In this study, the bit-flip mutation operator is employed, a widely used technique in binary-encoded GAs. This operator selectively alters certain binary values within the genotype of each individual, flipping them with a predefined probability, which serves as a control parameter for the mutation intensity.
Figure 3 illustrates this process.
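A minimal example with DEAP's tools.mutFlipBit is shown below; the per-bit mutation probability indpb is an illustrative value only.

```python
from deap import tools

individual = [0, 1, 1, 0, 0, 1, 0, 1]

# Each bit is flipped independently with probability indpb.
mutant, = tools.mutFlipBit(individual[:], indpb=0.1)
print(mutant)
```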
2.6. Fitness Evaluation
Fitness evaluation in a GA is the process of assessing how well each individual in the population solves the given problem. The fitness function quantitatively measures the quality of each solution based on a predefined criterion. In the case of feature selection problems, this criterion is the efficiency of the selected feature subset. The fitness score determines the likelihood of an individual being selected for reproduction, guiding the evolutionary process toward optimal solutions over successive generations.
In this work, we use the wrapper approach with 10-fold cross-validation to evaluate the quality of each individual (feature subset). The fitness of an individual is determined by training a machine learning model using the selected features, represented by the genotype, and evaluating its performance based on the accuracy metric. Furthermore, to encourage compact feature subsets, we include a penalty term to balance performance and feature reduction (Equation (1)):

$$\text{Fitness}_1 = \text{Effectiveness} - w \cdot \frac{F_s}{F_t} \tag{1}$$

where $w$ is the weight parameter, $F_s$ is the number of selected features, $F_t$ is the total number of features in the dataset, and Effectiveness is the average accuracy across all test folds.
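The snippet below sketches how such a wrapper fitness can be computed with scikit-learn; the choice of RandomForestClassifier and the exact way the penalty is combined with the mean accuracy are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def fitness_eq1(individual, X, y, w=0.3):
    """Wrapper fitness following Equation (1): mean 10-fold CV accuracy minus a
    penalty proportional to the fraction of selected features (sketch)."""
    mask = np.asarray(individual, dtype=bool)
    if not mask.any():                       # an empty subset cannot be evaluated
        return (0.0,)
    model = RandomForestClassifier(random_state=0)
    scores = cross_val_score(model, X[:, mask], y, cv=10, scoring="accuracy")
    effectiveness = scores.mean()            # average accuracy across test folds
    penalty = w * mask.sum() / mask.size     # weighted fraction of selected features
    return (effectiveness - penalty,)        # DEAP expects a tuple of objectives
```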
To mitigate overfitting, we define a second fitness function (Equation (2)), which incorporates the variance of accuracy as an additional penalty factor:

$$\text{Fitness}_2 = \text{Effectiveness} - w \cdot \frac{F_s}{F_t} - \operatorname{Var}(\text{Acc}) \tag{2}$$

where $\operatorname{Var}(\text{Acc})$ is the variance of the accuracy across the test folds. The smaller the variance in accuracy, the more stable the model’s behavior, meaning it performs consistently across all test runs. A lower variance indicates that the model generalizes well, reducing the likelihood of overfitting. This stability suggests that the algorithm is not overly dependent on specific training data patterns. Our hypothesis is that this will make it a better generalizer, more reliable for real-world applications.
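Under the same assumptions as the previous sketch, the variance-penalized fitness of Equation (2) only adds the fold-to-fold variance of the accuracy as an extra subtracted term.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def fitness_eq2(individual, X, y, w=0.3):
    """Variance-penalized wrapper fitness (Equation (2), sketch): the variance of
    the per-fold accuracies is subtracted to favor stable feature subsets."""
    mask = np.asarray(individual, dtype=bool)
    if not mask.any():
        return (0.0,)
    model = RandomForestClassifier(random_state=0)
    scores = cross_val_score(model, X[:, mask], y, cv=10, scoring="accuracy")
    return (scores.mean() - w * mask.sum() / mask.size - scores.var(),)
```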
2.7. Selection
Selection in a GA is the process of choosing individuals from the current population to create offspring for the next generation. The selection method directly influences the convergence and performance of the algorithm by favoring individuals with higher fitness scores while maintaining genetic diversity. The goal of selection is to balance exploration and exploitation, ensuring that the algorithm effectively searches the solution space while avoiding premature convergence.
In this work, we employ a binary tournament selection approach, where two individuals are randomly selected from the population, and the one with the higher fitness value is chosen as a parent for the next generation. This process is repeated until the required number of parents is selected. Binary tournament selection balances exploration and exploitation of the search space, allowing the most promising solutions to propagate across generations while preserving genetic diversity.
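A standalone sketch of binary tournament selection is shown below, using plain Python and toy fitness values; the DEAP library used in this work offers the same behavior through its tournament selection operator with a tournament size of two.

```python
import random

def binary_tournament(population, fitnesses, k):
    """For each of the k parent slots, draw two individuals at random and keep
    the one with the higher fitness."""
    parents = []
    for _ in range(k):
        i, j = random.sample(range(len(population)), 2)
        parents.append(population[i] if fitnesses[i] >= fitnesses[j] else population[j])
    return parents

# Toy usage: four candidate feature masks with their fitness scores.
population = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
fitnesses = [0.70, 0.65, 0.72, 0.60]
print(binary_tournament(population, fitnesses, k=4))
```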
4. Material and Methods
This section describes the experimental methodology used to reproduce the previous experimentation and to evaluate the performance of feature selection with a GA meta-heuristic search on the Toxicity classification problem. In this study, we aimed to validate three main hypotheses. First, we hypothesize that the proposed validation method will demonstrate the presence of overfitting in the results reported by the previous study. Second, we aim to improve the results of the compared approach by leveraging a GA. Third, we hypothesize that using the variance of accuracy as a penalty factor will yield better generalization. The experiments were designed to assess the effectiveness of our approach under different conditions. We provide details on the dataset characteristics, the classifiers used in the experiments, the execution environment, and the parameters adopted in the GA.
4.1. Dataset
The dataset comes from the research by Gul et al. [4]. Two problems are addressed in that research; we focus on the first one, detecting toxicity from molecular descriptors. The dataset is available in the UCI repository [11] under the name Toxicity.
The Toxicity dataset was developed to evaluate the toxicity of molecules designed to interact with CRY1, a protein central to the regulation of the circadian rhythm. This biological clock influences numerous physiological processes and its disruption has been associated with diseases such as cancer and metabolic disorders.
The dataset (Table 1) contains molecular descriptors for 171 molecules obtained by computational calculations, which can be used to train machine learning models capable of predicting whether a molecule is toxic or non-toxic. Each molecule in the dataset is represented by 1203 molecular descriptors, which include physicochemical, topological, and structural properties. Examples of descriptors include the following:
Physicochemical properties: Molecular mass, logP (partition coefficient), number of hydrogen bonds.
Topological descriptors: Molecular connectivity indices, number of cycles in the structure.
Electronic properties: Energy of orbitals, electrostatic potential.
These descriptors are generated by computational chemistry software and are commonly used in toxicity prediction models.
Non-toxic, the majority class, accounts for 67.25% of the instances. Therefore, a classifier that always predicts the majority class would achieve an accuracy of 67.25%. As a result, a good performance from any algorithm should exceed this baseline to demonstrate its effectiveness in distinguishing between classes.
4.2. Classifiers
The DTC, RFC, ETC, and XGBC classifiers used in [4] and kNN were employed for the experimentation. While the former are all tree-based methods, kNN was included as a non-tree-based, instance-based learning algorithm to provide a comparative perspective and evaluate its performance under the same feature selection strategies.
kNN (k-Nearest Neighbors) is a nonparametric, instance-based learning algorithm used for classification and regression tasks [12,13]. It classifies a new instance by considering the k closest training examples in the feature space, typically measured using the Euclidean distance or other distance metrics. The predicted class is determined by a majority vote among the nearest neighbors.
DTC (Decision Tree Classifier) is a nonparametric supervised learning algorithm, which is used for both classification and regression tasks [14]. It has a hierarchical tree structure, consisting of a root node, branches, internal nodes, and leaf nodes.
RFC (Random Forest Classifier) is a classifier that relies on combining a large number of uncorrelated and weak decision trees to arrive at a single result [15].
ETC (Extra Trees Classifier) is a classification algorithm based on decision trees, similar to RFC but with more randomization [16]. Instead of searching for the best splits at each node, it randomly selects features and cutoff values. This makes it faster and less prone to overfitting. It builds a set of decision trees and makes predictions by majority vote.
XGBC (Extreme Gradient Boosting Classifier) is a machine learning algorithm based on boosting, which builds sequential decision trees to correct the errors of previous trees [17]. It is efficient, fast, and avoids overfitting thanks to its regularization. It uses gradient descent to optimize the model and is known for its high classification accuracy. However, it can be complex to tune and is less interpretable than other models.
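For reference, the five classifiers can be instantiated as follows, assuming scikit-learn for the first four and the xgboost package for XGBC; the default settings shown here are only a sketch, since the values actually used come from the grid search described later.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier

# The five classifiers compared in this work, with illustrative default settings.
classifiers = {
    "kNN":  KNeighborsClassifier(n_neighbors=1),
    "DTC":  DecisionTreeClassifier(random_state=0),
    "RFC":  RandomForestClassifier(random_state=0),
    "ETC":  ExtraTreesClassifier(random_state=0),
    "XGBC": XGBClassifier(random_state=0),
}
```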
4.3. Development and Running Environment
The GA for feature selection was programmed in Python, using the DEAP (Distributed Evolutionary Algorithms in Python) library, an evolutionary computation framework for rapid prototyping and testing of ideas [18] that seeks to make algorithms explicit and data structures transparent. Likewise, we used Scikit-learn [19], a free-software machine learning library for the Python programming language. Experiments were run on a cluster of 6 nodes with Intel Xeon E5420 CPUs at 2.50 GHz, under the Ubuntu 22.04 GNU/Linux operating system.
4.4. Experimental Parameters
The experimental setup initially attempted to mimic the original study that defined the dataset [4]. The same grid search using train–test 10-fold cross-validation was applied to choose the parameter values from those described in Table 2. This included the parameters for the kNN classifier, which was incorporated into our experimentation to introduce diversity by considering a proximity-based classifier.
To evaluate the classification performance, the original study used the same internal ten-partition cross-validation employed for grid search optimization and, to obtain stable results, repeated the experimentation 100 times. However, given that the number of combinations evaluated by grid search is in the order of tens of thousands, overfitting is likely to occur, potentially leading to an overestimation of the expected accuracy. To address this issue, instead of repeating the experimentation 100 times, we used nested cross-validation with 100 outer folds. In this way, the test results were expected to be similar, as only a very small number of instances are omitted in each run, while providing a reliable estimate of the expected performance. The whole process was repeated 10 times to ensure a stable final validation result. This evaluation is illustrated in Figure 4.
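The snippet below sketches this nested evaluation with scikit-learn, using a decision tree and a hypothetical parameter grid (the real grids are those of Table 2, and random placeholder data stand in for the 13-feature dataset). The inner 10-fold grid search selects the parameters, while the outer 100-fold loop, which leaves out one or two instances per fold, produces the validation estimate.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data with the shape of the reduced Toxicity dataset (171 x 13).
rng = np.random.default_rng(0)
X = rng.random((171, 13))
y = rng.integers(0, 2, size=171)

# Inner loop: 10-fold grid search over an illustrative parameter grid.
param_grid = {"max_depth": [3, 5, None], "criterion": ["gini", "entropy"]}
inner_search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                            cv=10, scoring="accuracy")

# Outer loop: 100 folds of one or two held-out instances each, never used for
# parameter selection, yield the validation accuracy.
outer_cv = KFold(n_splits=100, shuffle=True, random_state=0)
validation_scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring="accuracy")
print(validation_scores.mean())
```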
After reproducing the experimentation with the 13 features selected in the original study, the GA was applied with the goal of finding a better feature set that fits more effective classifiers. To configure the parameters of the GA, based on previous experimentation, we adopted the values proposed in [20], which are detailed in Table 3. Since using grid search inside the wrapper cross-validation (used as the fitness measure) is infeasible, the default scikit-learn parameters were used for the classifiers, and 1 nearest neighbor was used for kNN.
In order to find a good setup for the feature selection GA, the parameter values shown in Table 4 were tested.
As part of our comparative analysis, we also executed Differential Evolution (DE). Its parameter settings are summarized in Table 5. The control parameters F and CR were selected from commonly used default values in the DE literature, based on preliminary experiments on other datasets. The initialization strategy was aligned with that of the GA, after observing that the typical default initialization value of 0.5 yielded inferior results during preliminary testing. To ensure a comparable computational effort, the same population size and number of generations were used for both algorithms.
5. Results Analysis
As an initial step, we replicated the experimental evaluation presented in [4], using the same 13 selected features and sticking to the same classification processes and software for all classifiers, including the kNN classifier. This replication ensures methodological consistency and enables a direct comparative analysis with the original results. The only difference is that we transformed the 100 repetitions into a second-level 100-fold cross-validation by leaving out one or two instances per repetition. Given the small number of omitted instances, their impact on the results is expected to be negligible, while this establishes a confident approach to evaluate potential improvements and assess the robustness of the models.
The results achieved using Recursive Feature Elimination (RFE) for each classifier in the original experimentation are shown in the first column of Table 6. The original authors then performed a second RFE and selected 13 features as the most relevant; this is shown in the second column and is the part of the experimentation we reproduced. The classification results obtained with this subset of features in the reproduced experimentation are shown in the following columns. It is unclear to us exactly what value was reported in the original proposal. We believe it must be the average test accuracy of the best model found, not the average over the 100 runs of the average test accuracy, because that value is closer to the best-model column in Table 6 (which is the best model according to the average of 100 repetitions of 10-fold CV test accuracies). With that consideration, the results obtained seem similar, confirming the validity of their experimentation.
Nevertheless, it is important to note that validation indicates that the model’s expected accuracy is around 74%, which is lower than the reported best test average of 79.63% from the original study (and the 77.65% from the reproduced equivalent result). It seems that, by repeating the train–test process many times, the feature selection and the parameters of the classifiers may be overfit to the test data.
The DTC classifier achieved the highest validation accuracy (74.09%), demonstrating its effectiveness when trained with the selected features. However, the performance drop observed for the remaining classifiers between testing and validation suggests that the selected features may primarily favor the DTC, potentially limiting their usefulness across different models. Furthermore, it is important to note that the performance of the other classifiers does not exceed the baseline of the majority class, indicating that these models provide very poor discriminative power. To highlight this, the models with validation accuracy above the majority-class rate are highlighted in green. These results highlight the need for further analysis of the feature selection and learning process to ensure robustness and generalization.
In the second phase of our study, the GA is used to perform feature selection and evaluate its impact on the different classifiers. As an example, the evolution process of one run of the GA is illustrated in Figure 5.
The results in Table 7 show that the 16 features selected by the GA wrapping RFC achieved the highest validation accuracy (71.64%) using a penalty factor of 0.3, demonstrating that the evolutionary approach can improve the generalization capacity of the models studied, as this result is much higher than those achieved for RFC using RFE in the previous study. In contrast, the DTC classifier, which had previously obtained the highest test accuracy, experienced a significant drop in validation, evidencing possible overfitting to the training set, probably because it uses too many features.
In the third phase, we test the hypothesis that the feature sets found when using Equation (2) are less prone to overfitting. Table 8 shows the results of the same experiments presented in Table 7, but using the new fitness calculation with a penalization for non-homogeneous accuracy among test partitions. The results seem to improve, but the differences are small and cannot be considered conclusive. For this reason, we think that more research is needed to find strategies to avoid overfitting in these challenging high-dimensional datasets with few instances.
Table 9 compares the different methods analyzed, including RFE, the GA approach, the variance-penalty version of the GA, and DE, applied to the five classifiers kNN, DTC, RFC, ETC, and XGBC. The results show that with RFE, DTC achieved the best validation accuracy (74.09%), separating itself from the other classifiers. However, with the GA and the GA with variance penalty, RFC and XGBC showed superior performance, especially with penalties of 0.3 and 0.5. In particular, RFC achieved a remarkable validation performance (71.64%) with 16 features selected using the GA approach, while XGBC reached its maximum performance (70.35%) with 9 features and a penalty of 0.5. However, kNN showed inferior performance in most cases, although a higher penalty (0.7) improved its accuracy by selecting only four features. These results suggest that the GA and the GA with variance penalty are effective in improving generalization, especially for ensemble classifiers such as RFC and XGBC, while DTC remains the best classifier with the 13 features selected in the original study. With a similar running time, DE did not improve any of the results obtained with RFE or the GA.
Although accuracy was used as the primary metric for comparison, the F1-score was also calculated to provide another assessment of model performance, given that this dataset has a moderate class imbalance, which is common in medical datasets. This additional metric allows for an assessment of the balance between precision and recall in the classifiers evaluated [21]. Therefore, its inclusion complements the accuracy metric by highlighting the trade-off between false positives and false negatives, which is especially relevant in medical decision-making contexts.
The F1-score ($F_1$) is defined as the harmonic mean of precision and recall, providing a single measure that balances both concerns, as shown in Equation (5),

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$

where Precision is the ratio of true positive predictions to the total number of positive predictions made, as shown in Equation (6),

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$

and Recall is the ratio of true positive predictions to the total number of actual positive instances, as shown in Equation (7).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$
$F_1$ ranges from 0 to 1, where 1 indicates perfect precision and recall and 0 indicates the worst performance. As we consider Toxic to be the positive class, the baseline established by majority classifiers, which we set out to beat, is given by Equation (8) or (9),

$$F_1^{\text{baseline}} = \frac{2p}{1 + p} \tag{8}$$

$$F_1^{\text{baseline}} = 0 \tag{9}$$

where $p$ is the probability of the majority class (0.6725); Equation (8) applies when the majority class is the positive one and Equation (9) when it is the negative one. As in this case the majority class is the negative class (Non-toxic), any classifier with $F_1$ over 0 performs better than a majority classifier with respect to detecting toxicity.
The results in Table 10 show that the best-performing classifiers coincide with those identified by the accuracy metric. The values show that there is still much room for improvement on this challenging dataset.
6. Conclusions
The problem addressed is highly challenging, not only because it belongs to the class of high-dimensional datasets with few instances but also because, after extensive experimentation, achieving a genuine improvement in generalization over the majority class rate with high confidence appears to be highly unlikely.
After reproducing the experiments from the paper that introduced the problem, using an independent validation, we found that the actual expected accuracy is 4% lower than the reported test accuracy. This highlights the importance of avoiding repeated testing on the same data, as it increases the likelihood of obtaining a model that performs well under a specific test setup but fails to generalize.
Using the proposed GA to automate the FS process appears promising as, although it has not been able to improve the best model found using DTC in the original study, it has improved the FS performed with RFE for most of the classification models.
The use of the proposed variance penalty in the GA’s fitness function seems promising, as it achieved better generalization results than the non-penalized version in several cases. However, more research is needed to fine-tune it and to confirm its performance on different datasets.