Binary or Integer Chromosome: Which Is the Best Structure for Supervised Machine Learning Using Genetic Algorithms?

Alves, Alexandre Henrick da Silva; Neto, Guilherme Antonio Coelho; Gomes, Matheus de Souza; Santos, Líbia Diniz; Bertarini, Pedro Luiz Lima; do Amaral, Laurence Rodrigues

doi:10.3390/app15052608

by Alexandre Henrick da Silva Alves ¹, Guilherme Antonio Coelho Neto ¹, Matheus de Souza Gomes ², Líbia Diniz Santos ³, Pedro Luiz Lima Bertarini ⁴ and Laurence Rodrigues do Amaral ¹,*

¹ Faculty of Computation, Federal University of Uberlândia, Patos de Minas Campus, Patos de Minas 38701-002, Brazil

² Biotechnology Institute, Federal University of Uberlândia, Patos de Minas Campus, Patos de Minas 38700-128, Brazil

³ Faculty of Chemical Engineering, Federal University of Uberlândia, Patos de Minas Campus, Patos de Minas 38700-002, Brazil

⁴ Faculty of Electrical Engineering, Federal University of Uberlândia, Patos de Minas Campus, Patos de Minas 38700-002, Brazil

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(5), 2608; https://doi.org/10.3390/app15052608

Submission received: 10 January 2025 / Revised: 19 February 2025 / Accepted: 22 February 2025 / Published: 28 February 2025

(This article belongs to the Special Issue Intelligent Data Analysis with the Evolutionary Computation Methods)

Abstract

Supervised machine learning is widely researched nowadays. Several works have already used genetic algorithms (GAs) for classification tasks by evolving IF-THEN classification rules. These methods are often built using integer and real values in their chromosome structure. In this paper, new and important improvements are proposed to the Non-linear Computation Evolutionary Environment (NLCEE), a GA-based rule-set generator proposed by Amaral and Hruschka. The proposed GA, called BIN-NLCEE, uses a binary representation in its chromosome structure to simplify mutation and to enlarge the search space. The main goal is to have a rule-set generator that produces simple and interpretable classification rules with good accuracy values and better convergence rates. The performance of BIN-NLCEE was compared with that of two other GA-based classifiers and four traditional classifiers on five medical domain datasets. The results showed a better convergence rate and higher fitness values for BIN-NLCEE when compared with the GA-based CEE and NLCEE. In 20 comparisons, BIN-NLCEE achieved better results in 9 (45%), and, according to the confidence interval, equivalent results were obtained in 11 (55%). In this way, BIN-NLCEE was better than or equal to NLCEE and CEE in 100% of the comparisons. Also, BIN-NLCEE outperformed all traditional classifiers, i.e., it achieved better results in 100% of comparisons.

Keywords:

supervised machine learning; genetic algorithms; binary chromosome; integer and real chromosome; IF-THEN rules; nonlinear datasets

1. Introduction

Genetic algorithms (GAs) are computational methods of search and optimization based on the principles of evolution by natural selection [1]. GAs’ operators are based on the mechanisms of genetics and heredity [2], allowing for an efficient search space exploration. They are commonly used when the search space is very large, and they can often find good solutions.

Data mining (DM) is an interdisciplinary domain focused on leveraging methods and tools to uncover valuable insights and patterns within large datasets. Among its diverse applications, the classification task remains one of the most extensively explored areas, aiming to assign predefined labels to data instances based on their attributes [3].

The integration of artificial intelligence (AI) and machine learning (ML) into healthcare and decision support systems has gained significant attention in recent years. Kocadagli and collaborators [4] proposed a hybrid ML approach incorporating genetic algorithms for feature selection to evaluate COVID-19 severity based on clinical and hematological parameters. Similarly, Rahimi and collaborators [5] introduced a quantum-inspired interpretable AI system for early rheumatoid arthritis detection, emphasizing the need for explainability in AI-driven medical diagnostics. Mahya and Fürnkranz [6] conducted an empirical study comparing post hoc explanations with directly interpretable models, demonstrating that interpretable models can often approximate black box models with comparable fidelity. These studies collectively highlight the importance of model interpretability and robustness in AI applications, reinforcing the need for explainable ML techniques in critical decision-making systems.

Over the years, various classification techniques have been proposed and refined, including support vector machines (SVM) and artificial neural networks (ANN). Despite their popularity, these approaches are typically categorized as black box models, as they generate outputs without providing interpretable explanations of the kind offered by decision trees or IF-THEN rules. In recent years, several works have used GAs as a method for classification, as can be seen in the following descriptions:

In [7], the study presented a hybrid classification approach that combined genetic algorithms (GAs) with the k-nearest neighbors (k-NN) algorithm. The GA was employed to optimize real-valued weights for individual attributes in the training dataset, refining feature importance in the classification process. The proposed hybrid model was evaluated on three test cases, demonstrating superior classification performance compared to the traditional k-NN algorithm in all instances. This approach highlights the effectiveness of evolutionary computation in improving distance-based classification methods.

In [8], the study introduced a full-memory genetic algorithm (GA) that integrated task-specific inference rules from inductive learning. By aligning GA operations with problem-specific heuristics, the approach enhanced concept learning and domain knowledge incorporation. Initial results demonstrate its effectiveness, and the framework can be extended to other problem-solving domains.

In [9], the study presented a classification algorithm based on genetic algorithms (GAs) designed to extract interpretable IF-THEN rules for data mining tasks. The proposed GA employed a flexible chromosome encoding, allowing for a fixed genotype while maintaining a variable phenotype. The effectiveness of the approach was demonstrated through experiments on medical datasets, proving its capability to generate human-readable rules for classification.

In [10], the computational evolutionary environment (CEE) was introduced, a genetic algorithm-based method for extracting classification rules from biological datasets. The primary objective of CEE is to balance classification accuracy, interpretability, and comprehensibility by discovering concise, high-level rules from biological databases. The results indicate that CEE can generate useful knowledge that traditional classifiers fail to extract.

In [11], GANEL was proposed, a genetic algorithm (GA) that incorporates never-ending learning (NEL) principles for gene ontology classification. This approach aims to generate concise, high-level classification rules that capture stronger biological patterns while balancing accuracy and interpretability. The results demonstrate that GANEL outperforms CEE, showing higher accuracy with a smaller attribute set, thus enhancing biological data analysis.

In [12], a new genetic algorithm operator, named Transgenic, inspired by genetically modified organisms (GMOs) was proposed. The operator artificially inserted relevant features into chromosomes, enhancing evolutionary convergence. By leveraging both a priori knowledge and historical data, Transgenic enabled GAs to identify and prioritize important attributes, leading to better performance in rule classification tasks compared to traditional GA operators.

In [13], MDRGA (multiple disjunctions rule genetic algorithm) was proposed, a GA tailored for inducing non-linear IF-THEN classification rules. MDRGA is designed to improve classification accuracy while maintaining interpretability. The approach was evaluated against traditional classifiers and GA-based methods (CEE, NLCEE) on non-linear datasets, consistently achieving superior classification accuracy.

In [14], the study presented a hybrid approach integrating GAs with feature selection to identify relevant gene subsets and derive high-level classification rules for cancer datasets (NCI60). The results demonstrated that the selected gene sets provided strong predictive correlations with disease classes while maintaining interpretability and surpassing traditional classification techniques in accuracy.

In [15], the research proposed an architecture combining GAs and traditional classifiers (decision trees, naive Bayes) for assisting biocurators in gene ontology (GO) classification. The system balanced accuracy, interpretability, and comprehensibility, achieving high classification rates while reducing manual annotation effort.

In [16], the study extended the computational evolutionary environment (CEE) by incorporating information gain (IG) for attribute selection, resulting in the IG-CEE method. IG-CEE improves classification accuracy and convergence rates by focusing on the most relevant attributes for rule construction. Comparisons of the CEE to traditional classifiers demonstrated superior classification performance.

In [17], the study presented a novel approach for designing antimicrobial peptides (AMPs) by integrating GAs with rough set theory, a transparent machine learning method. The proposed system employed supervised learning and in vitro bacterial assays to optimize peptide activity against S. epidermidis, a strain associated with implant infections. This is the first application of codon-based genetic algorithms combined with rough set theory for computational peptide sequence design.

In [18], the study proposed GADIC, a novel classification algorithm combining data importance (DI) reformatting and GAs to improve machine learning classifier performance. Evaluated on five classifiers (SVM, KNN, LR, DT, and NB) and seven datasets, GADIC significantly enhanced accuracy, with improvements ranging from 5.96% to 16.79%, except for a minor 1% decrease in NB on one dataset. KNN showed the highest performance gain, while LR had the lowest improvement.

As noted by Amaral and Hruschka Junior [11], achieving high accuracy is essential but is not the sole objective in classification tasks. Interpretability and comprehensibility are equally critical, especially in practical applications where understanding the reasoning behind predictions is vital.

In [10], a GA was proposed to mine IF-THEN understandable rules. The experiments were conducted using the gene ontology (GO) [19] database for the classification of gene products into GO categories based on structured attributes related to biological process, cellular component, and molecular function, and they showed promising results. The GA proposed was named the computational evolutionary environment (CEE), and it was able to produce good and understandable rules. In [20], an extension of the CEE was proposed, called the non-linear computation evolutionary environment (NLCEE), to produce nonlinear classification rules. After that, Matos and Amaral [13] proposed a new GA for nonlinear classification, extending the NLCEE. The proposed algorithm was called the multiple disjunctions rule genetic algorithm (MDRGA), and it brought a new and more complex chromosome representation.

In all cited works, the GAs were implemented using integer and real values in their chromosomes. Thus, seeking to improve on these GAs, a refined NLCEE algorithm that uses a binary representation is proposed. The proposed version is called the binary nonlinear computational evolutionary environment (BIN-NLCEE).

The main goals of this work are to verify the effects of the binary representation by analyzing convergence and accuracy rates, and to create a refined rule-set generator. For comparison purposes, the CEE and NLCEE were implemented, and their results were compared with those of BIN-NLCEE using five medical domain datasets. Four datasets were obtained from the UCI Machine Learning Repository, and one was built by our team using real-world samples. More details about the datasets are given in Section 3. Furthermore, four traditional classifiers were compared with the GA-based methods, namely J48, IBK, naive Bayes, and SVM.

The paper is organized as follows: Section 2 shows how GAs are used for IF-THEN classification rules mining. Section 3 presents the datasets used in this work. Section 4 contains the description of the proposed algorithm and the main contributions of this scientific research. In Section 5, the results are presented, and a comparative analysis with the CEE, NLCEE, and other traditional methods is conducted. Last, in Section 6, the conclusions and future works are discussed.

2. GAs for Classification Task

2.1. Computational Evolutionary Environment (CEE)

Fidelis and colleagues [9] introduced a genetic algorithm (GA) capable of generating IF-THEN classification rules, a development that laid the foundation for subsequent advancements in GA-based methods. One notable example is the computational evolutionary environment (CEE) [10], designed specifically to extract IF-THEN rules from the gene ontology (GO) database. Its chromosome structure was inspired by the earlier work of Fidelis and collaborators, with each gene representing a distinct attribute of the dataset.

In CEE, the chromosome structure is composed of three components for each gene: weight (W_i), operator (O_i), and value (V_i), as shown in Figure 1. In the CEE approach, the dataset used contained 34 attributes, which is why the notation ‘Gene₃₄’ is employed in Figure 1 to accurately represent the number of features in the dataset. These components together define conditions for the antecedent (IF part) of a rule, while the entire chromosome represents the complete antecedent. The weight field (W_i) determines the inclusion of a condition based on its threshold, enabling flexibility in rule generation. If the weight surpasses the threshold, the condition is incorporated; otherwise, it is omitted.

In CEE, the weight field can be an integer between 1 and 10 (W_i = 1,…, 10), and its threshold is usually 8 or 9. The operator (O_i) field determines which relational operator will appear in the condition, and the two possible operators are ≥ and <. Finally, the value (V_i) field represents a numeric value in the attribute domain.
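To make the gene-to-rule mapping concrete, the following minimal sketch decodes a small chromosome into the IF part of a rule. The function name `decode_rule`, the attribute names, and the example values are hypothetical. Two choices are assumptions rather than details given by the authors: the conditions are joined with AND, and "surpasses the threshold" is read as a strict comparison.

```python
from typing import List, Tuple

# One gene per attribute: (weight W_i, operator O_i, value V_i).
Gene = Tuple[int, str, float]

def decode_rule(chromosome: List[Gene], names: List[str], threshold: int = 8) -> str:
    """Build the antecedent from the genes whose weight passes the threshold."""
    conditions = [
        f"{name} {op} {value}"
        for name, (weight, op, value) in zip(names, chromosome)
        if weight > threshold  # assumption: strict comparison
    ]
    # Assumption: active conditions are combined by conjunction.
    return " AND ".join(conditions) if conditions else "(empty antecedent)"

# Hypothetical three-attribute chromosome; the second gene is switched off.
chromosome = [(9, ">=", 5.0), (3, "<", 2.5), (10, "<", 7.0)]
names = ["attr_1", "attr_2", "attr_3"]
rule = decode_rule(chromosome, names)
print("IF", rule)
```

Because the weight acts as an on/off switch, the genotype keeps a fixed length while the number of conditions in the decoded rule varies.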

The GA iteratively searches for optimal classification rules by executing n runs, one for each target class. The evaluation of individuals (rules) relies on a fitness function (FF) that quantifies their classification performance. When evaluating a rule for a class C, four outcomes are possible depending on the prediction [21]:

True Positive (tp): The predicted class matches C, and the case belongs to C;
True Negative (tn): The predicted class does not match C, and the case does not belong to C;
False Positive (fp): The predicted class matches C, but the case does not belong to C;
False Negative (fn): The predicted class does not match C, but the case belongs to C.

To enhance classification performance, the FF combines two widely used medical domain indicators: sensitivity (Se) and specificity (Sp), which are mathematically defined as

Se = \frac{tp}{tp + fn}    (1)

Sp = \frac{tn}{tn + fp}    (2)

The FF aims to maximize these metrics simultaneously, as demonstrated in Equation (3). Consequently, CEE operates as a binary classifier, treating all non-target classes as a single group (one-vs-all strategy) during each execution.

fitness = \frac{Se + Sp}{2}    (3)
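As a rough illustration of Equations (1) to (3), the sketch below counts the four outcomes for a rule evaluated against a target class C and returns the fitness. The function name `rule_fitness` and the toy data are hypothetical, and treating an undefined ratio as 0 is an assumption, since the text does not say how empty denominators are handled.

```python
from typing import Sequence

def rule_fitness(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Return (Se + Sp) / 2, where True means 'class C' in both sequences."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    # Assumption: a ratio with a zero denominator contributes 0.
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    return (se + sp) / 2

# Toy example: tp = 2, fp = 1, tn = 2, fn = 0, so Se = 1 and Sp = 2/3.
predicted = [True, True, False, False, True]
actual = [True, False, False, False, True]
score = rule_fitness(predicted, actual)
print(round(100 * score, 2))  # reported as a percentage, as in the results tables
```

Averaging Se and Sp means that a rule cannot reach a high fitness by always predicting the majority class, which matters for unbalanced medical datasets.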

2.2. Nonlinear Computational Evolutionary Environment (NLCEE)

The NLCEE, proposed in [20], aims to classify datasets with a nonlinear distribution. The only difference between CEE and NLCEE is the gene structure. As shown in Figure 2, each NLCEE gene has two parts, left and right, which allow the gene to represent the upper and lower ranges of an attribute or the interval between two values.

Therefore, the i-th gene now has six fields. The threshold of the weight field was 9, and the operators and FF were the same as in CEE [10].

3. Datasets

In this work, five medical domain datasets were used: breast-w, Pima-Indians-diabetes, Parkinson’s [22], mammography mass, and COVID-19. The first four datasets can be downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml (accessed on 9 January 2025)) [23].

The COVID-19 dataset was built by our team using the public data shared by the Fleury Group in Brazil. The original data can be downloaded from (https://repositoriodatasharingfapesp.uspdigital.usp.br/handle/item/99 (accessed on 9 January 2025)). We completed several preprocessing tasks to select and format the data for the data mining classification task. Our team also conducted a thorough literature review to identify the main traditional clinical exams that can be used to classify patients correctly. The attributes of the dataset are related to white blood cells, red blood cells, mean corpuscular volume, serum ferritin, erythrocyte sedimentation rate, C-reactive protein, albumin, lactate dehydrogenase, hematocrit, and platelets.

Table 1 shows a summary of all datasets used in this study. The Pima-Indians-diabetes and mammography mass datasets are referred to here as diabetes and mammography, respectively. The breast-w dataset contains information about breast cancer images. It has 32 attributes and 569 samples, classified into 2 classes named M (malignant) and B (benign). The diabetes dataset aims to predict whether or not a patient has diabetes based on the available variables. All samples are from female patients at least 21 years old. It contains 9 attributes and 768 samples distributed in 2 classes, tested_negative and tested_positive. The Parkinson’s dataset contains voice measurements from 31 patients. It contains 23 attributes, 195 samples, and 2 classes, healthy and PD (Parkinson’s disease). The mammography dataset aims to predict the severity of a mass lesion seen in a mammogram. It contains 5 attributes, 961 samples, and 2 classes called benign and malignant. Finally, the COVID-19 dataset has 7 attributes, 3158 samples, and 2 classes (class 1 (negative) and class 2 (positive)). Tables 2–6 show the sample distribution for each class of each dataset.

4. Binary Nonlinear Computational Evolutionary Environment (BIN-NLCEE)

4.1. Introduction

The evolutionary environment proposed in this paper builds upon the non-linear computation evolutionary environment (NLCEE) [20], introducing a refinement through the use of binary representation. The primary contribution of this approach lies in its ability to enhance the search process by allowing simple bit-flip mutations, which significantly impact the exploration of the solution space and help avoid local minima. This characteristic improves the algorithm’s robustness while maintaining computational efficiency. Additionally, the binary encoding simplifies both the interpretability and the implementation of the method, making it more accessible and adaptable to different classification tasks. The following sections will provide a more detailed discussion of these advantages.

4.2. Individual Representation

For this proposal, each field of a gene is a binary value. In the operator field, the value 0 represents ≥ and the value 1 represents <. Converting a positive integer value to binary brings some limitations. In CEE and NLCEE, the weight value is a positive integer between 1 and 10, and no fixed number of bits maps exactly onto this interval: three bits are enough for the values in [1…7], but the values in [8…10] need four bits, and four bits encode 16 patterns, more than the 10 valid weights.

To overcome this problem, two different approaches were proposed. For the weight field, the values are now on a scale [0…≈1] (W_i = w_{i,j} | w_{i,j} = 0…≈1), and the conversion is achieved using Equation (4).

d = \sum b_{i} 2^{-i}    (4)

where b_i is the bit digit. Now, the number of values of the weight field can be 2ⁿ, where n is the number of bits used to represent the value. In this work, the number of bits used was 6, i.e., 2⁶ possible values.
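A short sketch of the conversion in Equation (4) is given below. It assumes that the index i runs from 1 to n over the bits from left to right, which is the reading that keeps the result on the [0…≈1] scale. The function name `bits_to_fraction` is hypothetical.

```python
from typing import Sequence

def bits_to_fraction(bits: Sequence[int]) -> float:
    """Decode a bit string as d = sum of b_i * 2^(-i), with i starting at 1."""
    return sum(b * 2.0 ** -i for i, b in enumerate(bits, start=1))

# With n = 6 bits there are 2**6 = 64 values, from 0 up to 63/64.
largest = bits_to_fraction([1, 1, 1, 1, 1, 1])
half = bits_to_fraction([1, 0, 0, 0, 0, 0])
print(largest, half)  # 0.984375 0.5
```

The largest representable value is 1 − 2⁻ⁿ, which is why the upper end of the scale is written as ≈1 rather than 1.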

The value field is also on a scale [0…≈1], and the conversion is conducted with Equation (4). Since the value field holds an attribute domain value, the dataset attributes need to be transformed, and this transformation is achieved using a min–max normalization, as can be seen in Equation (5).

x' = \frac{x - \min(A)}{\max(A) - \min(A)}    (5)

In Equation (5), x is the original attribute (A) value. Therefore, the value field also has 2ⁿ possible values. Figure 3 shows an example of conversion used in this work. The conversion and the normalization are only used to calculate the fitness of each individual. To facilitate understanding, only the left part of the chromosome is represented in Figure 3. The right part of the chromosome has the same structure as the left side.
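The normalization in Equation (5) can be sketched as follows, so that attribute values and decoded value fields share the same [0, 1] scale. The function name `min_max_normalize` and the sample column are hypothetical, and mapping a constant attribute to 0 is an assumption that the text does not address.

```python
from typing import List, Sequence

def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Rescale one attribute column to [0, 1] using its own min and max."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant attribute is mapped to 0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical attribute column (for example, ages in years).
ages = [21, 30, 81]
scaled = min_max_normalize(ages)
print(scaled)
```

After this step, a condition such as "attribute ≥ V_i" can be checked directly against the decoded binary value field.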

As the NLCEE fields have different types, its mutation operator needs three different approaches, while the new binary representation needs only a simple random bit-flip strategy. With n bits, each field can take 2ⁿ values, which enlarges the search space and adds more exploration possibilities. Another difference between NLCEE and BIN-NLCEE is in the mutation rate. In NLCEE, the mutation rate is set for the individuals and the population, i.e., a certain percentage of the population will be mutated. In BIN-NLCEE, however, the rate is set only for the individuals, so all individuals are equally likely to be mutated.
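A minimal sketch of a bit-flip mutation with a per-individual rate is shown below. The function name `mutate` is hypothetical, and flipping exactly one randomly chosen bit per mutated individual is an assumption, since the text does not state how many bits are changed.

```python
import random
from typing import List

def mutate(individual: List[int], rate: float, rng: random.Random) -> List[int]:
    """With probability `rate`, flip one randomly chosen bit of a copy."""
    child = list(individual)
    if rng.random() < rate:
        position = rng.randrange(len(child))
        child[position] ^= 1  # 0 becomes 1, 1 becomes 0
    return child

rng = random.Random(27)
parent = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
child = mutate(parent, rate=0.40, rng=rng)
changed = sum(p != c for p, c in zip(parent, child))
print(changed)  # 0 or 1, depending on the random draw
```

The same operator works for the weight, operator, and value fields, because every field is stored as bits.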

4.3. Genetic Operators and Parameters

In this work, selection used a stochastic tournament with tour size 3, and crossover used a 2-point operator with 100% probability. The mutation operator simply flips the bit value, since the binary representation is now used. The weight field threshold was set to 0.8, the population had 100 individuals, the mutation rate was 40% (the same rate used by NLCEE), and the GA ran for 100 generations. This configuration was obtained after testing various configurations, such as

Population size: 50 and 100;
Generations: 50 and 100;
Mutation rate (%): 40 and 50;
Selection method: Roulette and Stochastic tournament;
Weight field: 6, 7, and 8;
Number of bits: 5, 6, 7, and 8.
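The selection and crossover operators named above can be sketched as follows. The function names are hypothetical, and the tournament shown here draws its three contenders uniformly at random, which is an assumption: the text does not detail how the stochastic tournament picks its contenders.

```python
import random
from typing import List, Sequence, Tuple

def tournament(population: Sequence[List[int]], fitnesses: Sequence[float],
               rng: random.Random, size: int = 3) -> List[int]:
    """Pick `size` distinct contenders and return the fittest one."""
    contenders = rng.sample(range(len(population)), size)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]

def two_point_crossover(a: List[int], b: List[int],
                        rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap the segment between two distinct cut points (needs len >= 3)."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

rng = random.Random(27)
population = [[0] * 12, [1] * 12, [0, 1] * 6, [1, 0] * 6]
fitnesses = [0.61, 0.83, 0.79, 0.70]
parent_a = tournament(population, fitnesses, rng)
parent_b = tournament(population, fitnesses, rng)
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)
print(child_a, child_b)
```

With a 100% crossover probability, every selected pair of parents would go through `two_point_crossover` before mutation.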

To define the best configuration, the environment was run 100 times for each possible configuration (based on the aforementioned list) using different random seeds, and the obtained results were compared considering a confidence interval test using α = 0.05 (95% confidence). As a validation method, k-fold cross-validation was used with k = 10.
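To give a sense of the size of this tuning experiment, the sketch below enumerates the listed options as a full factorial grid. Whether every combination was actually crossed is an assumption, and the dictionary keys are hypothetical labels for the listed parameters.

```python
from itertools import product

# Options copied from the list of tested configurations.
grid = {
    "population_size": [50, 100],
    "generations": [50, 100],
    "mutation_rate_pct": [40, 50],
    "selection": ["roulette", "stochastic_tournament"],
    "weight_field": [6, 7, 8],
    "n_bits": [5, 6, 7, 8],
}

# One dictionary per combination of options.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
runs_per_target = len(configs) * 100  # 100 random seeds per configuration
print(len(configs), runs_per_target)  # 192 19200
```

Under this assumption, each dataset and class would require 19,200 GA runs, each of them evaluated with 10-fold cross-validation.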

5. Results

Table 7 and Table 8 show the results for the best configuration of CEE, NLCEE, and BIN-NLCEE for all datasets and their classes. Since the methods based on GA run n times for n classes, and the datasets used in this work are binary classification, the results are presented separately for each class. The first column shows the dataset and the class used to find the classification rules. The remaining columns show the fitness results in the format (AverageValue \pm ConfidenceValue) of the 100 executions for each method. Full results of the UCI datasets can be seen in (https://figshare.com/articles/figure/11562996 (accessed on 9 January 2025)) and for the COVID-19 dataset in (https://figshare.com/articles/figure/13650476 (accessed on 9 January 2025)).

As can be seen, good fitness values were obtained in all three methods. For the breast-w dataset, BIN-NLCEE obtained better results than CEE, with a fitness of 97.89 (ranging from 97.54 to 98.24) for the malignant class and 98.44 (98.17 to 98.73) for the benign class. CEE obtained 96.45 (96.03 to 96.87) for malignant and 97.46 (97.13 to 97.79) for benign. NLCEE obtained a fitness of 97.18 (ranging from 96.81 to 97.55) for the malignant class and 98.09 (97.76 to 98.42) for the benign class, which is statistically equal to BIN-NLCEE.

For the diabetes dataset, BIN-NLCEE obtained better results than NLCEE for both classes. BIN-NLCEE achieved a fitness of 83.34 in the tested_negative class (ranging from 82.69 to 83.99) and 82.34 (81.7 to 82.98) in the tested_positive class. NLCEE achieved a fitness value of 82.00 (81.35 to 82.65) in the tested_negative class and 80.93 (80.27 to 81.59) in the tested_positive class. If compared with CEE results, BIN-NLCEE also obtained better results in both classes, where CEE achieved a fitness of 80.92 (80.26 to 81.58) in the tested_negative class and 80.21 (79.49 to 80.93) in the tested_positive class.

For the Parkinson’s dataset, BIN-NLCEE came close to a statistically better result than NLCEE for the healthy class: it obtained a fitness of 99.27 (ranging from 98.94 to 99.60), while NLCEE achieved 98.53 (98.11 to 98.95), so the two intervals overlap only marginally. CEE achieved a fitness of 97.92 (97.41 to 98.43), which is worse than that of BIN-NLCEE and statistically equal to that of NLCEE. For the PD class, all three methods obtained statistically equal results according to the confidence interval. BIN-NLCEE obtained a fitness of 97.34 (ranging from 96.79 to 97.89), while NLCEE and CEE obtained 95.78 (94.66 to 96.90) and 96.13 (95.41 to 96.85), respectively.

For the mammography dataset, all three methods obtained equal results according to the confidence value. BIN-NLCEE achieved a fitness of 85.86 (ranging from 85.21 to 86.51) for the benign class and 84.23 (83.64 to 84.82) for the malignant class. NLCEE and CEE achieved a fitness of 85.64 (85.00 to 86.28) and 85.23 (84.60 to 85.86) for the benign class and 84.28 (83.64 to 84.92) and 83.78 (83.16 to 84.40) for the malignant class, respectively.

Finally, for the COVID-19 dataset, BIN-NLCEE achieved better results in the tested_negative class compared to CEE. BIN-NLCEE obtained an average fitness of 64.56 (64.17 to 64.95), NLCEE obtained an average fitness of 64.22 (63.81 to 64.63), and CEE obtained an average fitness of 63.52 (63.13 to 63.91). In the tested_positive class, BIN-NLCEE achieved better results compared to all other methods. BIN-NLCEE obtained an average fitness of 65.06 (64.67 to 65.45), NLCEE obtained an average fitness of 63.62 (63.28 to 63.96), and CEE obtained an average fitness of 62.91 (62.54 to 63.28).

Thus, in 20 comparisons, the BIN-NLCEE achieved better results in 9 (45%) and, according to the confidence interval, equal results in 11 (55%). In this way, the BIN-NLCEE was equal to or better than NLCEE and CEE in 100% of the comparisons, not being worse in any results.
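The better-or-equal decisions reported above are consistent with a simple interval rule: two results count as equal when their confidence intervals overlap, and one is better only when its interval lies entirely above the other. The sketch below applies this rule to three pairs of intervals quoted in the text. The function name `compare` is hypothetical, and the rule itself is an interpretation of "according to the confidence interval", not a procedure spelled out by the authors.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)

def compare(a: Interval, b: Interval) -> str:
    """Classify result `a` against `b` by confidence-interval overlap."""
    if a[0] > b[1]:
        return "better"
    if a[1] < b[0]:
        return "worse"
    return "equal"

# Intervals quoted in the text (BIN-NLCEE first).
diabetes_negative = compare((82.69, 83.99), (81.35, 82.65))   # vs NLCEE
parkinsons_healthy = compare((98.94, 99.60), (98.11, 98.95))  # vs NLCEE
mammography_benign = compare((85.21, 86.51), (84.60, 85.86))  # vs CEE
print(diabetes_negative, parkinsons_healthy, mammography_benign)
```

The Parkinson’s healthy case shows how strict the rule is: an overlap of 0.01 is enough for the two methods to be counted as equal.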

Table 9 shows the results obtained by four traditional methods used for the classification task: J48 [24], IBK [25], naive Bayes (NB) [26], and SVM [27]. These methods were run using Weka [28], an open-source machine learning and data mining software package. To make a fair comparison, the methods were evaluated using the same measure as the GAs, shown in Equation (3). The first column presents the dataset name, and the remaining columns present the results in the same format as Table 7 and Table 8.

For the breast-w dataset, SVM achieved the best result, with a fitness of 97.02 (ranging from 96.96 to 97.08), while J48, IBK, and NB obtained 92.94 (92.77 to 93.11), 95.07 (94.99 to 95.15), and 92.07 (92.01 to 92.13), respectively.

NB achieved the best result in the diabetes dataset, with a fitness of 72.11 (ranging from 72.00 to 72.22), while J48, IBK, and SVM obtained fitness levels of 70.66 (70.41 to 70.91), 66.65 (66.49 to 66.81), and 71.38 (71.28 to 71.48), respectively.

For the Parkinson’s dataset, IBK obtained the best result, with a fitness of 95.56 (ranging from 95.34 to 95.78). J48 achieved the second best result, with a fitness of 79.70 (79.09 to 80.31). NB and SVM obtained 76.80 (76.59 to 77.01) and 74.94 (74.73 to 75.15), respectively.

In the mammography dataset, J48 obtained the best result, with a fitness of 81.57 (81.46 to 81.68), while IBK, NB, and SVM obtained fitness levels of 74.68 (74.53 to 74.83), 77.80 (77.70 to 77.90), and 79.65 (79.56 to 79.74), respectively.

Finally, for the COVID-19 dataset, naive Bayes obtained the best result, with a fitness of 57.69 (57.60 to 57.78), while J48, IBK, and SVM obtained fitness levels of 56.39 (56.26 to 56.52), 54.89 (54.78 to 55.00), and 50.15 (50.08 to 50.22), respectively.

BIN-NLCEE generates two results, one per class, as shown in Table 7 and Table 8. Because of this, and given the medical domain of the data, we chose the results found for the positive samples (shown in Table 8). Table 10 shows the fitness difference between BIN-NLCEE and the traditional classifiers. In Table 10, the smallest difference is shown in column S, and the biggest difference is shown in column B. Positive values represent better results for BIN-NLCEE compared to traditional classifiers, and negative values represent worse results.

Analyzing the results globally, BIN-NLCEE obtained better results and was able to achieve lower variance when compared with all traditional classifiers. Among the traditional classifiers, a different method obtained the best fitness value for each dataset. Comparing BIN-NLCEE with the traditional classifiers, the proposed method obtained better results for all datasets, i.e., it achieved better results in 100% of comparisons.

For the breast-w dataset, the fitness difference varied from 0.46 (SVM (S)) to 6.23 (NB (B)). Large differences were obtained for the diabetes dataset, where they varied from 9.48 (NB (S)) to 16.49 (IBK (B)). For the Parkinson’s dataset, the fitness difference varied from 1.01 (IBK (S)) to 23.16 (SVM (B)). For the mammography dataset, the fitness difference varied from 1.96 (J48 (S)) to 10.29 (IBK (B)). Finally, for the COVID-19 dataset, the fitness difference varied from 6.89 (NB (S)) to 15.37 (SVM (B)).
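The quoted differences are consistent with reading S as the lower bound of the BIN-NLCEE interval minus the upper bound of the classifier interval, and B as the upper bound of BIN-NLCEE minus the lower bound of the classifier. This reading is inferred from the numbers in the text and is not stated by the authors. The sketch below checks it for the Parkinson’s PD class, using a hypothetical helper named `difference_range`.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)

def difference_range(bin_nlcee: Interval, classifier: Interval) -> Tuple[float, float]:
    """Return (S, B): the smallest and biggest gap between two intervals."""
    smallest = bin_nlcee[0] - classifier[1]
    biggest = bin_nlcee[1] - classifier[0]
    return round(smallest, 2), round(biggest, 2)

# Parkinson's, PD class: intervals quoted in the text.
bin_pd = (96.79, 97.89)
ibk = (95.34, 95.78)
svm = (74.73, 75.15)
s_ibk, _ = difference_range(bin_pd, ibk)
_, b_svm = difference_range(bin_pd, svm)
print(s_ibk, b_svm)  # 1.01 23.16
```

A positive S therefore means that the two intervals do not overlap at all, which is a stronger statement than a positive difference between the averages.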

BIN-NLCEE had higher fitness values according to the results shown previously, and it is important to understand the reasons for that. Figures 4–13 show the convergence comparison between BIN-NLCEE, NLCEE, and CEE for all five datasets. A single random seed (seed 27) was chosen to compare the convergence of the three methods during execution. As can be seen, BIN-NLCEE achieves better convergence rates in the breast-w, diabetes, and Parkinson’s datasets.

Also, for the COVID-19 dataset, BIN-NLCEE obtained a better convergence rate for the tested_positive class. We believe that this can be explained by the binary representation.

CEE and NLCEE use integer and real values in the weight and value fields of the gene, and their mutation operator works by adding a value to or subtracting a value from the field. BIN-NLCEE, on the other hand, works by flipping a bit value, which can make the exploration of the search space more efficient and help avoid local minima.

Table 11 presents a rule set obtained by the BIN-NLCEE method. The first column shows the dataset name, and the second column shows the rule for each class. It is possible to see that the rules are very understandable and simple.

With this approach, the BIN-NLCEE generates several rules for each dataset. In this way, it is possible to choose among several possibilities those rules with the highest fitness value for testing.

6. Final Remarks

As mentioned above, GAs have been developed to find good classification rules, and it is important to keep improving these methods to achieve better results. In this work, a refinement of the NLCEE was proposed that replaces its chromosome structure with a binary representation.

The proposed method was run on five medical domain datasets and compared with two GAs and four traditional classifiers. The results showed that the proposed BIN-NLCEE achieved better results than the traditional classifiers. When compared with the GAs, BIN-NLCEE achieved better or statistically equivalent fitness values (better in 9 of 20 comparisons and equivalent in 11, according to the confidence intervals) and better convergence rates. Therefore, it is possible to conclude that, for these datasets, the binary representation can improve GA performance.

As mentioned before, the proposed method provides a rule set at the end of its execution, allowing the specialist to choose a rule based on its fitness and interpretability. Thus, the BIN-NLCEE can be considered a good specialist-oriented method.

In future work, we plan to test the proposed GA with a higher number of bits in its binary representation, build IF-THEN rules for different and more complex datasets (real-world medical and biological datasets), and test other genetic parameters, such as the mutation rate. We would also like to implement and investigate a new approach that uses an attribute evaluation method to select the best attributes to compose the rule for high-dimensionality datasets.

Author Contributions

Conceptualization, L.R.d.A. and M.d.S.G.; methodology, A.H.d.S.A.; software, A.H.d.S.A. and G.A.C.N.; validation, M.d.S.G. and L.D.S.; formal analysis, P.L.L.B. and L.R.d.A.; investigation, A.H.d.S.A.; resources, A.H.d.S.A. and G.A.C.N.; data curation, A.H.d.S.A. and G.A.C.N.; writing—original draft preparation, A.H.d.S.A. and G.A.C.N.; writing—review and editing, L.R.d.A., L.D.S., P.L.L.B. and M.d.S.G.; visualization, L.R.d.A., L.D.S., P.L.L.B. and M.d.S.G.; supervision, L.R.d.A.; project administration, L.R.d.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank CAPES, CNPq, FAPEMIG, and PROPP/UFU for financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. CEE chromosome structure.

Figure 2. NLCEE chromosome structure.

Figure 3. BIN-NLCEE gene—left part of the chromosome. The right part of the chromosome has the same structure as the left side.

Figure 4. Convergence comparison for the breast-w dataset (M class).

Figure 5. Convergence comparison for the breast-w dataset (B class).

Figure 6. Convergence comparison for the diabetes dataset (tested_negative class).

Figure 7. Convergence comparison for the diabetes dataset (tested_positive class).

Figure 8. Convergence comparison for the Parkinson’s dataset (healthy class).

Figure 9. Convergence comparison for the Parkinson’s dataset (PD class).

Figure 10. Convergence comparison for the mammography dataset (benign class).

Figure 11. Convergence comparison for the mammography dataset (malignant class).

Figure 12. Convergence comparison for the COVID-19 dataset (tested_negative class).

Figure 13. Convergence comparison for the COVID-19 dataset (tested_positive class).

Table 1. Summary of the datasets used in this study.

Dataset	Attributes	Samples	Classes
Breast-W	32	569	Malignant (M), Benign (B)
Pima-Indians-diabetes	9	768	tested_negative, tested_positive
Parkinson’s	23	195	Healthy, PD (Parkinson’s Disease)
Mammography	5	961	Benign, Malignant
COVID-19	7	3158	tested_negative, tested_positive

Table 2. Breast-W dataset features.

Class	# of Samples	% of Samples
M	212	37.25%
B	357	62.75%
Total	569	100%

Table 3. Pima-Indians-diabetes dataset features.

Class	# of Samples	% of Samples
tested_negative	500	65.11%
tested_positive	268	34.89%
Total	768	100%

Table 4. Parkinson’s dataset features.

Class	# of Samples	% of Samples
healthy	48	24.62%
PD	147	75.38%
Total	195	100%

Table 5. Mammography dataset features.

Class	# of Samples	% of Samples
benign	516	53.69%
malignant	445	46.31%
Total	961	100%

Table 6. COVID-19 dataset features.

Class	# of Samples	% of Samples
tested_negative	2229	70.58%
tested_positive	929	29.41%
Total	3158	100%

Table 7. Fitness values for negative cases of each dataset.

Dataset (class)	BIN-NLCEE	NLCEE	CEE
breast (B)	98.45 ± 0.28	98.09 ± 0.33	97.46 ± 0.33
diabetes (tested_negative)	83.34 ± 0.65	82.00 ± 0.65	80.92 ± 0.66
Parkinson’s (healthy)	99.27 ± 0.33	98.53 ± 0.42	97.92 ± 0.51
mammography (benign)	85.86 ± 0.65	85.64 ± 0.64	85.23 ± 0.63
COVID-19 (tested_negative)	64.56 ± 0.39	64.22 ± 0.41	63.52 ± 0.39

Table 8. Fitness values for positive cases of each dataset.

Dataset (class)	BIN-NLCEE	NLCEE	CEE
breast (M)	97.89 ± 0.35	97.18 ± 0.37	96.45 ± 0.42
diabetes (tested_positive)	82.34 ± 0.64	80.93 ± 0.66	80.21 ± 0.72
Parkinson’s (PD)	97.34 ± 0.55	95.78 ± 1.12	96.13 ± 0.72
mammography (malignant)	84.23 ± 0.59	84.28 ± 0.64	83.78 ± 0.62
COVID-19 (tested_positive)	65.06 ± 0.39	63.62 ± 0.34	62.91 ± 0.37

Table 9. Fitness values of 100 executions, with distinct random seeds, for traditional classifiers.

	J48	IBK	Naive Bayes	SVM
breast	92.94 ± 0.17	95.07 ± 0.08	92.07 ± 0.06	97.02 ± 0.06
diabetes	70.66 ± 0.25	66.65 ± 0.16	72.11 ± 0.11	71.38 ± 0.10
Parkinson’s	79.70 ± 0.61	95.56 ± 0.22	76.80 ± 0.21	74.94 ± 0.21
mammography	81.57 ± 0.11	74.68 ± 0.15	77.80 ± 0.10	79.65 ± 0.09
COVID-19	56.39 ± 0.13	54.89 ± 0.11	57.69 ± 0.09	50.15 ± 0.07

Table 10. Fitness difference between BIN-NLCEE and traditional classifiers.

Dataset	J48 (S)	J48 (B)	IBK (S)	IBK (B)	NB (S)	NB (B)	SVM (S)	SVM (B)
breast	4.43	5.47	2.39	3.25	5.41	6.23	0.46	1.28
diabetes	10.79	12.57	14.89	16.49	9.48	10.98	10.22	11.70
Parkinson’s	16.48	18.80	1.01	2.55	19.78	21.30	21.64	23.16
mammography	1.96	3.36	8.81	10.29	5.74	7.12	3.90	5.26
COVID-19	8.15	9.19	9.67	10.67	6.89	7.85	14.45	15.37

Table 11. BIN-NLCEE rule set.

Dataset	Rule
Breast-w	IF ((A1 < 0.0625 OR A1 ≥ 0.4219) AND (A2 ≥ 0.2657) AND (A3 ≥ 0.1719) AND (A7 ≥ 0.0625) AND (A8 ≥ 0.1563) AND (A18 < 0.5938) AND (A20 < 0.6875) AND (A21 < 0.7188) AND (A30 < 0.7813)) THEN M
Breast-w	IF ((A15 < 0.8438) AND (A16 < 0.3125) AND (A17 < 0.9375) AND (A20 ≥ 0.1875 OR A20 < 0.1719) AND (A21 < 0.1875) AND (A23 < 0.5938) AND (A26 < 0.5157) AND (A29 < 0.4688) AND (A30 < 0.75) AND (A31 < 0.7188 OR A31 ≥ 0.8282)) THEN B
Diabetes	IF ((V1 < 0.25 OR V1 ≥ 0.5) AND (V2 < 0.8125) AND (V3 ≥ 0.0313) AND (V4 < 0.4219) AND (V8 < 0.25)) THEN tested_negative
Diabetes	IF ((V1 ≥ 0.0625) AND (V5 < 0.2969 OR V5 ≥ 0.625) AND (V6 < 0.625) AND (V8 ≥ 0.1094)) THEN tested_positive
Parkinson’s	IF ((V1 < 0.5625) AND (V2 ≥ 0.375) AND (V3 < 0.375 OR V3 ≥ 0.875) AND (V16 ≥ 0.3125) AND (V20 < 0.4063)) THEN healthy
Parkinson’s	IF ((V1 < 0.6875 OR V1 ≥ 0.7969) AND (V18 ≥ 0.4375)) THEN PD
Mammography	IF ((V2 < 0.7032 OR V2 ≥ 0.8438) AND (V3 < 0.9063) AND (V5 ≥ 0.1875)) THEN benign
Mammography	IF ((V3 ≥ 0.8282) AND (V4 < 0.0782 OR V4 ≥ 0.4844) AND (V5 < 0.9219)) THEN malignant
COVID-19	IF (A3 ≥ 0.25 AND A5 ≥ 0.5313) THEN 1
COVID-19	IF (A3 < 0.3594 AND A5 < 0.7813 AND A6 < 0.4375) THEN 2

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
