1. Introduction
Breast cancer persists as a critical public health issue affecting women internationally, including populations in both Western nations and Asia. As reported by the GLOBOCAN project [1], breast cancer accounts for approximately 25.4% of all invasive malignancies diagnosed in women and represents 15% of the total cancer burden among the female population. Globally, breast cancer ranks first in incidence and fifth in mortality among women. Annual age-standardized rates indicate that the incidence and mortality of breast cancer are 47.0 and 11.3 per 100,000 women, respectively. In absolute terms, an estimated 2,295,832 women are diagnosed with breast cancer, and 684,996 women die from the disease each year worldwide.
In Taiwan, breast cancer incidence showed a marked rise, increasing from 28.4 to 64.28 cases per 100,000 women between 1995 and 2011 [2]. During a similar period, from 1995 to 2013, the mortality rate also experienced an upward trend, climbing from 9.55 to 11.41 deaths per 100,000 women [2]. By 2013, breast cancer had become the most commonly diagnosed malignancy among women in Taiwan and was the fourth leading cause of cancer-related deaths across the entire population [3].
Conventional approaches to breast cancer diagnosis typically involve imaging techniques such as mammography, ultrasound, and MRI, followed by histopathological examination and biomarker analysis (e.g., ER, PR, and HER2 status). Prognosis is often assessed with clinical staging systems such as TNM and with genomic profiling tools. While these methods are effective, they can be time-consuming, costly, and dependent on expert interpretation.
Recent progress in medical technologies has played a crucial role in reducing breast cancer mortality over the past decade. Enhanced diagnostic tools and therapeutic interventions have significantly improved patient outcomes. Current projections suggest that close to 97% of women receiving a breast cancer diagnosis survive for at least five years post-diagnosis. Additionally, advancements in early detection methods have led to a 10% to 15% increase in five-year survival rates across various types of breast tumors [4]. As a result, the development of accurate and efficient predictive models for early breast cancer diagnosis remains a pressing and valuable objective [5].
The rapid evolution of medical sensor technologies and hospital information systems has fundamentally transformed the processes of acquiring, managing, and applying healthcare data. This transformation has led to explosive growth in electronic medical records, forming extensive healthcare databases. There is a critical need for sophisticated analytical techniques capable of efficiently performing tasks such as data retrieval, querying, analysis, and knowledge discovery to exploit the wealth of information contained within these vast datasets.
Data mining involves uncovering practical knowledge from large-scale datasets and is often associated with automated discovery techniques. It follows a systematic procedure designed to extract implicit patterns, associations, and trends, that is, insights typically too subtle or complex to be revealed through manual analysis. This extracted information is then used to construct predictive models. Data mining is a relatively recent technological advancement, emerging as a distinct discipline in 1994 [6]. Owing to its wide-ranging applicability across various domains, it has garnered significant attention from researchers and medical professionals [7].
Enhancing data mining and knowledge discovery techniques for efficiently extracting breast cancer patterns has emerged as a critical research focus. In this context, interpretable machine learning models, particularly rule-based systems, represent a promising addition to conventional diagnostic methods. They support automated classification while ensuring transparency, which remains vital for effective clinical decision making. This study proposes a novel classification technique for extracting malignant prediction rules from training datasets containing numerical and binary nominal attributes. The classification technique introduced in this study facilitates the discovery of breast cancer patterns by integrating a real-coded genetic algorithm, an adaptive directed mutation operator, and a two-level malignant-rule-mining process. To achieve this, two publicly available datasets from the United States and Taiwan are employed to derive predictive classification rules to improve diagnostic accuracy and facilitate more informed medical judgment.
2. Related Work
2.1. Genetic Algorithm Techniques
Evolutionary algorithms utilize randomly generated populations of candidate solutions to explore the search space. Because they apply stochastic search strategies, they are widely acknowledged as practical tools for global optimization. Evolutionary algorithms can outperform traditional gradient-based approaches, particularly when addressing complex, high-dimensional optimization challenges. Three population-based heuristic methodologies constitute the domain of evolutionary algorithms: genetic algorithms (GAs), evolutionary programming, and evolutionary strategies. Genetic algorithms are the most extensively applied of the three [8].
GAs are probabilistic search techniques developed to locate optimal solutions in expansive and intricate search domains. Distinguished by their adaptive mechanisms, GAs offer several performance advantages compared with conventional search approaches [9,10]. Drawing inspiration from Darwinian evolutionary theory, notably the mechanisms of natural selection and the survival of the fittest [9,10], these algorithms simulate biological evolution. Individuals exhibiting superior fitness are more prone to passing on their genetic characteristics in this process. Rather than relying on a single initial estimate, GAs initiate the search process using a diverse population of candidate solutions spread across the problem space. The evolutionary process is driven by three fundamental genetic operators (selection, crossover, and mutation), which iteratively refine the population and steer the search toward a globally optimal solution [9,10].
Conventional implementations of GAs often represent decision variables using binary encoding, known as the binary-coded genetic algorithm (BCGA) [9,10]. This method effectively solves low- to medium-complexity problems, where high numerical precision is not a critical requirement. Nonetheless, BCGA encounters significant computational load and memory consumption challenges when applied to high-dimensional optimization tasks that demand finer resolution [11]. Real-coded genetic algorithms (RCGAs), in which decision variables are directly encoded as real numbers, have been introduced to mitigate these limitations. Empirical studies have shown that RCGAs outperform their binary-coded counterparts, particularly when applied to medical data mining [12]. Alhijawi and Awajan comprehensively reviewed genetic algorithms, detailing their theoretical foundations, genetic operators, solution strategies, and diverse real-world applications. They highlighted recent advancements and integration with neural networks for optimization tasks [13]. Recent advancements have improved the efficiency of GAs through enhanced mutation techniques [14] and hybrid integration with machine learning [15] and deep learning [16]. Their adaptability and optimization potential make GAs an essential tool in computational research.
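To make the selection, crossover, and mutation cycle described above concrete, the following is a minimal sketch of a generic real-coded GA. It is not the ADM-RCGA: binary tournament selection, arithmetic crossover, random-reset mutation, single-individual elitism, the `sphere` toy objective, and all function and parameter names are assumptions chosen for brevity.

```python
import random


def sphere(x):
    """Toy objective to minimize: sum of squares (optimum at the origin)."""
    return sum(v * v for v in x)


def evolve(objective, n_genes=3, bounds=(-5.0, 5.0), pop_size=30,
           generations=100, crossover_rate=0.6, mutation_rate=0.1, seed=0):
    """Generic real-coded GA (illustrative only, not the ADM-RCGA)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Each individual is a list of real-valued genes (real coding).
    pop = [[rng.uniform(lo, hi) for _ in range(n_genes)] for _ in range(pop_size)]
    best = min(pop, key=objective)

    def pick():
        # Binary tournament selection: the fitter of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if objective(a) < objective(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_rate:
                w = rng.random()
                # Arithmetic crossover: a convex combination of the parents.
                child = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
            else:
                child = p1[:]
            # Random-reset mutation, applied gene by gene.
            child = [rng.uniform(lo, hi) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = children
        best = min(pop, key=objective)
    return best


solution = evolve(sphere)
```

Because the best individual is always copied into the next generation, the best objective value in this sketch never gets worse from one generation to the next.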
2.2. Data Mining Applications in Healthcare
Healthcare transactions generate vast quantities of highly complex data that exceed the capacity of traditional methods for effective processing and analysis. Data mining provides the requisite methodologies and technologies to convert these voluminous datasets into practical insights, thus supporting evidence-based decision making. Data mining has shown considerable promise in the healthcare industry, including assessing treatment effectiveness, healthcare administration, patient relations management, and identifying fraudulent activities and misuse [17]. As clinical data continue to grow at an exponential rate, clinical databases have become a key area where data mining is not only beneficial but also indispensable. As a result, data mining has become an increasingly vital tool in healthcare, playing a crucial role as an intelligent aid for diagnosis [18].
Data mining describes identifying novel patterns and trends in large datasets and harnessing these insights to create predictive models [19]. By examining vast amounts of data, data mining can uncover patterns that may be too intricate or subtle for human analysts to detect [20]. Data mining primarily aims to discern correlations and patterns within data that are valid, novel, potentially useful, and comprehensible [21]. In contrast to traditional statistical analysis and data visualization techniques, data mining actively enables the discovery of new insights through machine learning algorithms, a key component of the broader field of artificial intelligence [7]. Data mining methodologies are generally categorized according to their intended function, encompassing descriptive and visual analysis, relationship and grouping analysis, and predictive modeling tasks such as classification and estimation. Among these, predictive modeling, particularly classification, is highly prevalent. Classification involves predicting a categorical target variable by assigning new instances to one of several predefined classes. As a core pattern recognition component, classification has been approached through various methodologies. In healthcare, classification analysis is critical in disease categorization within clinical databases.
Previous studies frequently employed logistic and multinomial regression techniques to develop classification models. However, real-world classification problems often exhibit high levels of nonlinearity and complexity [22]. Consequently, contemporary research has investigated diverse nonlinear and sophisticated machine learning methodologies for extracting breast cancer patterns. These encompass approaches such as decision trees, naïve Bayes classifiers, artificial neural networks, support vector machines, fuzzy logic systems, genetic algorithms, particle swarm optimization, and integrated techniques combining learning and data mining strategies. Comprehensive surveys of various classification methods applied to breast cancer datasets have been presented in prior studies [5,22,23,24]. Experimental findings from these studies indicate that machine learning techniques consistently outperform traditional statistical methods regarding predictive accuracy. Consequently, increasing attention has been directed toward classification risk modeling using soft computing techniques.
Although evolutionary computing can generate interpretable classification rules with superior performance, it typically demands greater computational resources than conventional black-box approaches [22,23]. Enhancing the efficiency of rule extraction systems by developing high-performance evolutionary algorithms has become a critical area of research for effective knowledge discovery from large datasets. This study proposes an Adaptive Directed Mutation Real-Coded Genetic Algorithm (ADM-RCGA) [25] as a data mining method for deriving diagnostic knowledge rules for breast tumors. The derived rules aim to enhance disease control and management.
GAs have recently been increasingly applied to medical problems, providing efficient solutions to complex tasks. In classifying diabetes mellitus, GAs have assisted in diagnostic decision making, showcasing their ability to optimize complex data analysis [26]. They have also demonstrated their effectiveness in optimizing hospital patient scheduling, significantly improving service efficiency [27]. Moreover, GAs have been applied in medical image segmentation and classification, enhancing feature selection and accuracy [28,29]. These applications demonstrate GAs’ versatility in medical research, making them a valuable tool for optimization, diagnosis, and decision making.
3. Methodology
3.1. ADM-RCGA
The effectiveness of a GA largely depends on the design of efficient search operators, which determine the system’s optimal performance. A key challenge in GA performance is premature convergence. Avoiding entrapment in local optima relies on the mutation operator, which introduces novel solutions and maintains the diversity of the population during the search. However, this benefit often comes at the cost of slowing the convergence rate. Few studies have focused on designing novel mutation operators specifically for real-coded genetic algorithms (RCGAs) [25,30]. To address this gap, Tan and Tseng [25] proposed an adaptive directed mutation (ADM) operator, which has been effectively applied to 41 challenging function optimization problems.
The ADM operator was designed to incorporate random mutation while combining the effects of crossover and unsystematic search mechanisms to prevent premature convergence or concentration around specific chromosomes. This operator enhances solution diversity within the population and improves the likelihood of discovering optimal solutions. Previously obtained solutions guide the search for new candidate points and follow the adaptive direction of the gradient, hence the term “adaptive directed mutation”. The evolutionary trajectory of each individual’s fitness value tends to converge along the gradient direction. To describe this trend, we define the fitness differences of chromosome x across three successive generations, namely, (t − 2), (t − 1), and (t), as follows:
Here, x = {x1, x2, …, xk, …, xN} denotes a chromosome, and f(x(t)) is its fitness at generation t. The gene-wise variation in the k-th dimension for chromosome x over the three generations is given by the following:
By incorporating Equations (1) through (4), the next generation value of gene xk is updated through the following expression:
where the two bound terms in Equation (5) refer to the upper and lower bounds of the gene value, respectively, and pm denotes the adaptive mutation probability.
In Equation (5), an acceleration function regulates the directed mutation process. This study introduces four guided mutation strategies, applicable to any chromosome, each based on nine different evolutionary trends. These strategies are the following: (1) directional small-scale mutation, (2) random small-scale mutation, (3) random medium-scale mutation, and (4) random large-scale mutation. A comprehensive explanation of the ADM mechanism and its application can be found in Tan and Tseng [25].
The evolutionary process of the ADM-RCGA involves the following five key genetic operators: stochastic universal sampling (SUS) for selection, a variant of the blend crossover (BLX-α) combined with a crossover mask for recombination, an enhanced ADM operator for mutation, a multi-elite mechanism for elitism, and a replacement strategy. For a more in-depth explanation of these operators, refer to Tan and Tseng [25].
The ADM operator combines local directional search strategies with adaptive random search mechanisms to substantially improve the performance of RCGAs in reaching global optima and to expedite convergence. Prior empirical findings have established that the ADM-RCGA [25] is fast, accurate, and reliable, outperforming six state-of-the-art evolutionary algorithms. Within this study, the ADM-RCGA is applied to the problem of rule extraction for knowledge discovery.
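Two of the operators named above have well-known textbook forms, sketched below for orientation. This is a minimal illustration under stated assumptions: it shows standard stochastic universal sampling and plain BLX-α with clipping to the gene bounds, not the masked BLX-α variant or the ADM operator of [25]; the function names and arguments are hypothetical.

```python
import random


def sus_select(fitness, n, rng):
    """Stochastic universal sampling: n equally spaced pointers on the fitness wheel.

    `fitness` holds non-negative values to be maximized; returns selected indices.
    """
    step = sum(fitness) / n
    start = rng.uniform(0, step)  # a single random offset shared by all pointers
    chosen, cumulative, i = [], fitness[0], 0
    for k in range(n):
        pointer = start + k * step
        # Advance to the individual whose wheel segment contains the pointer.
        while cumulative < pointer and i < len(fitness) - 1:
            i += 1
            cumulative += fitness[i]
        chosen.append(i)
    return chosen


def blx_alpha(p1, p2, alpha, lower, upper, rng):
    """Plain BLX-alpha: sample each gene from the parents' interval widened by alpha."""
    child = []
    for a, b, lo, hi in zip(p1, p2, lower, upper):
        d = abs(a - b)
        g = rng.uniform(min(a, b) - alpha * d, max(a, b) + alpha * d)
        child.append(min(max(g, lo), hi))  # clip to the gene's bounds
    return child


rng = random.Random(0)
parents = sus_select([1.0, 1.0, 1.0, 1.0], 4, rng)
child = blx_alpha([1.0, 2.0], [1.0, 4.0], 0.5, [0.0, 0.0], [10.0, 3.5], rng)
```

With equal fitness values, SUS selects every individual exactly once, which illustrates why it has lower sampling variance than repeated roulette-wheel spins.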
3.2. Discovery of Knowledge Rules Using ADM-RCGA
Data mining techniques typically represent the discovered knowledge as if–then prediction rules. This approach provides a high-level, symbolic representation of knowledge, making the extracted information easier to interpret and understand [5,22,31,32,33]. This study introduces the ADM-RCGA as the primary data mining method for efficiently and swiftly generating knowledge rules for breast cancer classification.
3.2.1. Two-Level Malignant-Rule Mining Approach
In general, the mechanism of extracting multiple rules is commonly employed in rule-based mining systems [5,22,31] to enhance prediction accuracy. Within the rule-mining procedure, after a rule has been extracted, the data points still incorrectly classified by the preceding rule(s) will be used to guide the extraction of the following rule. This iterative process continues until all data are successfully classified [5,31]. However, relying on instances of false positives to mine subsequent rules may increase the risk of rule conflicts and negatively impact the overall performance of the classification system.
In medical applications, extracting applicable “malignant” rules that distinguish malignant from “benign” cases is essential. A two-level hierarchical malignant-rule-mining approach is proposed to enhance the overall accuracy of the rule-mining procedure. The two-level malignant-rule framework is easy to grasp intuitively; thus, as shown in Figure 1, it can be used to mine two ordered malignant rules for predicting breast cancer diagnosis.
Figure 2 illustrates the flow diagram of the proposed ADM-RCGA-based malignant-rule-mining approach. Some critical processes for extracting malignant rules from the training dataset involve rule representation, semi-automatic feature selection, and fitness evaluation of rules, which are explained as follows. In our method for rule mining, after the extraction of a malignant rule, only the data that were not classified by the preceding malignant rule(s) will be leveraged to derive the subsequent malignant rule.
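The ordered, two-level procedure can be sketched as a short loop. This is an illustration under stated assumptions rather than the authors’ implementation: `find_best_rule` is a hypothetical callback standing in for the ADM-RCGA search, a rule is modeled as a function that returns True when it predicts malignancy, and records not covered by any rule default to benign.

```python
def mine_ordered_rules(records, labels, find_best_rule, levels=2):
    """Extract up to `levels` malignant rules; each later rule sees only uncovered data."""
    remaining = list(zip(records, labels))
    rules = []
    for _ in range(levels):
        if not remaining:
            break
        rule = find_best_rule(remaining)  # hypothetical search step (e.g., a GA run)
        rules.append(rule)
        # Keep only the records that the new rule did not classify as malignant.
        remaining = [(r, y) for r, y in remaining if not rule(r)]
    return rules


def predict(rules, record):
    """Malignant if any rule fires, checked in order; otherwise benign."""
    return "malignant" if any(rule(record) for rule in rules) else "benign"


# Toy demonstration with a stub in place of the evolutionary search.
seen_sizes = []


def stub_search(remaining):
    seen_sizes.append(len(remaining))
    if len(seen_sizes) == 1:
        return lambda r: r["a"] > 5
    return lambda r: r["b"] > 5


toy_records = [{"a": 9, "b": 0}, {"a": 0, "b": 9}, {"a": 0, "b": 0}]
toy_labels = ["malignant", "malignant", "benign"]
toy_rules = mine_ordered_rules(toy_records, toy_labels, stub_search)
toy_predictions = [predict(toy_rules, r) for r in toy_records]
```

In the toy run, the second search call receives two records instead of three, because the first rule already covered one of them.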
3.2.2. Rules Representation and Semi-Automatic Feature Selection
Genetic representation encompasses a set of parameters tailored to solve particular problems. Various genetic representations provide distinct approaches to these challenges. If the user defines m as the number of selected feature variables, the rule structure will incorporate m variables. The general structure of a rule produced by the ADM-RCGA is as follows:
IF (V1 > or ≦ C1) AND (V2 > or ≦ C2) AND … AND (Vm > or ≦ Cm), THEN diagnosis is malignant.
If all m conditions are met, the model will categorize the breast cancer type as malignant. Here, V1 to Vm represent the features, the symbol “>” denotes the greater-than operator, the symbol “≦” indicates the less than or equal to operator, and C1 to Cm represent the corresponding threshold values. The feature variables V1 to Vm, the operator “>” or “≦”, and the cut-off values C1 to Cm are determined through the ADM-RCGA search process. Fully automatic feature selection requires a large amount of CPU time in evolutionary computation (EC)-based mining systems [22]. In the proposed approach, a semi-automatic feature selection mechanism is used instead: the user sets the number of significant features in one rule, as shown in Table 1. This compromise significantly increases the overall computational efficiency of the rule-mining procedure.
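One way to picture this representation is to decode a real-valued chromosome into m conditions. The encoding below is purely an assumption for illustration (three genes per condition: a feature selector in [0, 1), an operator gene, and a cut-off); the source does not specify the actual gene layout, and the function names are hypothetical.

```python
def decode_rule(chromosome, feature_names):
    """Decode a flat list of (selector, operator, cutoff) triples into conditions."""
    conditions = []
    for i in range(0, len(chromosome), 3):
        selector, op_gene, cutoff = chromosome[i:i + 3]
        # Map the selector in [0, 1) onto a feature index.
        index = min(int(selector * len(feature_names)), len(feature_names) - 1)
        op = ">" if op_gene >= 0.5 else "<="
        conditions.append((feature_names[index], op, cutoff))
    return conditions


def rule_fires(conditions, record):
    """True only when every condition holds, i.e., the record is predicted malignant."""
    return all(
        record[name] > cutoff if op == ">" else record[name] <= cutoff
        for name, op, cutoff in conditions
    )


features = ["Uniformity of Cell Shape", "Bare Nuclei"]
# m = 2 conditions, hence 6 genes in this hypothetical encoding.
rule = decode_rule([0.1, 0.9, 2.79, 0.7, 0.8, 2.94], features)
fires = rule_fires(rule, {"Uniformity of Cell Shape": 5, "Bare Nuclei": 4})
```

Fixing m in advance, as the semi-automatic scheme does, keeps the chromosome length constant at 3m genes in this sketch, which is one reason the search space stays small.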
To our knowledge, an RCGA can effectively handle numeric attributes but not nominal ones. This research adapts the proposed ADM-RCGA rule-mining framework to accommodate binary nominal attributes by interpreting threshold conditions on 0/1-coded values. For example, both HSV-1 and HHV-8 are binary attributes (0: negative/1: positive); a rule may be in the form of “IF HSV-1 > 0.6 and HHV-8 ≦ 0.3, THEN diagnosis is breast cancer”. Since the only binary value in the range “>0.6” is 1 and the only binary value in the range “≦0.3” is 0, the final, concise rule becomes “IF HSV-1 = 1 (positive) and HHV-8 = 0 (negative), THEN diagnosis is breast cancer”.
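The simplification of a threshold condition on a 0/1-coded attribute can be expressed in a few lines. The helper name and the string encoding of the operators are assumptions made for this sketch.

```python
def simplify_binary_condition(op, cutoff):
    """Return the binary values (0 and/or 1) that satisfy `value op cutoff`."""
    if op == ">":
        return [v for v in (0, 1) if v > cutoff]
    return [v for v in (0, 1) if v <= cutoff]


hsv1_values = simplify_binary_condition(">", 0.6)    # HSV-1 > 0.6
hhv8_values = simplify_binary_condition("<=", 0.3)   # HHV-8 <= 0.3
```

A result with a single value corresponds to an equality test such as “HSV-1 = 1”. A result containing both values means the condition is always true and can be dropped, and an empty result means the condition can never be satisfied.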
3.2.3. Fitness Evaluation of Rules
A fitness function serves to quantify the merit of individual parameter sets within a given population, thereby guiding genetic algorithms (GAs) in discerning which sets exhibit a higher likelihood of persistence and refinement across generations. Carefully selecting a suitable fitness function is pivotal in practical optimization endeavors. The current research employs the GA methodology to pinpoint the rule that demonstrates superior performance from the existing pool of candidate rules. To concurrently assess the precision and scope of every rule, the rule-mining strategy presented herein utilizes a multi-component fitness measure, formulated as outlined below:
where the component measures quantify the coverage and accuracy of a rule [32,33]. The fitness function aims to optimize both predictive precision and the extent of data encompassed, acting as a robust selection criterion to identify rules that attain the highest hit rate across the complete dataset.
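The exact fitness formula is not reproduced in this text. As a stand-in, the sketch below uses one common choice from the rule-mining literature, the product of sensitivity and specificity, which rewards rules that both cover the malignant cases and avoid firing on benign ones. This choice, like the function name, is an assumption and should not be read as the fitness function used by the ADM-RCGA.

```python
def rule_fitness(predictions, labels):
    """Assumed fitness: sensitivity * specificity for a single malignant rule.

    `predictions[i]` is True when the rule fires on record i;
    `labels[i]` is True when record i is actually malignant.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fn = sum(1 for p, y in pairs if not p and y)
    tn = sum(1 for p, y in pairs if not p and not y)
    fp = sum(1 for p, y in pairs if p and not y)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity * specificity


mixed = rule_fitness([True, True, False, False], [True, False, True, False])
perfect = rule_fitness([True, False], [True, False])
```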
4. Experimental Results
This investigation employed two distinct medical datasets to assess the efficacy of the proposed ADM-RCGA framework in discerning patterns within breast cancer. Furthermore, the developed model underwent a comparative analysis against existing models from prior research to evaluate its relative performance. Another aim was to ascertain the model’s resilience across varied medical datasets, substantiating its potential for broader applicability. All experimentation was conducted on a system featuring an Intel Pentium E2180 2 GHz central processing unit, complemented by 2 GB of random-access memory, operating under the Windows XP environment.
4.1. Breast Cancer Datasets
The two breast cancer diagnosis datasets originated from American and Taiwanese populations. Data for the Wisconsin Breast Cancer Database (WBCD) were gathered at the University of Wisconsin [34] and contributed to the UCI Machine Learning Repository in 1991. The Chung-Shan Breast Cancer (CSBC) dataset was compiled at Chung-Shan Medical University and published in 2007 [35] and 2009 [33]. The classification task in both datasets involved distinguishing between benign and malignant breast tumors based on various physical or genetic attributes of cells or DNA. A brief description of these datasets’ characteristics is provided below.
4.1.1. The Wisconsin Breast Cancer Database (WBCD)
The WBCD dataset, a compilation of 699 cases derived from fine needle aspirates of human breast tissue, as detailed in Table 2, encompasses nine numerical characteristics per instance (excluding the sample identification number). The dataset evaluates several diagnostic features, such as clump thickness, cell size and shape consistency, marginal adhesion, size of individual epithelial cells, presence of bare nuclei, chromatin texture, visibility of nucleoli, and mitotic activity. Each characteristic is assessed using an integer scale ranging from 1 to 10, where lower scores (close to 1) typically indicate benign features, and higher scores (close to 10) reflect more severe or malignant traits. Sixteen instances in the dataset contain missing values. Following the approach proposed by Yeh et al. [5], these missing values were imputed using the most frequently occurring value for each respective attribute. Each instance is labeled as benign or malignant, with the dataset comprising 458 benign cases (65.52%) and 241 malignant cases (34.48%).
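Most-frequent-value imputation of this kind can be sketched as follows. The `"?"` missing-value marker and the function name are assumptions for illustration, and the toy rows are made up rather than taken from the WBCD.

```python
from collections import Counter


def impute_most_frequent(rows, missing="?"):
    """Replace each missing entry with the most frequent observed value in its column."""
    columns = list(zip(*rows))
    modes = [
        Counter(v for v in col if v != missing).most_common(1)[0][0]
        for col in columns
    ]
    return [
        [modes[j] if v == missing else v for j, v in enumerate(row)]
        for row in rows
    ]


toy_rows = [["1", "?"], ["1", "2"], ["3", "2"]]
imputed = impute_most_frequent(toy_rows)
```

The sketch assumes every column has at least one observed value; a column that is entirely missing would need separate handling.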
Table 2 also presents descriptive statistics and the results of t-tests for the nine attributes, illustrating their significance in distinguishing between benign and malignant tumors. All nine features demonstrate statistical significance at the 95% confidence level, confirming their relevance for breast cancer classification in the WBCD dataset.
4.1.2. The Chung–Shan Breast Cancer (CSBC) Database
The CSBC dataset, a collection of 80 cases detailed in Table 3, was compiled using polymerase chain reaction (PCR) and Southern blotting techniques to analyze human breast tissue specimens for β-globin. Each case includes five nominal attributes representing the presence of the following DNA viruses: HSV-1, EBV, CMV, HPV, and HHV-8. The values for these attributes are categorical, denoting either a positive (detected) or negative (not detected) result for each virus. All 80 specimens underwent β-globin screening to ascertain the presence of DNA viruses. Each sample is further categorized into one of the following two classes: fibroadenoma (benign) or breast cancer (malignant). Within this dataset, 52 (65%) of the tissue samples were classified as non-familial invasive ductal breast cancer, while 28 (35%) were identified as mammary fibroadenomas.
Table 3 presents the descriptive statistics for each class within the CSBC dataset. Furthermore, to highlight the significance of the five DNA viruses, Table 3 also includes the p-values derived from a chi-squared (χ2) test. The statistical results presented in Table 3 indicate that only HSV-1, CMV, and HHV-8 are significantly associated with the class label at the 95% confidence level, that is, only these three viruses help distinguish between fibroadenoma (benign) and breast cancer (malignant) cases in the CSBC dataset.
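For a binary attribute and a binary class, the chi-squared statistic reduces to a closed form over the 2×2 contingency table. The sketch below assumes the uncorrected Pearson statistic (the text does not say whether a continuity correction was applied) and uses made-up counts, not the CSBC data; 3.841 is the standard critical value for one degree of freedom at the 5% significance level.

```python
CRITICAL_95_DF1 = 3.841  # chi-squared critical value, 1 degree of freedom, alpha = 0.05


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for the table

        [[a, b],   rows: attribute positive / negative
         [c, d]]   columns: malignant / benign
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


independent = chi_squared_2x2(10, 10, 10, 10)   # illustrative counts only
associated = chi_squared_2x2(20, 0, 0, 20)      # illustrative counts only
is_significant = associated > CRITICAL_95_DF1
```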
The acquired samples were divided into the following two distinct subsets: a training set and a testing set. The training set served as the basis for developing classification rules, whereas the testing set, maintained independently of the rule induction process, was employed to evaluate the efficacy of the ADM-RCGA-based classifier. To facilitate a direct comparison with prior research [5,33], this study adhered to the same simulation parameters. Specifically, for the WBCD dataset, 466 instances (representing 66.7% of the total) were assigned to the training set, and the remaining 233 instances (33.3%) were allocated to the testing set. In the case of the CSBC dataset, 64 instances (80%) were designated for training, and 16 instances (20%) were reserved for testing purposes.
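A holdout partition with these sizes can be sketched as below. The text does not state how instances were assigned to each subset, so the seeded random shuffle and the function name are assumptions.

```python
import random


def holdout_split(n_instances, n_train, seed=0):
    """Shuffle instance indices and cut them into training and testing index lists."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


wbcd_train, wbcd_test = holdout_split(699, 466)   # 466 / 233 instances
csbc_train, csbc_test = holdout_split(80, 64)     # 64 / 16 instances
```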
4.2. Comparative Results on the WBCD Dataset
Table 4 summarizes the experimental outcomes obtained by applying the ADM-RCGA to the WBCD dataset. To assess the proposed method’s effectiveness, the top 10 results from 50 independent executions are presented. The experimental setup maintained consistent configurations throughout, with the evolutionary process initialized using 30 individuals per population and capped at 100 generations. The algorithm employed a crossover rate of 60% and a mutation rate of 10%, and it retained 20% of the top-performing individuals through elitism. The algorithm also terminated early if the best fitness value did not change for 20 consecutive generations. In Rule 1, the highest classification accuracy achieved was 0.9292 for the training set and 0.9657 for the testing set. The corresponding average accuracies were 0.9272 and 0.9554, respectively. When misclassified instances remained, a new rule was generated using the hierarchical rule-mining strategy. For Rule 2, the best accuracy improved to 0.9506 and 0.9914 on the training and testing sets, respectively, with average accuracies of 0.9500 and 0.9867. The average CPU execution times were 33.64 s for training and 2.25 s for testing. These results demonstrate that the proposed ADM-RCGA framework offers a fast and accurate breast cancer pattern mining approach.
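The reported settings and the two stopping conditions can be collected in a small configuration object. The class, field, and function names are hypothetical; only the numeric values come from the text, and the stall check reflects one reading of “20 consecutive generations without any change in the best fitness value”.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GAConfig:
    population_size: int = 30
    max_generations: int = 100
    crossover_rate: float = 0.60
    mutation_rate: float = 0.10
    elite_fraction: float = 0.20
    stall_generations: int = 20


def should_stop(best_history, config):
    """Stop at the generation cap or once the best fitness has stalled.

    `best_history` holds the best fitness recorded after each generation.
    """
    if len(best_history) >= config.max_generations:
        return True
    # A stall needs one reference value plus `stall_generations` unchanged values.
    window = best_history[-(config.stall_generations + 1):]
    return len(window) > config.stall_generations and len(set(window)) == 1


config = GAConfig()
n_elite = round(config.elite_fraction * config.population_size)
```

With these values, elitism retains 6 of the 30 individuals in each generation.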
Table 5 summarizes the optimal decision rules generated by ADM-RCGA, DPSO, and BCGA for classifying instances in the WBCD dataset. The first rule derived using the proposed ADM-RCGA method was as follows: “IF (Uniformity of Cell Shape > 2.79 AND Bare Nuclei > 2.94), THEN diagnosis is malignant”. Upon iterating the mining process, a second rule was extracted as follows: “IF (Uniformity of Cell Size > 3.88 AND Bare Nuclei ≤ 2.36), THEN diagnosis is malignant”. Notably, the proposed system achieved a classification accuracy of 99.14% using only three features and two decision rules.
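Because the two extracted rules are explicit, they translate directly into a few lines of code. The sketch below encodes the two WBCD rules quoted above; the dictionary keys and the function name are assumptions, and records that trigger neither rule are treated as benign.

```python
def classify_wbcd(record):
    """Apply the two ordered malignant rules quoted above to one WBCD record."""
    rule_1 = (record["Uniformity of Cell Shape"] > 2.79
              and record["Bare Nuclei"] > 2.94)
    rule_2 = (record["Uniformity of Cell Size"] > 3.88
              and record["Bare Nuclei"] <= 2.36)
    return "malignant" if (rule_1 or rule_2) else "benign"


case_a = classify_wbcd({"Uniformity of Cell Shape": 5,
                        "Uniformity of Cell Size": 5,
                        "Bare Nuclei": 10})
case_b = classify_wbcd({"Uniformity of Cell Shape": 1,
                        "Uniformity of Cell Size": 8,
                        "Bare Nuclei": 1})
case_c = classify_wbcd({"Uniformity of Cell Shape": 1,
                        "Uniformity of Cell Size": 1,
                        "Bare Nuclei": 1})
```

Since every WBCD attribute is an integer from 1 to 10, Rule 1 is equivalent to “Uniformity of Cell Shape ≥ 3 and Bare Nuclei ≥ 3”, and Rule 2 to “Uniformity of Cell Size ≥ 4 and Bare Nuclei ≤ 2”.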
Table 5 also presents the comparative results of the DPSO and BCGA approaches. The ADM-RCGA outperformed both alternatives, yielding an accuracy improvement of 2.145% over BCGA and 0.43% over DPSO. Furthermore, as shown in Table 6, the proposed ADM-RCGA system attained a sensitivity of up to 100% and a specificity of up to 98.81%. While the Type II error rates were comparable between ADM-RCGA and DPSO, ADM-RCGA exhibited a slight advantage in reducing Type I errors. To further validate the effectiveness of the proposed approach, a comparison with previously published rule extraction methods [5,25,31,36,37,38,39,40] was conducted.
Table 7 demonstrates that the ADM-RCGA-based diagnostic system attains higher classification accuracy on the WBCD dataset than existing approaches, as evidenced by both the holdout and three-fold cross-validation methods.
4.3. Comparative Results on the CSBC Dataset
Table 8 displays the outcomes of implementing the proposed ADM-RCGA method on the CSBC dataset. The top 10 outcomes from a total of 50 experimental runs are reported to evaluate the model’s effectiveness. The training and testing accuracies were 0.83 and 0.70 in the first extracted rule, respectively. The second rule yielded improved accuracies of 0.85 for the training set and 0.90 for the testing set. The computation took an average of 2.59 s during training and 0.14 s during testing, reflecting the model’s efficiency in terms of runtime.
To strengthen the evidence supporting the reliability of the ADM-RCGA approach, a performance comparison was carried out against the BCGA model, previously introduced by Tseng and Laio [32], using the CSBC dataset. The comparative performance results are detailed in Table 9 and Table 10. Relative to the BCGA, the ADM-RCGA approach raised accuracy from 77% to as much as 90% (an enhancement of up to 13 percentage points), sensitivity from 88.24% by as much as 11 percentage points, and specificity from 33.33% by up to 33.34 percentage points. With the proposed ADM-RCGA approach, the Type II error rate decreased from 11.76% (BCGA) to 5.88%, and the Type I error rate decreased from 66.67% (BCGA) to 33.33%.
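The measures compared here are all derived from a confusion matrix. In the sketch below, the Type I error rate is taken as 1 − specificity and the Type II error rate as 1 − sensitivity, which matches the BCGA figures above (33.33% + 66.67% = 100% and 88.24% + 11.76% = 100%). The counts are illustrative and are not the confusion matrices behind Table 10; the function name is hypothetical.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and the two error rates from raw counts."""
    sensitivity = tp / (tp + fn)   # share of malignant cases detected
    specificity = tn / (tn + fp)   # share of benign cases correctly cleared
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "type_I_error": 1 - specificity,    # false positive rate
        "type_II_error": 1 - sensitivity,   # false negative rate
    }


metrics = diagnostic_metrics(tp=8, fp=1, tn=9, fn=2)  # illustrative counts only
```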
Table 11 lists the best decision rules derived from the ADM-RCGA and BCGA for the CSBC classification. In the proposed ADM-RCGA approach, classification rule 1 was “IF (HSV-1 is negative) THEN malignant”. After the proposed hierarchical rule-mining process was followed, classification rule 2 was “IF (HSV-1 is positive and HHV-8 is negative) THEN malignant”. Thus, this study used only two DNA viruses and two rules to raise the accuracy to as much as 90%.
Table 8, Table 9, Table 10 and Table 11 show that the ADM-RCGA diagnostic system we developed achieves better classification accuracy on the CSBC classification than BCGA [32].
Including the CSBC dataset in this study provides meaningful complementary value. First, it demonstrates the ADM-RCGA framework’s ability to handle binary nominal attributes, common in molecular and virology-based diagnostics. Second, the CSBC dataset originates from a Taiwanese patient population, offering a regional perspective that complements the U.S.-based WBCD dataset and supports the method’s applicability across diverse clinical settings. Third, the extracted rules from CSBC (e.g., HSV-1 and HHV-8 status) reflect biologically and clinically relevant patterns, showcasing the algorithm’s potential to uncover interpretable insights beyond conventional cytological features. By validating ADM-RCGA on both numerical and categorical datasets from different populations, this study highlights the model’s flexibility, robustness, and potential for broader clinical integration.
5. Discussion
To provide a fair comparison with state-of-the-art ensemble tree-based classifiers, Table 12 compares the testing results of CatBoost [41] and XGBoost [42] on the WBCD dataset with those of the proposed ADM-RCGA method. The results indicate that while XGBoost and CatBoost deliver excellent classification accuracy with significantly lower computational time, the ADM-RCGA method achieves the highest overall accuracy. More importantly, the ADM-RCGA generates explicit and interpretable decision rules tailored for clinical application, offering a key advantage in medical diagnostics where transparency and explainability are critical. Although the ADM-RCGA incurs a higher computational cost due to its evolutionary search and rule refinement mechanisms, this trade-off is justified by the enhanced clinical interpretability and diagnostic utility of the extracted rules, which are advantages not typically afforded by ensemble tree-based models.
As shown in
Figure 3 and
Figure 4, the ADM-RCGA achieved the highest classification accuracy (99.14%) among the three methods, while XGBoost and CatBoost both reached 98.57%. However, the ADM-RCGA requires more computational time because of its evolutionary optimization process. These visualizations demonstrate the trade-off between accuracy and computational efficiency and highlight the robustness and interpretability advantages of the ADM-RCGA in clinical applications.
In summary, the proposed ADM-RCGA framework offers several advantages over other classification methods, particularly in the context of medical data mining. The key advantages include the following:
- 1.
Rule Interpretability: Unlike black-box models such as neural networks or ensemble tree methods (e.g., XGBoost and CatBoost), the ADM-RCGA generates explicit if–then rules that are concise and clinically interpretable. This is crucial in healthcare, where transparency and explainability are essential for physician trust and decision making.
- 2.
Two-Level Rule-Mining Strategy: The hierarchical rule extraction process allows the ADM-RCGA to refine classification by focusing on misclassified instances, improving accuracy and coverage while minimizing rule conflict and redundancy.
- 3.
Adaptability to Mixed Data Types: The ADM-RCGA can handle both numerical and binary nominal attributes, which makes it suitable for diverse clinical datasets such as cytological measurements (WBCD) and virology-based features (CSBC).
- 4.
Competitive Accuracy: As shown in our comparative experiments (
Table 7 and
Table 12), the ADM-RCGA achieves accuracy higher than or comparable to that of state-of-the-art methods, including XGBoost and CatBoost, while offering rule transparency.
- 5.
Clinical Relevance: The extracted rules (e.g., “IF HSV-1 is negative THEN malignant”) align with known medical insights and can be directly used in diagnostic workflows, enhancing their practical utility.
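The two-level idea described in the list above can be illustrated with a minimal Python sketch. It is not the ADM-RCGA implementation: the attribute names `a` and `b`, the thresholds, the toy records, and the helper names `missed_malignant` and `accuracy` are all hypothetical, and the rules are written by hand rather than evolved. The sketch only shows how a second-level rule can target the malignant cases that a first-level rule misses.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Rule = Callable[[Record], bool]
Labelled = List[Tuple[Record, str]]


def missed_malignant(data: Labelled, rule: Rule) -> List[Record]:
    """Malignant records that the given rule fails to flag (its false negatives)."""
    return [x for x, label in data if label == "malignant" and not rule(x)]


def accuracy(data: Labelled, rules: List[Rule]) -> float:
    """Fraction of records classified correctly when any firing rule means malignant."""
    correct = 0
    for x, label in data:
        predicted = "malignant" if any(rule(x) for rule in rules) else "benign"
        correct += predicted == label
    return correct / len(data)


# Hypothetical toy data with two made-up attributes; not taken from WBCD or CSBC.
toy_data: Labelled = [
    ({"a": 5, "b": 1}, "malignant"),
    ({"a": 1, "b": 8}, "malignant"),
    ({"a": 1, "b": 1}, "benign"),
    ({"a": 2, "b": 2}, "benign"),
]


def level1(x: Record) -> bool:
    # Hand-written stand-in for a first-level rule.
    return x["a"] > 3


def level2(x: Record) -> bool:
    # Hand-written stand-in for a rule fitted to the cases level1 misses.
    return x["b"] > 5


residual = missed_malignant(toy_data, level1)       # cases left for the second level
acc_level1 = accuracy(toy_data, [level1])           # first-level rule alone
acc_both = accuracy(toy_data, [level1, level2])     # both levels combined
```

On this toy data the first-level rule leaves one malignant record uncovered, and adding the second-level rule classifies it correctly without flagging either benign record.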
6. Conclusions
The early identification of breast cancer is the most important strategy for improving the long-term survival prospects of affected individuals. In this investigation, we introduced a two-tiered approach, based on an enhanced RCGA model, for extracting malignant rules that uncover significant insights from comprehensive medical datasets. The derived rules were successfully applied to classify breast cancer risk in the WBCD and CSBC datasets. Our experimental results show that the ADM-RCGA method achieved its best classification accuracy of 99.14% on the WBCD dataset and 90.00% on the CSBC dataset. These outcomes are particularly encouraging when compared with previously documented rule-based classification techniques for breast cancer pattern discovery.
The results of our experiments indicate that the ADM-RCGA methodology proposed in this study can markedly aid physicians in reaching accurate diagnoses and has considerable potential for use in the clinical diagnosis of breast cancer. Furthermore, the two-level malignant-rule-mining process proved to be particularly insightful. For the WBCD dataset, the extracted rules were R1: “IF (Uniformity of Cell Shape > 2.79 AND Bare Nuclei > 2.94) THEN malignant” and R2: “IF (Uniformity of Cell Size > 3.88 AND Bare Nuclei ≤ 2.36) THEN malignant”. For the CSBC dataset, the derived rules were R1: “IF (HSV-1 is negative) THEN malignant” and R2: “IF (HSV-1 is positive AND HHV-8 is negative) THEN malignant”. These transparent and interpretable decision rules offer valuable guidance to physicians before a definitive diagnosis is made. The experimental outcomes also support the effectiveness of the proposed ADM-RCGA method in extracting pertinent malignant rules from training data containing both numerical and binary nominal attributes.
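To make the form of these rules concrete, the following Python sketch encodes the four rules quoted above as plain predicates. Two points are assumptions made for illustration and are not stated in the text: that a sample is labelled malignant when either rule of its dataset fires, and that it is treated as benign otherwise. The function names and dictionary keys are likewise hypothetical.

```python
from typing import Dict


def wbcd_is_malignant(sample: Dict[str, float]) -> bool:
    """Apply the two WBCD rules; assumes either rule firing means malignant."""
    r1 = sample["uniformity_of_cell_shape"] > 2.79 and sample["bare_nuclei"] > 2.94
    r2 = sample["uniformity_of_cell_size"] > 3.88 and sample["bare_nuclei"] <= 2.36
    return r1 or r2


def csbc_is_malignant(sample: Dict[str, bool]) -> bool:
    """Apply the two CSBC rules; True in the sample means the virus test is positive."""
    r1 = not sample["hsv1_positive"]
    r2 = sample["hsv1_positive"] and not sample["hhv8_positive"]
    return r1 or r2


# Hypothetical samples, chosen only to exercise each rule.
wbcd_example = {
    "uniformity_of_cell_shape": 5,
    "uniformity_of_cell_size": 1,
    "bare_nuclei": 7,
}
csbc_example = {"hsv1_positive": True, "hhv8_positive": True}

wbcd_flag = wbcd_is_malignant(wbcd_example)   # R1 fires for this sample
csbc_flag = csbc_is_malignant(csbc_example)   # neither rule fires for this sample
```

Under the stated assumption, the two CSBC rules together are logically equivalent to flagging a sample as malignant unless both HSV-1 and HHV-8 are positive.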
To evaluate the efficacy of the proposed ADM-RCGA method, several rule extraction techniques from prior research were implemented for performance comparison. The experimental outcomes reveal that the ADM-RCGA method outperforms all of these approaches. In future research, we plan to systematically employ stratified k-fold cross-validation together with oversampling and undersampling techniques to address small and imbalanced datasets such as CSBC. Furthermore, we will incorporate F1 score, precision, recall, and AUC as key evaluation metrics to ensure a comprehensive performance assessment. In addition, we plan to integrate rule-pruning mechanisms and conflict-detection algorithms to further improve the interpretability and consistency of the rule set.
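As a rough illustration of the planned evaluation protocol, the sketch below shows one way to build stratified folds and to compute precision, recall, and F1 score in plain Python. It is a simplified sketch under stated assumptions, not the procedure used in this study: the helper names are hypothetical, samples are dealt to folds in their original order without shuffling, and AUC is omitted because it requires ranked prediction scores rather than hard labels.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], k: int) -> List[List[int]]:
    """Assign sample indices to k folds so each fold keeps roughly the class ratio."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical imbalanced labels: 4 positive and 8 negative samples.
example_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
example_folds = stratified_folds(example_labels, k=2)
example_metrics = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

In practice, the indices within each class would normally be shuffled with a fixed random seed before being dealt to folds, and any oversampling would be applied to the training folds only so that the held-out fold remains untouched.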