A Two-Level Rule-Mining Approach to Classify Breast Cancer Patterns Using Adaptive Directed Mutation and Genetic Algorithm

Wu, Hui-Ching; Tseng, Ming-Hseng

doi:10.3390/eng6070154


by Hui-Ching Wu ¹ and Ming-Hseng Tseng ²,³,*

¹ Department of Medical Sociology and Social Work, Chung Shan Medical University, Taichung 402, Taiwan

² Department of Medical Informatics, Chung Shan Medical University, Taichung 402, Taiwan

³ Information Technology Office, Chung Shan Medical University Hospital, Taichung 402, Taiwan

* Author to whom correspondence should be addressed.

Eng 2025, 6(7), 154; https://doi.org/10.3390/eng6070154

Submission received: 19 May 2025 / Revised: 23 June 2025 / Accepted: 4 July 2025 / Published: 7 July 2025

(This article belongs to the Special Issue Advanced Artificial Intelligence Techniques for Disease Prediction, Diagnosis and Management)


Abstract

Breast cancer represents a significant public health concern in both Western countries and Asia. Accurate and early detection is critical to improving long-term patient survival. For physicians to understand the classification and decision rules and to evaluate their results, it is preferable to use white box approaches to develop prediction models. This paper proposes a novel classification technique for extracting malignant prediction rules from training datasets containing numerical and binary nominal attributes. The classification technique introduced in this study facilitates the discovery of breast cancer patterns by integrating a real-coded genetic algorithm, an adaptive directed mutation operator, and a two-level malignant-rule-mining process. The experimental results, compared with existing rule-based methods from previous studies, demonstrate that the proposed approach generates simple and interpretable decision rules and effectively identifies patterns that lead to accurate breast cancer classification.

Keywords: real-coded genetic algorithm; adaptive directed mutation; two-level malignant rules; breast cancer

1. Introduction

Breast cancer persists as a critical public health issue affecting women internationally, including populations in both Western nations and Asia. As reported by the GLOBOCAN project [1], breast cancer accounts for approximately 25.4% of all invasive malignancies diagnosed in women and represents 15% of the total cancer burden among the female population. Globally, breast cancer ranks first in incidence and fifth in mortality among women. Annual age-standardized rates indicate that the incidence and mortality of breast cancer are 47.0 and 11.3 per 100,000 women, respectively. In other words, an estimated 2,295,832 women are diagnosed with breast cancer, and 684,996 women die from the disease each year worldwide.

In Taiwan, breast cancer incidence showed a marked rise, increasing from 28.4 to 64.28 cases per 100,000 women between 1995 and 2011 [2]. During a similar period, from 1995 to 2013, the mortality rate also experienced an upward trend, climbing from 9.55 to 11.41 deaths per 100,000 women [2]. By 2013, breast cancer had become the most commonly diagnosed malignancy among women in Taiwan and was the fourth leading cause of cancer-related deaths across the entire population [3].

Conventional approaches to breast cancer diagnosis typically involve imaging techniques such as mammography, ultrasound, and MRI, followed by histopathological examination and biomarker analysis (e.g., ER, PR, and HER2 status). Clinical staging systems such as TNM and genomic profiling tools are often used to assess prognosis. While these methods are effective, they can be time-consuming, costly, and dependent on expert interpretation.

Recent progress in medical technologies has played a crucial role in reducing breast cancer mortality over the past decade. Enhanced diagnostic tools and therapeutic interventions have significantly improved patient outcomes. Current projections suggest that close to 97% of women diagnosed with breast cancer survive at least five years after diagnosis. Additionally, advancements in early detection methods have led to a 10% to 15% increase in five-year survival rates across various types of breast tumors [4]. As a result, the development of accurate and efficient predictive models for early breast cancer diagnosis remains a pressing and valuable objective [5].

The rapid evolution of medical sensor technologies and hospital information systems has fundamentally transformed the processes of acquiring, managing, and applying healthcare data. This transformation has led to explosive growth in electronic medical records, forming extensive healthcare databases. Exploiting the wealth of information contained within these vast datasets requires sophisticated analytical techniques capable of efficiently performing tasks such as data retrieval, querying, analysis, and knowledge discovery.

Data mining involves uncovering practical knowledge from large-scale datasets and is often associated with automated discovery techniques. It follows a systematic procedure designed to extract implicit patterns, associations, and trends—insights typically too subtle or complex to be revealed through manual analysis. This extracted information is then used to construct predictive models. Data mining is a relatively recent technological advancement emerging as a distinct discipline in 1994 [6]. Owing to its wide-ranging applicability across various domains, it has garnered significant attention from researchers and medical professionals [7].

Enhancing data mining and knowledge discovery techniques for efficiently extracting breast cancer patterns has emerged as a critical research focus. In this context, interpretable machine learning models, particularly rule-based systems, represent a promising addition to conventional diagnostic methods. They support automated classification while ensuring transparency, which remains vital for effective clinical decision making. This study proposes a novel classification technique for extracting malignant prediction rules from training datasets containing numerical and binary nominal attributes. The classification technique introduced in this study facilitates the discovery of breast cancer patterns by integrating a real-coded genetic algorithm, an adaptive directed mutation operator, and a two-level malignant-rule-mining process. To achieve this, two publicly available datasets from the United States and Taiwan are employed to derive predictive classification rules to improve diagnostic accuracy and facilitate more informed medical judgment.

2. Related Work

2.1. Genetic Algorithm Techniques

Evolutionary algorithms utilize randomly generated populations of candidate solutions to explore the search space. Through applying stochastic search strategies, they are widely acknowledged as practical tools for global optimization. Evolutionary algorithms demonstrate superior performance compared with traditional gradient-based approaches, particularly when addressing complex, high-dimensional optimization challenges. Three population-based heuristic methodologies constitute the domain of evolutionary algorithms, as follows: genetic algorithms (GAs), evolutionary programming, and evolutionary strategies. Genetic algorithms are the most extensively applied evolutionary algorithm [8].

GAs are probabilistic search techniques developed to locate optimal solutions in expansive and intricate search domains. Distinguished by their adaptive mechanisms, GAs offer several performance advantages compared with conventional search approaches [9,10]. Drawing inspiration from Darwinian evolutionary theory, notably the mechanisms of natural selection and the survival of the fittest [9,10], these algorithms simulate biological evolution. Individuals exhibiting superior fitness are more prone to passing on their genetic characteristics in this process. Rather than relying on a single initial estimate, GAs initiate the search process using a diverse population of candidate solutions spread across the problem space. The evolutionary process is driven by three fundamental genetic operators—selection, crossover, and mutation—which iteratively refine the population and steer the search toward a globally optimal solution [9,10].

Conventional implementations of GAs often represent decision variables using binary encoding, known as the binary-coded genetic algorithm (BCGA) [9,10]. This method effectively solves low- to medium-complexity problems, where high numerical precision is not a critical requirement. Nonetheless, BCGA encounters significant computational load and memory consumption challenges when applied to high-dimensional optimization tasks that demand finer resolution [11]. Real-coded genetic algorithms (RCGAs) have been introduced to mitigate these limitations, where decision variables are directly encoded as real numbers. Empirical studies have shown that RCGAs outperform their binary-coded counterparts, particularly when applied to medical data mining [12]. Alhijawi and Awajan comprehensively reviewed genetic algorithms, detailing their theoretical foundations, genetic operators, solution strategies, and diverse real-world applications. They highlighted recent advancements and integration with neural networks for optimization tasks [13]. Recent advancements have improved their efficiency through enhanced mutation techniques [14] and hybrid integration with machine learning [15] and deep learning [16]. Their adaptability and optimization potential make GAs an essential tool in computational research.

2.2. Data Mining Applications in Healthcare

Healthcare transactions generate vast quantities of highly complex data that exceed the capacity of traditional methods for effective processing and analysis. Data mining provides the requisite methodologies and technologies to convert these voluminous datasets into practical insights, thus supporting evidence-based decision making. Data mining has shown considerable promise in the healthcare industry, including assessing treatment effectiveness, healthcare administration, patient relations management, and identifying fraudulent activities and misuse [17]. As clinical data continue to grow at an exponential rate, clinical databases have become a key area where data mining is not only beneficial but also indispensable. As a result, data mining has become an increasingly vital tool in healthcare, playing a crucial role as an intelligent aid for diagnosis [18].

Data mining describes identifying novel patterns and trends in large datasets and harnessing these insights to create predictive models [19]. By examining vast amounts of data, data mining can uncover patterns that may be too intricate or subtle for human analysts to detect [20]. Data mining primarily aims to discern correlations and patterns within data that are valid, novel, potentially useful, and comprehensible [21]. In contrast to traditional statistical analysis and data visualization techniques, data mining actively enables the discovery of new insights through machine learning algorithms, a key component of the broader field of artificial intelligence [7]. Data mining methodologies are generally categorized according to their intended function, encompassing descriptive and visual analysis, relationship and grouping analysis, and predictive modeling tasks like classification and estimation. Ultimately, these functionalities contribute to the broader domain of predictive modeling. Among these, predictive modeling, particularly classification, is a highly prevalent task. Classification involves predicting a categorical target variable by assigning new instances to one of several predefined classes. As a core pattern recognition component, classification has been approached through various methodologies. In healthcare, classification analysis is critical in disease categorization within clinical databases.

Previous studies frequently employed logistic and multinomial regression techniques to develop classification models. However, real-world classification problems often exhibit high levels of nonlinearity and complexity [22]. Consequently, contemporary research has investigated diverse nonlinear and sophisticated machine learning methodologies for extracting breast cancer patterns. These encompass approaches such as decision trees, naïve Bayes classifiers, artificial neural networks, support vector machines, fuzzy logic systems, genetic algorithms, particle swarm optimization, and integrated techniques combining learning and data mining strategies. Comprehensive surveys of various classification methods applied to breast cancer datasets have been presented in prior studies [5,22,23,24]. Experimental findings from these studies indicate that machine learning techniques consistently outperform traditional statistical methods regarding predictive accuracy. Consequently, increasing attention has been directed toward classification risk modeling using soft computing techniques.

Although evolutionary computing can generate interpretable classification rules with superior performance, it typically demands greater computational resources than conventional black-box approaches [22,23]. Enhancing the efficiency of rule extraction systems by developing high-performance evolutionary algorithms has become a critical area of research for effective knowledge discovery from large datasets. This study proposes an Adaptive Directed Mutation Real-Coded Genetic Algorithm (ADM-RCGA) [25] as a data mining method for deriving diagnostic knowledge rules for breast tumors. The derived rules aim to enhance disease control and management.

GAs have recently been applied increasingly often to medical problems, providing efficient solutions to complex tasks. In classifying diabetes mellitus, GAs have assisted in diagnostic decision making, showcasing their ability to optimize complex data analysis [26]. They have also demonstrated their effectiveness in optimizing hospital patient scheduling, significantly improving service efficiency [27]. Moreover, GAs have been applied in medical image segmentation and classification, enhancing feature selection and accuracy [28,29]. These applications demonstrate the versatility of GAs in medical research, making them a valuable tool for optimization, diagnosis, and decision making.

3. Methodology

3.1. ADM-RCGA

The effectiveness of a GA largely depends on the design of efficient search operators, which determine how well the algorithm performs. A key challenge in GA performance is premature convergence. Addressing this challenge and avoiding entrapment in local optima requires the mutation operator, which is crucial for introducing novel solutions and maintaining the diversity of the population during the search. However, this benefit often comes at the cost of a slower convergence rate. Few studies have focused on designing novel mutation operators specifically for real-coded genetic algorithms (RCGAs) [25,30]. To address this gap, Tan and Tseng [25] proposed an adaptive directed mutation (ADM) operator, which has been effectively applied to 41 challenging function optimization problems.

The ADM operator was designed to incorporate random mutation while combining the effects of crossover and unsystematic search mechanisms to prevent premature convergence or concentration around specific chromosomes. This operator enhances solution diversity within the population and improves the likelihood of discovering optimal solutions. Previously obtained solutions guide the search for new candidate points and follow the adaptive direction of the gradient, hence the term “adaptive directed mutation”. The evolutionary trajectory of each individual’s fitness value tends to converge along the gradient direction. To describe this trend, we define the fitness differences of chromosome x across three successive generations, namely, (t − 2), (t − 1), and (t), as follows:

$$\Delta f(t) = f(x(t)) - f(x(t-1)) \tag{1}$$

$$\Delta f(t-1) = f(x(t-1)) - f(x(t-2)) \tag{2}$$

Here, $x = \{x_1, x_2, \ldots, x_k, \ldots, x_N\}$ denotes a chromosome, and $f(x(t))$ is its fitness at generation $t$. The gene-wise variation in the k-th dimension for chromosome $x$ over the three generations is given by the following:

$$\Delta x_k(t) = x_k(t) - x_k(t-1) \tag{3}$$

$$\Delta x_k(t-1) = x_k(t-1) - x_k(t-2) \tag{4}$$

By incorporating Equations (1) through (4), the next-generation value of gene $x_k$ is updated through the following expression:

$$x_k(t+1) = x_k(t) + g\left(\Delta f(t), \Delta f(t-1), \Delta x_k(t), \Delta x_k(t-1), x_k(t), x_k^{UB}, x_k^{LB}\right) \cdot p_m \tag{5}$$

where $x_k^{UB}$ and $x_k^{LB}$ refer to the upper and lower bounds of the gene value, respectively, and $p_m$ denotes the adaptive mutation probability.

In Equation (5), the function $g(\cdot)$ acts as an acceleration mechanism that regulates the directed mutation process. This study introduces four guided mutation strategies, applicable to any chromosome, each based on nine different evolutionary trends. These strategies are the following: (1) directional small-scale mutation, (2) random small-scale mutation, (3) random medium-scale mutation, and (4) random large-scale mutation. A comprehensive explanation of the ADM mechanism and its application can be found in Tan and Tseng [25].
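The differences in Equations (1) through (4) are straightforward to compute once the last three generations of a chromosome are stored. The sketch below is illustrative only. The function names and the toy fitness are hypothetical, and the sign-based trend label rests on the assumption that the nine evolutionary trends correspond to the 3 × 3 sign combinations of Δf(t) and Δf(t − 1). The actual acceleration function g(·) and the mapping from trends to the four mutation strategies are defined in Tan and Tseng [25] and are not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple


def sign(value: float) -> int:
    """Return -1, 0, or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def adm_differences(
    fitness: Callable[[Sequence[float]], float],
    x_t2: Sequence[float],
    x_t1: Sequence[float],
    x_t: Sequence[float],
) -> Tuple[float, float, List[float], List[float]]:
    """Compute Eqs. (1)-(4) for one chromosome over generations t-2, t-1, t."""
    df_t = fitness(x_t) - fitness(x_t1)           # Eq. (1)
    df_t1 = fitness(x_t1) - fitness(x_t2)         # Eq. (2)
    dx_t = [a - b for a, b in zip(x_t, x_t1)]     # Eq. (3), one entry per gene k
    dx_t1 = [a - b for a, b in zip(x_t1, x_t2)]   # Eq. (4), one entry per gene k
    return df_t, df_t1, dx_t, dx_t1


def trend(df_t: float, df_t1: float) -> Tuple[int, int]:
    """ASSUMPTION: label a trend by the signs of the two fitness differences."""
    return sign(df_t), sign(df_t1)


def toy_fitness(x: Sequence[float]) -> float:
    """Hypothetical fitness: negative squared distance from the origin."""
    return -sum(v * v for v in x)


df_t, df_t1, dx_t, dx_t1 = adm_differences(
    toy_fitness, [4.0, 3.0], [2.0, 2.0], [1.0, 2.0]
)
print(df_t, df_t1, dx_t, dx_t1, trend(df_t, df_t1))
```

In this toy run the fitness improved in both steps, so the trend label is (+1, +1); under the stated assumption, a directed operator would keep moving gene k in the direction given by the sign of Δx_k(t).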

The evolutionary process of the ADM-RCGA involves the following five key genetic operators: stochastic universal sampling (SUS) for selection, a variant of the blend crossover (BLX-α) combined with a crossover mask for recombination, an enhanced ADM operator for mutation, a multi-elite mechanism for elitism, and a replacement strategy. For a more in-depth explanation of these operators, refer to Tan and Tseng [25].
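For readers unfamiliar with two of the named operators, the following sketch shows textbook versions of stochastic universal sampling and BLX-α crossover. It is not the variant used in ADM-RCGA: the crossover mask, the multi-elite mechanism, and the replacement strategy described in [25] are omitted, and the function names, the bound clipping, and the parameter values are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def sus_select(fitnesses: Sequence[float], n: int, rng: random.Random) -> List[int]:
    """Stochastic universal sampling: place n equally spaced pointers on the
    cumulative fitness wheel and return the index hit by each pointer.
    Assumes non-negative fitness values with a positive sum."""
    total = sum(fitnesses)
    step = total / n
    start = rng.uniform(0.0, step)
    chosen: List[int] = []
    cumulative, i = fitnesses[0], 0
    for k in range(n):
        pointer = start + k * step
        # Advance along the wheel until the pointer falls inside slot i.
        while cumulative < pointer and i < len(fitnesses) - 1:
            i += 1
            cumulative += fitnesses[i]
        chosen.append(i)
    return chosen


def blx_alpha(
    p1: Sequence[float],
    p2: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    alpha: float,
    rng: random.Random,
) -> List[float]:
    """Textbook BLX-alpha: sample each child gene uniformly from the parents'
    interval widened by alpha times its length, then clip to the gene bounds."""
    child = []
    for a, b, lo, hi in zip(p1, p2, lower, upper):
        span = abs(a - b)
        gene = rng.uniform(min(a, b) - alpha * span, max(a, b) + alpha * span)
        child.append(min(max(gene, lo), hi))
    return child


rng = random.Random(0)
parents = sus_select([0.2, 0.5, 0.9, 0.4], n=4, rng=rng)
child = blx_alpha([1.0, 5.0], [3.0, 7.0], [0.0, 0.0], [10.0, 10.0], alpha=0.5, rng=rng)
print(parents, child)
```

Because the pointers are equally spaced, SUS selects each individual a number of times close to its expected share, which gives lower sampling variance than repeated roulette-wheel draws.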

The ADM operator leverages local directional search strategies with adaptive random search mechanisms to substantially improve the performance of real-coded genetic algorithms (RCGAs) in achieving global optima and expediting convergence. Prior empirical findings have established that the ADM-RCGA [25] is fast, accurate, and reliable, outperforming six state-of-the-art evolutionary algorithms. Within this study, the ADM-RCGA is applied to the problem of rule extraction for knowledge discovery.

3.2. Discovery of Knowledge Rules Using ADM-RCGA

Data mining techniques typically represent the discovered knowledge as if–then prediction rules. This approach provides a high-level, symbolic representation of knowledge, making interpreting and understanding the extracted information easier [5,22,31,32,33]. This study introduces the ADM-RCGA as the primary data mining method for efficiently and swiftly generating knowledge rules for breast cancer classification.

3.2.1. Two-Level Malignant-Rule Mining Approach

In general, the mechanism of extracting multiple rules is commonly employed in rule-based mining systems [5,22,31] to enhance prediction accuracy. Within the rule-mining procedure, after a rule has been extracted, the data points still incorrectly classified by the preceding rule(s) will be used to guide the extraction of the following rule. This iterative process continues until all data are successfully classified [5,31]. However, relying on instances of false positives to mine subsequent rules may increase the risk of rule conflicts and negatively impact the overall performance of the classification system.

In medical applications, extracting applicable “malignant” rules that distinguish malignant from “benign” cases is essential. A two-level hierarchical malignant-rule-mining approach is proposed to enhance the overall accuracy of the rule-mining procedure. The two-level malignant-rule framework is intuitive to understand; as shown in Figure 1, it can be used to mine two ordered malignant rules for predicting breast cancer diagnosis. Figure 2 illustrates the flow diagram of the proposed ADM-RCGA-based malignant-rule-mining approach. The critical processes for extracting malignant rules from the training dataset are rule representation, semi-automatic feature selection, and fitness evaluation of rules, which are explained in the following subsections. In our method for rule mining, after a malignant rule has been extracted, only the data that were not classified by the preceding malignant rule(s) are used to derive the subsequent malignant rule.
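The data flow of the two-level procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `mine_rule` is a hypothetical placeholder for the ADM-RCGA search, a rule is modeled as a function that returns True when it fires, and cases on which no rule fires are assumed to be labeled benign.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, float]
Rule = Callable[[Case], bool]  # returns True when the rule fires (malignant)


def two_level_mining(
    cases: List[Case],
    mine_rule: Callable[[List[Case]], Rule],
) -> Tuple[Rule, Rule]:
    """Mine rule 1 on all training cases, then mine rule 2 only on the cases
    that rule 1 did not fire on (the cases it left unclassified)."""
    rule1 = mine_rule(cases)
    remaining = [c for c in cases if not rule1(c)]
    rule2 = mine_rule(remaining)
    return rule1, rule2


def predict(case: Case, rules: Tuple[Rule, ...]) -> str:
    """Apply the ordered malignant rules; ASSUMPTION: benign if none fires."""
    for rule in rules:
        if rule(case):
            return "malignant"
    return "benign"


# Hypothetical stand-in for the GA search: returns two fixed rules in order
# and records how many cases it was given at each level.
fixed_rules = iter([lambda c: c["a"] > 3, lambda c: c["b"] > 5])
seen_sizes: List[int] = []


def stub_miner(data: List[Case]) -> Rule:
    seen_sizes.append(len(data))
    return next(fixed_rules)


train = [{"a": 4, "b": 1}, {"a": 1, "b": 6}, {"a": 1, "b": 1}]
rules = two_level_mining(train, stub_miner)
predictions = [predict(c, rules) for c in train]
print(seen_sizes, predictions)
```

The second call to the miner sees only the two cases that the first rule left unclassified, which is the property that distinguishes this scheme from mining the next rule on misclassified cases.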

3.2.2. Rules Representation and Semi-Automatic Feature Selection

Genetic representation encompasses a set of parameters tailored to solve particular problems. Various genetic representations provide distinct approaches to these challenges. If the user defines m as the number of selected feature variables, the rule structure will incorporate m variables. The general structure of a rule produced by ADM-RCGA is as follows:

IF [(V1 >/≦ C1) AND (V2 >/≦ C2) AND … AND (Vm >/≦ Cm)] THEN diagnosis is malignant

(6)

If all m conditions are met, the model will categorize the breast cancer type as malignant. Here, V1 to Vm represent the features, the symbol “>” denotes the greater-than operator, the symbol “≦” indicates the less-than-or-equal-to operator, and C1 to Cm represent the corresponding threshold values. The feature variables V1 to Vm, the operator “>” or “≦”, and the cut-off values C1 to Cm are determined through the ADM-RCGA search process. Fully automatic feature selection requires a large amount of CPU time in all evolutionary computing (EC)-based mining systems [22]. In the proposed approach, the user instead applies a semi-automatic feature selection mechanism by setting the number of significant features in one rule, as shown in Table 1. This compromise significantly increases the overall computational efficiency of the rule-mining procedure.
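The rule structure in Equation (6) maps naturally onto a list of (feature, operator, cut-off) triples. The sketch below shows one possible encoding and its evaluation; the class name, the field names, the ASCII spelling "<=" for “≦”, and the example cut-offs are all hypothetical choices made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# "<=" is used here as the ASCII spelling of the "≦" operator in Eq. (6).
OPS = {">": operator.gt, "<=": operator.le}


@dataclass(frozen=True)
class Condition:
    feature: str   # V_i, the selected feature
    op: str        # ">" or "<="
    cutoff: float  # C_i, the threshold found by the search


def fires(conditions: List[Condition], case: Dict[str, float]) -> bool:
    """Return True when all m conditions hold, i.e., the case is malignant."""
    return all(OPS[c.op](case[c.feature], c.cutoff) for c in conditions)


# Hypothetical rule with m = 2 conditions.
rule = [Condition("V1", ">", 2.5), Condition("V2", "<=", 4.0)]
first = fires(rule, {"V1": 3, "V2": 4})   # both conditions hold
second = fires(rule, {"V1": 3, "V2": 5})  # the second condition fails
print(first, second)
```

With m fixed by the user, a chromosome only needs to encode m feature indices, m operator choices, and m cut-offs, which is what keeps the search space small.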

To our knowledge, an RCGA can effectively handle numeric attributes but not nominal ones. This research adapts the proposed ADM-RCGA rule-mining framework to accommodate binary nominal attributes by interpreting the real-valued thresholds as follows. For example, both HSV-1 and HHV-8 are binary attributes (0: negative/1: positive); a rule may be in the form of “IF HSV-1 > 0.6 and HHV-8 ≦ 0.3, THEN diagnosis is breast cancer”. Since the range “>0.6” contains only the value 1 and the range “≦0.3” contains only the value 0, the final, concise rule becomes “IF HSV-1 = 1 (positive) and HHV-8 = 0 (negative), THEN diagnosis is breast cancer”.
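The translation from a real-valued threshold to a concise binary condition can be sketched by checking which of the two admissible values, 0 and 1, satisfy the threshold. The function name and the handling of thresholds that admit both values or neither are assumptions; the source only discusses the two cases in the example.

```python
def decode_binary_condition(op: str, cutoff: float) -> str:
    """Translate a real-valued threshold on a 0/1 attribute into a concise
    condition by listing which of the two values satisfy it."""
    if op == ">":
        satisfied = [v for v in (0, 1) if v > cutoff]
    elif op == "<=":
        satisfied = [v for v in (0, 1) if v <= cutoff]
    else:
        raise ValueError(f"unsupported operator: {op}")
    if satisfied == [1]:
        return "= 1 (positive)"
    if satisfied == [0]:
        return "= 0 (negative)"
    # ASSUMPTION: thresholds that admit both values or neither are reported
    # as uninformative rather than mapped to a value.
    return "always true" if satisfied else "always false"


print("HSV-1", decode_binary_condition(">", 0.6))
print("HHV-8", decode_binary_condition("<=", 0.3))
```

Any cut-off in [0, 1) combined with “>” yields the same concise condition, so many distinct chromosomes decode to the same readable rule.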

3.2.3. Fitness Evaluation of Rules

A fitness function serves to quantify the merit of individual parameter sets within a given population, thereby guiding genetic algorithms (GAs) in discerning which sets exhibit a higher likelihood of persistence and refinement across generations. Carefully selecting a suitable fitness function is pivotal in practical optimization endeavors. The current research employs the GA methodology to pinpoint the rule that demonstrates superior performance from the existing pool of candidate rules. To concurrently assess the precision and scope of every rule, the rule-mining strategy presented herein utilizes a multi-component fitness measure, formulated as outlined below:

$$\text{Fitness} = \frac{\text{Number of cases fired}}{\text{Number of all cases}} \times \frac{\text{Number of accurately classified cases}}{\text{Number of cases fired}} = \frac{\text{Number of accurately classified cases}}{\text{Number of all cases}} \tag{7}$$

$$\text{Objective: Maximize} \left( \frac{\text{Number of accurately classified cases}}{\text{Number of all cases}} \right) \tag{8}$$

where $\frac{\text{Number of cases fired}}{\text{Number of all cases}}$ and $\frac{\text{Number of accurately classified cases}}{\text{Number of cases fired}}$ are taken to measure the coverage and accuracy of a rule, respectively [32,33]. The fitness function aims to optimize both predictive precision and the extent of data encompassed, acting as a robust selection criterion to identify rules that attain the highest hit rate across the complete dataset.
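Equation (7) can be checked numerically with a short sketch. It assumes one particular reading of the terms: a case is “fired” when the rule's conditions hold, and a fired case is “accurately classified” when its true label is malignant. The function name and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, float]


def rule_fitness(
    rule: Callable[[Case], bool],
    cases: List[Case],
    labels: List[str],
) -> Tuple[float, float, float]:
    """Return (coverage, accuracy, fitness) following Eq. (7).
    ASSUMPTION: a fired case counts as accurately classified when its
    true label is 'malignant'."""
    fired_labels = [lab for c, lab in zip(cases, labels) if rule(c)]
    n_all = len(cases)
    n_fired = len(fired_labels)
    n_correct = sum(1 for lab in fired_labels if lab == "malignant")
    coverage = n_fired / n_all
    accuracy = n_correct / n_fired if n_fired else 0.0
    return coverage, accuracy, coverage * accuracy


# Toy data (hypothetical): the rule fires on the first two cases only.
cases = [{"x": 5}, {"x": 4}, {"x": 1}, {"x": 0}]
labels = ["malignant", "benign", "malignant", "benign"]
coverage, accuracy, fitness = rule_fitness(lambda c: c["x"] > 3, cases, labels)
print(coverage, accuracy, fitness)
```

Here the rule fires on 2 of 4 cases and 1 of those 2 is malignant, so the product of coverage and accuracy equals 1/4, the ratio of accurately classified cases to all cases, as the right-hand side of Equation (7) states.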

4. Experimental Results

This investigation employed two distinct medical datasets to assess the efficacy of the proposed ADM-RCGA framework in discerning patterns within breast cancer. Furthermore, the developed model underwent a comparative analysis against existing models from prior research to evaluate its relative performance. Another aim was to ascertain the model’s resilience across varied medical datasets, substantiating its potential for broader applicability. All experimentation was conducted on a system featuring an Intel Pentium E2180 2 GHz central processing unit, complemented by 2 GB of random-access memory, operating under the Windows XP environment.

4.1. Breast Cancer Datasets

The two breast cancer diagnosis datasets originate from American and Taiwanese populations. Data for the Wisconsin Breast Cancer Database (WBCD) were gathered at the University of Wisconsin [34] and contributed to the UCI Machine Learning Repository in 1991. The Chung Shan Breast Cancer (CSBC) dataset was compiled at Chung Shan Medical University and published in 2007 [35] and 2009 [33]. The classification task in both datasets involved distinguishing between benign and malignant breast tumors based on various physical or genetic attributes of cells or DNA. A brief description of these datasets’ characteristics is provided below.

4.1.1. The Wisconsin Breast Cancer Database (WBCD)

The WBCD dataset, a compilation of 699 cases derived from fine needle aspirates of human breast tissue, as detailed in Table 2, encompasses nine numerical characteristics per instance (excluding the sample identification number). The dataset evaluates several diagnostic features, such as clump thickness, cell size and shape consistency, marginal adhesion, size of individual epithelial cells, presence of bare nuclei, chromatin texture, visibility of nucleoli, and mitotic activity. Each characteristic is assessed using an integer scale ranging from 1 to 10, where lower scores (close to 1) typically indicate benign features, and higher scores (close to 10) reflect more severe or malignant traits. Sixteen instances in the dataset contain missing values. Following the approach proposed by Yeh et al. [5], these missing values were imputed using the most frequently occurring value for each respective attribute. Each instance is labeled as benign or malignant, with the dataset comprising 458 benign cases (65.52%) and 241 malignant cases (34.48%). Table 2 also presents descriptive statistics and the results of t-tests for the nine attributes, illustrating their significance in distinguishing between benign and malignant tumors. All nine features demonstrate statistical significance at the 95% confidence level, confirming their relevance for breast cancer classification in the WBCD dataset.
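Imputation with the most frequent value can be sketched per attribute as follows. The function name is hypothetical, missing entries are assumed to have been parsed as `None` beforehand, and the example column is made up rather than taken from the WBCD file.

```python
from collections import Counter
from typing import List, Optional


def impute_mode(column: List[Optional[int]]) -> List[int]:
    """Replace missing entries (None) with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    most_common_value = Counter(observed).most_common(1)[0][0]
    return [most_common_value if v is None else v for v in column]


# Hypothetical attribute column on the 1-10 scale with two missing entries.
column = [1, None, 1, 10, None, 3]
filled = impute_mode(column)
print(filled)
```

Mode imputation keeps every value on the original integer scale, which matters here because the mined cut-offs are compared directly against these scores.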

4.1.2. The Chung–Shan Breast Cancer (CSBC) Database

The CSBC dataset, a collection of 80 cases detailed in Table 3, was compiled using polymerase chain reaction (PCR) and Southern blotting techniques to analyze human breast tissue specimens for β-globin. Each case includes five nominal attributes representing the presence of the following DNA viruses: HSV-1, EBV, CMV, HPV, and HHV-8. The values for these attributes are categorical, denoting either a positive (detected) or negative (not detected) result for each virus. All 80 specimens underwent β-globin screening to ascertain the presence of DNA viruses. Each sample is further categorized into one of the following two classes: fibroadenoma (benign) or breast cancer (malignant). Within this dataset, 52 (65%) of the tissue samples were classified as non-familial invasive ductal breast cancer, while 28 (35%) were identified as mammary fibroadenomas.

Table 3 presents the descriptive statistics for each class within the CSBC dataset. Furthermore, to highlight the significance of the five DNA viruses, Table 3 also includes the p-values derived from a chi-squared (χ²) test. The statistical results presented in Table 3 clearly indicate that only HSV-1, CMV, and HHV-8 exhibit significant association, exceeding a 95% confidence level, in distinguishing between fibroadenoma (benign) and breast cancer (malignant) cases in the CSBC dataset.

The acquired samples were divided into the following two distinct subsets: a training set and a testing set. The training set served as the basis for developing classification rules, whereas the testing set, maintained independently of the rule induction process, was employed to evaluate the efficacy of the ADM-RCGA-based classifier. To facilitate a direct comparison with prior research [5,33], this study adhered to the same simulation parameters. Specifically, for the WBCD dataset, 466 instances (66.7% of the total) were assigned to the training set, and the remaining 233 instances (33.3%) were allocated to the testing set. In the case of the CSBC dataset, 64 instances (80%) were designated for training, and 16 instances (20%) were reserved for testing purposes.
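The split sizes above can be reproduced with a simple holdout routine. The sketch assumes a seeded random shuffle before splitting; the source does not state how instances were assigned to the two sets, and the function name and seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    items: Sequence[T], n_train: int, seed: int
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the items with a fixed seed, then cut it into a
    training part of n_train items and a testing part with the rest.
    ASSUMPTION: random assignment; the source does not specify the method."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


wbcd_train, wbcd_test = holdout_split(range(699), 466, seed=0)
csbc_train, csbc_test = holdout_split(range(80), 64, seed=0)
print(len(wbcd_train), len(wbcd_test), round(466 / 699, 3))
print(len(csbc_train), len(csbc_test), 64 / 80)
```

Keeping the testing indices disjoint from the training indices is what makes the reported testing accuracies an estimate of performance on unseen cases.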

4.2. Comparative Results on the WBCD Dataset

Table 4 summarizes the experimental outcomes obtained by applying the ADM-RCGA to the WBCD dataset. To assess the proposed method’s effectiveness, the top 10 results from 50 independent executions are presented. The experimental setup maintained consistent configurations throughout, with the evolutionary process initialized using 30 individuals per population and capped at 100 generations. The algorithm employed a crossover rate of 60%, a mutation rate of 10%, and retained 20% of the top-performing individuals through elitism. The evolutionary algorithm also terminated when the best fitness value had not changed for 20 consecutive generations. In Rule 1, the highest classification accuracy achieved was 0.9292 for the training set and 0.9657 for the testing set. The corresponding average accuracies were 0.9272 and 0.9554, respectively. When misclassified instances remained, a new rule was generated using the hierarchical rule-mining strategy. For Rule 2, the best accuracy improved to 0.9506 and 0.9914 on the training and testing sets, respectively, with average accuracies of 0.9500 and 0.9867. The average CPU execution times were 33.64 s for training and 2.25 s for testing. These results demonstrate that the proposed ADM-RCGA framework offers a fast and accurate breast cancer pattern mining approach.
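The reported settings can be collected in a small configuration object together with the two stopping conditions. This is a sketch with hypothetical names; it assumes the run stops either at the generation cap or once the best fitness has stayed unchanged for 20 consecutive generations, and that "unchanged" means exact equality.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GAConfig:
    population_size: int = 30     # individuals per population
    max_generations: int = 100    # generation cap
    crossover_rate: float = 0.60
    mutation_rate: float = 0.10
    elite_fraction: float = 0.20  # share of top individuals retained
    stall_generations: int = 20   # generations without change before stopping


def should_stop(best_history: List[float], cfg: GAConfig) -> bool:
    """best_history[i] is the best fitness after generation i + 1."""
    if len(best_history) >= cfg.max_generations:
        return True
    if len(best_history) <= cfg.stall_generations:
        return False
    # 20 unchanged generations means the last 21 recorded values are equal.
    recent = best_history[-(cfg.stall_generations + 1):]
    return all(v == recent[0] for v in recent)


cfg = GAConfig()
elite_count = round(cfg.elite_fraction * cfg.population_size)
print(elite_count, should_stop([0.9] * 21, cfg), should_stop([0.9] * 20, cfg))
```

With these settings, 6 of the 30 individuals would be carried over by elitism in each generation.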

Table 5 summarizes the optimal decision rules generated by ADM-RCGA, DPSO, and BCGA for classifying instances in the WBCD dataset. The first rule derived using the proposed ADM-RCGA method was as follows: “IF (Uniformity of Cell Shape > 2.79 AND Bare Nuclei > 2.94), THEN diagnosis is malignant”. Upon iterating the mining process, a second rule was extracted as follows: “IF (Uniformity of Cell Size > 3.88 AND Bare Nuclei ≤ 2.36), THEN diagnosis is malignant”. Notably, the proposed system achieved a classification accuracy of 99.14% using only three features and two decision rules. Table 5 also presents the comparative results of the DPSO and BCGA approaches. The ADM-RCGA outperformed both alternatives, yielding an accuracy improvement of 2.145% over BCGA and 0.43% over DPSO. Furthermore, as shown in Table 6, the proposed ADM-RCGA system attained a sensitivity of up to 100% and a specificity of up to 98.81%. While the Type II error rates were comparable between ADM-RCGA and DPSO, ADM-RCGA exhibited a slight advantage in reducing Type I errors. To further validate the effectiveness of the proposed approach, a comparison with previously published rule extraction methods [5,25,31,36,37,38,39,40] was conducted. Table 7 demonstrates that the ADM-RCGA-based diagnostic system attains higher classification accuracy on the WBCD dataset than existing approaches, as evidenced by both the holdout and three-fold cross-validation methods.

4.3. Comparative Results on the CSBC Dataset

Table 8 displays the outcomes of implementing the proposed ADM-RCGA method on the CSBC dataset. The top 10 outcomes from a total of 50 experimental runs are reported to evaluate the model’s effectiveness. For the first extracted rule, the training and testing accuracies were 0.83 and 0.70, respectively. The second rule yielded improved accuracies of 0.85 for the training set and 0.90 for the testing set. As listed in Table 8, the computation took an average of 2.59 s for mining Rule 1 and 0.14 s for mining Rule 2, reflecting the model’s efficiency in terms of runtime.

To strengthen the evidence supporting the reliability of the ADM-RCGA approach, its performance on the CSBC dataset was compared with that of the BCGA model previously introduced by Tseng and Liao [33]. The comparative results are detailed in Table 9 and Table 10. The ADM-RCGA approach raised the testing accuracy from 77% (the BCGA average) to 90%, an improvement of 13 percentage points. According to Table 10, sensitivity rose from 88.24% (BCGA) to 94.12%, a gain of 5.88 percentage points, and specificity rose from 33.33% (BCGA) to 66.67%, a gain of 33.34 percentage points. Correspondingly, the Type II error rate fell from 11.76% (BCGA) to 5.88%, and the Type I error rate fell from 66.67% (BCGA) to 33.33%.
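The sensitivity, specificity, and error rates in this paragraph follow directly from the confusion-matrix counts in Table 10. The sketch below recomputes them. It assumes that "with breast cancer" is the positive class, so a Type I error is a benign case classified as malignant and a Type II error is a malignant case classified as benign, which is consistent with the rates quoted in the text. The function and variable names are illustrative.

```python
def diagnostic_metrics(tn, fp, fn, tp):
    """Return accuracy, sensitivity, specificity, and error rates in percent."""
    total = tn + fp + fn + tp
    return {
        "accuracy": 100 * (tp + tn) / total,
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "type_i_error": 100 * fp / (tn + fp),   # benign classified as malignant
        "type_ii_error": 100 * fn / (tp + fn),  # malignant classified as benign
    }


# Counts read from Table 10 (3 benign and 17 malignant test cases).
adm_rcga = diagnostic_metrics(tn=2, fp=1, fn=1, tp=16)
bcga = diagnostic_metrics(tn=1, fp=2, fn=2, tp=15)

for name, metrics in (("ADM-RCGA", adm_rcga), ("BCGA", bcga)):
    summary = ", ".join(f"{key}={value:.2f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

The BCGA accuracy computed from these counts (80%) corresponds to the best accuracy listed in Table 9, whereas the 77% quoted in the text is the reported average.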

Table 11 lists the best decision rules derived from the ADM-RCGA and BCGA for the CSBC classification. In the proposed ADM-RCGA approach, classification rule 1 was “IF (HSV-1 is negative) THEN malignant”. After the proposed hierarchical rule-mining process was applied, classification rule 2 was “IF (HSV-1 is positive and HHV-8 is negative) THEN malignant”. Thus, the proposed approach required only two DNA viruses and two rules to raise the testing accuracy to 90%. Table 8, Table 9, Table 10, and Table 11 show that the ADM-RCGA diagnostic system we developed achieves better classification accuracy on the CSBC classification than BCGA [33].
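For binary nominal attributes, the two CSBC rules reduce to simple Boolean tests. The sketch below enumerates all four combinations of HSV-1 and HHV-8 status. It assumes, as an illustration only, that a case is labeled malignant when either rule fires and benign otherwise; the function name and the Boolean encoding (True for positive) are hypothetical.

```python
from itertools import product


def classify_csbc(hsv1_positive, hhv8_positive):
    """Apply the two CSBC rules; malignant if either rule fires (assumed)."""
    rule_1 = not hsv1_positive                    # HSV-1 is negative
    rule_2 = hsv1_positive and not hhv8_positive  # HSV-1 positive, HHV-8 negative
    return "malignant" if (rule_1 or rule_2) else "benign"


# Enumerate every combination of the two binary attributes.
for hsv1, hhv8 in product([False, True], repeat=2):
    print(f"HSV-1 positive={hsv1}, HHV-8 positive={hhv8}: "
          f"{classify_csbc(hsv1, hhv8)}")
```

Under this reading, the two rules together are logically equivalent to labeling a case benign only when both HSV-1 and HHV-8 are positive.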

Including the CSBC dataset in this study provides meaningful complementary value. First, it demonstrates the ADM-RCGA framework’s ability to handle binary nominal attributes, common in molecular and virology-based diagnostics. Second, the CSBC dataset originates from a Taiwanese patient population, offering a regional perspective that complements the U.S.-based WBCD dataset and supports the method’s applicability across diverse clinical settings. Third, the extracted rules from CSBC (e.g., HSV-1 and HHV-8 status) reflect biologically and clinically relevant patterns, showcasing the algorithm’s potential to uncover interpretable insights beyond conventional cytological features. By validating ADM-RCGA on both numerical and categorical datasets from different populations, this study highlights the model’s flexibility, robustness, and potential for broader clinical integration.

5. Discussion

To allow a fair comparison with state-of-the-art ensemble tree-based classifiers, Table 12 compares the testing results of CatBoost [41] and XGBoost [42] on the WBCD dataset with those of the proposed ADM-RCGA method. The results indicate that while XGBoost and CatBoost deliver excellent classification accuracy with significantly lower computational time, the ADM-RCGA method achieves the highest overall accuracy. More importantly, the ADM-RCGA generates explicit and interpretable decision rules tailored for clinical application, offering a key advantage in medical diagnostics where transparency and explainability are critical. Although the ADM-RCGA incurs a higher computational cost due to its evolutionary search and rule refinement mechanisms, this trade-off is justified by the enhanced clinical interpretability and diagnostic utility of the extracted rules, which are advantages not typically afforded by ensemble tree-based models.

As shown in Figure 3 and Figure 4, the ADM-RCGA achieved the highest classification accuracy (99.14%) among the three methods, while XGBoost and CatBoost both reached 98.57%. However, the ADM-RCGA requires more computational time due to its evolutionary optimization process. These visualizations clearly demonstrate the trade-off between accuracy and computational efficiency, as well as highlight the robustness and interpretability advantages of the ADM-RCGA in clinical applications.

In summary, the proposed ADM-RCGA framework offers several comparative advantages over other classification methods, particularly in the context of medical data mining. The key advantages of ADM-RCGA include the following:

1. Rule Interpretability: Unlike black-box models such as neural networks or ensemble tree methods (e.g., XGBoost and CatBoost), the ADM-RCGA generates explicit if–then rules that are concise and clinically interpretable. This is crucial in healthcare, where transparency and explainability are essential for physician trust and decision making.
2. Two-Level Rule-Mining Strategy: The hierarchical rule extraction process allows the ADM-RCGA to refine classification by focusing on misclassified instances, improving accuracy and coverage while minimizing rule conflict and redundancy.
3. Adaptability to Mixed Data Types: The ADM-RCGA can handle both numerical and binary nominal attributes, which makes it suitable for diverse clinical datasets, such as cytological measurements (WBCD) and virology-based features (CSBC).
4. Competitive Accuracy: As shown in our comparative experiments (Table 7 and Table 12), the ADM-RCGA achieves accuracy higher than or comparable to that of state-of-the-art methods, including XGBoost and CatBoost, while offering rule transparency.
5. Clinical Relevance: The extracted rules (e.g., “IF HSV-1 is negative THEN malignant”) align with known medical insights and can be directly used in diagnostic workflows, enhancing their practical utility.

6. Conclusions

The early identification of breast cancer represents the paramount strategy for improving the long-term survival prospects of affected individuals. In this investigation, we introduced a two-tiered approach for extracting malignant rules to uncover significant insights from comprehensive medical datasets, employing an enhanced RCGA model. The derived rules were successfully implemented to classify breast cancer risk within the WBCD and CSBC datasets. Our experimental results suggest that the ADM-RCGA method reached peak testing accuracies of 99.14% on the WBCD dataset and 90.00% on the CSBC dataset. These outcomes are particularly encouraging when juxtaposed with previously documented rule-based classification techniques for breast cancer pattern discovery.

The results of our experiments robustly indicate that the ADM-RCGA methodology suggested in this study can markedly aid physicians in attaining accurate diagnoses and demonstrate considerable potential for its role in the clinical diagnosis of breast cancer. Furthermore, the two-level malignant-rule-mining process proved to be particularly insightful. For the WBCD dataset, the extracted rules included: R1: “IF (Uniformity of Cell Shape > 2.79 AND Bare Nuclei > 2.94) THEN malignant” and R2: “IF (Uniformity of Cell Size > 3.88 AND Bare Nuclei ≤ 2.36) THEN malignant”. In the case of the CSBC dataset, the derived rules were: R1: “IF (HSV-1 is negative) THEN malignant” and R2: “IF (HSV-1 is positive AND HHV-8 is negative) THEN malignant”. These transparent and interpretable decision rules offer valuable guidance to physicians before a definitive diagnosis is made. The experimental outcomes also support the effectiveness of the proposed ADM-RCGA method in extracting pertinent malignant rules from the training dataset, utilizing both numerical and binary nominal attributes.

To evaluate the efficacy of the proposed ADM-RCGA method, several rule extraction techniques from prior research were implemented for performance comparison. The experimental outcomes reveal that the ADM-RCGA method significantly outperforms all the previously employed approaches. In future research, we plan to systematically employ stratified k-fold cross-validation and oversampling and undersampling techniques to address small and imbalanced datasets such as CSBC. Furthermore, we will incorporate F1 score, precision, recall, and AUC as key evaluation metrics to ensure a comprehensive performance assessment. In addition, we plan to integrate rule-pruning mechanisms and conflict-detection algorithms to improve the interpretability and consistency of the rule set further.

Author Contributions

Conceptualization, M.-H.T. and H.-C.W. designed the study. M.-H.T. analyzed and interpreted the data. All authors prepared the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, Taiwan, grant number: NSTC 113-2121-M-040-002.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The WBCD data that support the findings of this study are available in the UCI Archive at https://archive.ics.uci.edu/dataset/15/breast+cancer+Wisconsin+original (accessed on 1 January 2022), reference number [34]. The CSBC data supporting this study’s findings are available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank Ping-Hung Tang for his support in program development.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bray, F.; Laversanne, M.; Sung, H.; Ferlay, J.; Siegel, R.L.; Soerjomataram, I.; Jemal, A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA A Cancer J. Clin. 2024, 74, 229–263.
2. Taiwan Cancer Registry. The Taiwan Cancer Registry. Cancer Statistics. Available online: https://twcr.tw/?page_id=1855&lang=en (accessed on 1 December 2024).
3. Health Promotion Administration, Ministry of Health and Welfare. Available online: http://health99.doh.gov.tw/Hot_News/h_NewsDetailN.aspx?TopIcNo=5585 (accessed on 3 July 2025).
4. Delen, D.; Walker, G.; Kadam, A. Predicting breast cancer survivability: A comparison of three data mining methods. Artif. Intell. Med. 2005, 34, 113–127.
5. Yeh, W.-C.; Chang, W.-W.; Chung, Y.Y. A new hybrid approach for mining breast cancer pattern using discrete particle swarm optimization and statistical method. Expert Syst. Appl. 2009, 36, 8204–8211.
6. Trybula, W.J. Data mining and knowledge discovery. Annu. Rev. Inf. Sci. Technol. 1997, 32, 197–229.
7. Roiger, R.; Geatz, M. Data Mining: A Tutorial-Based Primer; Addison-Wesley: Boston, MA, USA, 2003.
8. Bäck, T.; Schwefel, H.-P. An overview of evolutionary algorithms for parameter optimization. Evol. Comput. 1993, 1, 1–23.
9. Coley, D.A. An Introduction to Genetic Algorithms for Scientists and Engineers; World Scientific Publishing Company: Singapore, 1999.
10. Goldberg, D.E. Genetic Algorithm in Search, Optimization and Machine Learning; Addison-Wesley: Boston, MA, USA, 1989; Volume 1, p. 9.
11. Goldberg, D.E. Real-coded genetic algorithms, virtual alphabets, and blocking. Complex Syst. 1991, 5, 139–167.
12. Tang, P.-H.; Tseng, M.-H. Medical data mining using BGA and RGA for weighting of features in fuzzy k-NN classification. In Proceedings of the 2009 International Conference on Machine Learning and Cybernetics, Baoding, China, 12–15 July 2009; pp. 3070–3075.
13. Alhijawi, B.; Awajan, A. Genetic algorithms: Theory, genetic operators, solutions, and applications. Evol. Intell. 2024, 17, 1245–1256.
14. Xue, Y.; Wang, Y.; Liang, J.; Slowik, A. A self-adaptive mutation neural architecture search algorithm based on blocks. IEEE Comput. Intell. Mag. 2021, 16, 67–78.
15. Hamdia, K.M.; Zhuang, X.; Rabczuk, T. An efficient optimization approach for designing machine learning models based on genetic algorithm. Neural Comput. Appl. 2021, 33, 1923–1933.
16. Tseng, M.-H. GA-based weighted ensemble learning for multi-label aerial image classification using convolutional neural networks and vision transformers. Mach. Learn. Sci. Technol. 2023, 4, 045045.
17. Koh, H.C.; Tan, G. Data mining applications in healthcare. J. Healthc. Inf. Manag. 2011, 19, 65.
18. Stilou, S.; Bamidis, P.D.; Maglaveras, N.; Pappas, C. Mining association rules from clinical databases: An intelligent diagnostic process in healthcare. Stud. Health Technol. Inform. 2001, 2, 1399–1403.
19. Kincade, K. Data mining: Digging for healthcare gold. Insur. Technol. 1998, 23, 2–7.
20. Kreuze, D. Debugging hospitals. Technol. Rev. 2001, 104, 32.
21. Chung, H.; Gray, P. Special Section: Data Mining. J. Manag. Inf. Syst. 1999, 16, 11–16.
22. Tan, K.C.; Yu, Q.; Heng, C.; Lee, T.H. Evolutionary computing for knowledge discovery in medical diagnosis. Artif. Intell. Med. 2003, 27, 129–154.
23. Chen, H.-L.; Yang, B.; Wang, G.; Wang, S.-J.; Liu, J.; Liu, D.-Y. Support vector machine based diagnostic system for breast cancer using swarm intelligence. J. Med. Syst. 2012, 36, 2505–2519.
24. Fan, C.-Y.; Chang, P.-C.; Lin, J.-J.; Hsieh, J. A hybrid model combining case-based reasoning and fuzzy decision tree for medical data classification. Appl. Soft Comput. 2011, 11, 632–644.
25. Tang, P.-H.; Tseng, M.-H. Adaptive directed mutation for real-coded genetic algorithms. Appl. Soft Comput. 2013, 13, 600–614.
26. Azad, C.; Bhushan, B.; Sharma, R.; Shankar, A.; Singh, K.K.; Khamparia, A. Prediction model using SMOTE, genetic algorithm and decision tree (PMSGD) for classification of diabetes mellitus. Multimed. Syst. 2022, 28, 1289–1307.
27. Jian, M.-S.; Wang, C.-H.; Wu, W.-S.; Huang, T.-W. Pipeline Based Genetic Algorithm for Patient Scheduling in Hospital Outpatient Department and Laboratory. In Proceedings of the 2024 26th International Conference on Advanced Communications Technology (ICACT), Pyeong Chang, Republic of Korea, 4–7 February 2024; pp. 163–167.
28. Maulik, U. Medical image segmentation using genetic algorithms. IEEE Trans. Inf. Technol. Biomed. 2009, 13, 166–173.
29. Saxena, P.; Huque, S.; Vhatkar, S.; Ramana, K.V.; Durai, D.; Anand, C. Genetic Algorithm Optimization of Feature Selection for Medical Image Classification. ICTACT J. Soft Comput. 2024, 14, 3354–3360.
30. Deep, K.; Thakur, M. A new mutation operator for real coded genetic algorithms. Appl. Math. Comput. 2007, 193, 211–230.
31. Chen, T.-C.; Hsu, T.-C. A GAs based approach for mining breast cancer pattern. Expert Syst. Appl. 2006, 30, 674–681.
32. Tseng, M.-H.; Chen, S.-J.; Hwang, G.-H.; Shen, M.-Y. A genetic algorithm rule-based approach for land-cover classification. ISPRS J. Photogramm. Remote Sens. 2008, 63, 202–212.
33. Tseng, M.-H.; Liao, H.-C. The genetic algorithm for breast tumor diagnosis—The case of DNA viruses. Appl. Soft Comput. 2009, 9, 703–710.
34. Street, W.N.; Wolberg, W.H.; Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. In Proceedings of the Biomedical Image Processing and Biomedical Visualization, San Jose, CA, USA, 1–4 February 1993; pp. 861–870.
35. Tsai, J.H.; Tsai, C.H.; Cheng, M.H.; Lin, S.J.; Xu, F.L.; Yang, C.C. Association of viral factors with non-familial breast cancer in Taiwan by comparison with non-cancerous, fibroadenoma, and thyroid tumor tissues. J. Med. Virol. 2005, 75, 276–281.
36. Gadaras, I.; Mikhailov, L. An interpretable fuzzy rule-based classification methodology for medical diagnosis. Artif. Intell. Med. 2009, 47, 25–41.
37. Nauck, D.; Kruse, R. Obtaining interpretable fuzzy classification rules from medical data. Artif. Intell. Med. 1999, 16, 149–169.
38. Pena-Reyes, C.A.; Sipper, M. A fuzzy-genetic approach to breast cancer diagnosis. Artif. Intell. Med. 1999, 17, 131–155.
39. Quinlan, J.R. Improved use of continuous attributes in C4.5. J. Artif. Intell. Res. 1996, 4, 77–90.
40. Setiono, R. Generating concise and accurate classification rules for breast cancer diagnosis. Artif. Intell. Med. 2000, 18, 205–219.
41. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper_files/paper/2018 (accessed on 3 July 2025).
42. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.

Figure 1. Two-level rule framework.

Figure 2. The flowchart of the ADM-RCGA mining approach.

Figure 3. Classification accuracy comparison.

Figure 4. CPU time comparison.

Table 1. The general rule structures.

Segment	IF Part				THEN Part
Features	V₁	V₂	……	V_m	Diagnosis
Logical Operator	(</≧)	(</≧)	……	(</≧)	is
Cutoff Values	C₁	C₂	……	C_m	Malignant

Table 2. Descriptive statistics on the WBCD dataset.

Class	Benign		Malignant		t-Test
Instance	458 (65.52%)		241 (34.48%)
Attribute	Mean ± SD	Range	Mean ± SD	Range	p-Value
V₁: clump thickness	2.96 ± 1.67	1–8	7.20 ± 2.43	1–10	0.000
V₂: uniformity of cell size	1.33 ± 0.91	1–9	6.57 ± 2.72	1–10	0.000
V₃: uniformity of cell shape	1.44 ± 1.00	1–8	6.56 ± 2.56	1–10	0.000
V₄: marginal adhesion	1.36 ± 1.00	1–10	5.55 ± 3.21	1–10	0.000
V₅: single epithelial cell size	2.12 ± 0.92	1–10	5.30 ± 2.45	1–10	0.000
V₆: bare nuclei	1.35 ± 1.18	1–10	7.63 ± 3.12	1–10	0.000
V₇: bland chromatin	2.10 ± 1.08	1–7	5.98 ± 2.27	1–10	0.000
V₈: normal nucleoli	1.29 ± 1.06	1–9	5.86 ± 3.35	1–10	0.000
V₉: mitoses	1.06 ± 0.50	1–8	2.59 ± 2.56	1–10	0.000

Table 3. Descriptive statistics on the CSBC dataset.

Class	Benign	Malignant	χ²-Test
Instance	28 (35%)	52 (65%)	
Attribute	No. (Positive/Negative)	No. (Positive/Negative)	p-Value
V₁: HSV-1	20/8 (71.4%/28.6%)	8/44 (15.4%/84.6%)	0.000 *
V₂: EBV	16/12 (57.1%/42.9%)	28/24 (53.8%/46.2%)	0.777
V₃: CMV	20/8 (71.4%/28.6%)	47/5 (90.4%/9.6%)	0.028 *
V₄: HPV	4/24 (14.3%/85.7%)	8/44 (15.4%/84.6%)	0.896
V₅: HHV-8	28/0 (100.0%/0.0%)	28/24 (53.8%/46.2%)	0.000 *

* p < 0.05.

Table 4. The experimental results of the overall accuracy for the WBCD classification.

Item	Rule	CPU Time (s)	No. of Generations	Training Accuracy (%)	Testing Accuracy (%)
Rule 1
1	V₃ > 2.79 and V₆ > 2.94	21.55	40	92.70	95.71
2	V₃ > 2.97 and V₆ > 2.97	35.35	51	92.70	95.71
3	V₃ > 2.40 and V₆ > 2.86	29.50	46	92.70	95.71
4	V₃ > 2.53 and V₆ > 2.74	45.19	60	92.70	95.71
5	V₃ > 2.16 and V₆ > 2.75	35.38	51	92.70	95.71
6	V₃ > 2.40 and V₆ > 2.89	35.84	51	92.70	95.71
7	V₃ > 1.51 and V₆ > 2.26	36.71	53	92.70	94.85
8	V₃ > 1.99 and V₆ > 2.23	25.52	43	92.70	94.85
9	V₃ > 2.98 and V₆ > 1.46	35.76	51	92.92	96.57
10	V₃ > 1.19 and V₆ > 2.71	35.62	51	92.70	94.85
	Average	33.64	49.70	92.72	95.54
	Standard deviation	6.26	5.27	0.07	0.52
Rule 2
1	V₂ > 3.88 and V₆ ≦ 2.36	3.36	26	95.06	99.14
2	V₂ > 1.48 and V₅ > 3.99	1.21	12	94.64	99.14
3	V₂ > 3.80 and V₆ ≦ 2.74	3.41	27	95.06	99.14
4	V₁ > 2.15 and V₂ > 3.80	1.19	12	95.06	98.71
5	V₂ > 3.16 and V₆ ≦ 8.74	1.94	18	95.06	98.71
6	V₂ > 3.90 and V₆ ≦ 5.14	2.14	20	95.06	98.71
7	V₂ > 3.16 and V₆ ≦ 6.25	3.07	24	95.06	98.28
8	V₂ > 3.68 and V₆ ≦ 7.61	1.89	15	95.06	98.28
9	V₁ > 3.74 and V₇ > 4.82	3.12	25	95.06	98.28
10	V₃ > 1.57 and V₅ > 3.48	1.15	12	94.85	98.28
	Average	2.25	19.10	95.00	98.67
	Standard deviation	0.88	5.82	0.13	0.36

Table 5. Comparison of the ADM-RCGA, DPSO, and BCGA for the WBCD classification.

Method	No.	Decision Rule	Rules Applied	Training Accuracy (%)	Testing Accuracy (%)
ADM-RCGA	1	IF (Uniformity of Cell Shape > 2.79 and Bare Nuclei > 2.94) THEN malignant	Rule 1	92.70	95.71
ADM-RCGA	2	IF (Uniformity of Cell Size > 3.88 and Bare Nuclei ≦ 2.36) THEN malignant	Rule 1 + Rule 2	95.06	99.14
DPSO [5]	1	IF (Uniformity of Cell Size > 3 and Uniformity of Cell Shape > 2 and Bland Chromatin > 1) THEN malignant	Rule 1	91.20	95.28
DPSO [5]	2	IF (Clump Thickness > 3 and Bare Nuclei > 2) THEN malignant	Rule 1 + Rule 2	95.05	98.71
BCGA [31]	1	IF (Uniformity of Cell Size > 2.4467 and Uniformity of Cell Shape > 2.5096) THEN malignant	Rule 1	93.35	96.14
BCGA [31]	2	IF (Bland Chromatin > 3.0526 and Clump Thickness > 3.1710) THEN malignant	Rule 1 + Rule 2	95.39	96.54
BCGA [31]	3	IF (Bare Nuclei > 3.0899 and Uniformity of Cell Size = 2) THEN malignant	Rule 1 + Rule 2 + Rule 3	95.60	96.995

Table 6. The results of the diagnosis for the WBCD classification.

Actual Class	Method	Classified as I (Without Breast Cancer)	Classified as II (With Breast Cancer)
I (without breast cancer)	ADM-RCGA	166 (98.81%)	2 (1.19%)
I (without breast cancer)	DPSO [5]	165 (98.21%)	3 (1.79%)
II (with breast cancer)	ADM-RCGA	0 (0%)	65 (100%)
II (with breast cancer)	DPSO [5]	0 (0%)	65 (100%)

Table 7. The comparison results from testing the accuracy of the rule extraction methods for the WBCD classification.

Study	Technique	Average Accuracy (%)	Best Accuracy (%)	No. of Rules	No. of Features	Dataset Size (Training/Testing)
Quinlan [39]	C4.5	95.09	97.84	9	3	451/232
Nauck & Kruse [37]	Neuro-fuzzy	95.06	98.55	2	6	342/341
Peña-Reyes & Sipper [38]	Fuzzy-BCGA	96.02	97.8	3	9	512/171
Setiono [40]	Neuro	N.A.	98.25	3	5	341/342
Tan et al. [22]	RCGA-GP	97.57	99.13	6	9	451/232
Chen & Hsu [31]	BCGA	N.A.	96.995	3	5	466/233
Gadaras & Mikhailov [36]	Fuzzy	96.08	96.5	6	9	340/343
Yeh et al. [5]	DPSO	98.28	98.71	2	5	466/233
Proposed method	ADM-RCGA	98.67	99.14	2	3	466/233
Proposed method	ADM-RCGA	98.71 ± 0.35		2	3	3-fold CV

Table 8. The experimental results of overall accuracy for the CSBC classification.

Item	Rule	CPU Time (s)	No. of Generations	Training Accuracy (%)	Testing Accuracy (%)
Rule 1
1	V₁ = 0	2.39	24	83.33	70.00
2	V₁ = 0	2.93	29	83.33	70.00
3	V₁ = 0	2.29	23	83.33	70.00
4	V₁ = 0	2.71	27	83.33	70.00
5	V₁ = 0	2.72	27	83.33	70.00
6	V₁ = 0	2.50	25	83.33	70.00
7	V₁ = 0	2.89	29	83.33	70.00
8	V₁ = 0	2.80	28	83.33	70.00
9	V₁ = 0	2.30	23	83.33	70.00
10	V₁ = 0	2.40	24	83.33	70.00
	Average	2.59	25.90	83.33	70.00
	Standard deviation	0.23	2.26	0.00	0.00
Rule 2
1	V₁ = 1 and V₅ = 0	0.11	12	85.00	90.00
2	V₁ = 1 and V₅ = 0	0.11	12	85.00	90.00
3	V₁ = 1 and V₅ = 0	0.13	14	85.00	90.00
4	V₁ = 1 and V₅ = 0	0.11	12	85.00	90.00
5	V₁ = 1 and V₅ = 0	0.12	13	85.00	90.00
6	V₃ = 1 and V₅ = 0	0.11	12	85.00	90.00
7	V₃ = 1 and V₅ = 0	0.18	20	85.00	90.00
8	V₃ = 1 and V₅ = 0	0.11	12	85.00	90.00
9	V₃ = 1 and V₅ = 0	0.20	22	85.00	90.00
10	V₃ = 1 and V₅ = 0	0.19	21	85.00	90.00
	Average	0.14	15.00	85.00	90.00
	Standard deviation	0.04	4.00	0.00	0.00

Table 9. The comparison results of testing the accuracy of rule extraction methods for the CSBC classification.

Study	Technique	Average Accuracy (%)	Best Accuracy (%)	No. of Rules	No. of Features	Dataset Size (Training/Testing)
Tseng & Liao [33]	BCGA	77	80	1	3	80/20
Proposed method	ADM-RCGA	90	90	2	3	80/20

Table 10. The results of the diagnosis for the CSBC classification.

Actual Class	Method	Classified as I (Without Breast Cancer)	Classified as II (With Breast Cancer)
I (without breast cancer)	ADM-RCGA	2 (66.67%)	1 (33.33%)
I (without breast cancer)	BCGA	1 (33.33%)	2 (66.67%)
II (with breast cancer)	ADM-RCGA	1 (5.88%)	16 (94.12%)
II (with breast cancer)	BCGA	2 (11.76%)	15 (88.24%)

Table 11. Comparison of the ADM-RCGA and BCGA for the CSBC classification.

Method	No.	Decision Rule	Rules Applied	Training Accuracy (%)	Testing Accuracy (%)
ADM-RCGA	1	IF (HSV-1 is negative) THEN malignant	Rule 1	83.33	70.00
ADM-RCGA	2	IF (HSV-1 is positive and HHV-8 is negative) THEN malignant	Rule 1 + Rule 2	85.00	90.00
BCGA	1	IF (HSV-1 is negative and EBV is positive or negative, and HHV-8 is positive or negative) THEN malignant	Rule 1	80.00	80.00

Table 12. Comparison of the ADM-RCGA and ensemble tree methods for the WBCD classification.

Technique	Test Accuracy (%)	CPU Time (s)
XGBoost	98.57	0.5
CatBoost	98.57	2.8
ADM-RCGA	99.14	35.89

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
