A Novel Bioinspired Algorithm for Mixed and Incomplete Breast Cancer Data Classification

The pre-diagnosis of cancer has been approached from various perspectives, so it is imperative to continue improving classification algorithms to achieve early diagnosis of the disease and improve patient survival. In the medical field, there are data that, for various reasons, are lost. There are also datasets that mix numerical and categorical values. Very few algorithms classify datasets with such characteristics. Therefore, this study proposes the modification of an existing algorithm for the classification of cancer. The said algorithm showed excellent results compared with classical classification algorithms. The AISAC-MMD (Mixed and Missing Data) is based on the AISAC and was modified to work with datasets with missing and mixed values. It showed significantly better performance than bio-inspired or classical classification algorithms. Statistical analysis established that the AISAC-MMD significantly outperformed the Nearest Neighbor, C4.5, Naïve Bayes, ALVOT, Naïve Associative Classifier, AIRS1, Immunos1, and CLONALG algorithms in conducting breast cancer classification.


Introduction
Cancer is a global problem that causes one in four deaths [1]. In men, the three most common cancers are lung, colon, and prostate, while in women, the most common cancers are breast and colorectal.
There are more than 27 different types of cancer [2], which is alarming as it is the second leading cause of death worldwide. The development of this disease is based on various criteria, such as gender, genetics, and race, among others [3]. Using non-invasive techniques allows medics and researchers to identify cancer early, allowing better treatment for patients, thereby saving lives.
For breast cancer, the pre-diagnosis process may vary according to the type and stage of cancer. However, some non-invasive studies are based on obtaining a digital image through a study (magnetic resonance, mammography, etc.) and then segmenting the region of interest (lesion). The characteristics of the lesion are obtained, and finally, the image is classified.
Several algorithms have been used for cancer classification. Due to the "No free lunch theorem" [4], there is no perfect classification algorithm; therefore, research on breast cancer classification continues to be an area of interest [5][6][7][8][9][10][11].
In this study, we use a metaheuristic based on the human immune system. A bio-inspired metaheuristic is an algorithm that imitates the behavior of fauna or a biological system to solve computational problems [12]. Because of this behavior, such algorithms are commonly used to solve nondeterministic problems, since they guide a random solution through a defined search space [13,14].
It is important to emphasize that in medical datasets, mixed data are common; that is, data consisting of categorical and numerical values. Values may also be missing due to various factors. This is relevant given that most clinical data require pre-classification treatment.
In this study, we will work on the classification task, for which we propose a classification algorithm based on the human immune system. Currently, some classifiers work with mixed data. To the best of our knowledge, none of these algorithms is bio-inspired. However, bio-inspired models have been beneficial and widely used in medical diagnosis. For this reason, we propose a bio-inspired classification algorithm that can handle mixed and incomplete data.
This paper makes several contributions. We designed an Artificial Immune System for Associative Classification of Mixed and Missing Data (AISAC-MMD). This is a novel, immune-based classification algorithm that natively handles multiclass, mixed, and incomplete data. The algorithm also has low computational complexity.
The paper is structured as follows: Section 2 briefly addresses some of the previous works on computationally assisted breast cancer classification and pre-diagnosis. Section 3 explains the materials and methods used. Section 4 presents the results, detailing the newly proposed classification algorithm, while Section 5 discusses the numerical performance of the AISAC-MMD with respect to state-of-the-art classification algorithms. The paper ends with the conclusions and directions for future study.

Related Works
Over the last 5 years, research has been published on breast cancer pre-diagnosis using classification algorithms, such as the work of Amrane et al. [5], which tested KNN and Naïve Bayes algorithms applied to breast cancer classification for binary datasets. The results revealed that KNN yielded better accuracy than Naïve Bayes for breast cancer classification.
In 2019, Saritas and Yasar [6] analyzed classification algorithms (Artificial Neural Networks and Naïve Bayes) applied to the classification of breast cancer using biomarkers. The results showed excellent performance of these two algorithms, with Artificial Neural Networks obtaining the greatest accuracy. In the same year, Ting et al. [7] proposed Convolutional Neural Networks for breast cancer classification using medical images. The results revealed high classification accuracy. Their work was tested on a real dataset of 221 patients classified into three groups (malignant, benign, and healthy).
Numerous studies have examined the classification of breast cancer; however, breast cancer is not the only cancer subject to pre-diagnosis. For example, Yuan et al. (2019) used a classification method based on a magnetic resonance model to classify a dataset of patients with prostate cancer [8]. The model yielded good results in treating and classifying magnetic resonance images for prostate cancer.
In early 2020, Acharya et al. [9] proposed a combination of enhancing image preprocessing and deep learning algorithms to improve the classification of algorithms applied to breast cancer datasets. This modification showed better accuracy for the classification algorithms tested. A similar approach was proposed by Arif et al. (2020) [10], who reviewed deep learning approaches for classifying prostate cancer using magnetic resonance images. They concluded that new validations and clinical studies should be conducted to obtain better decision-making algorithms.
In 2020, Devarriya et al. [11] proposed two fitness functions for Genetic Programming. These were used for breast cancer classification, and showed good performance with imbalanced datasets. The first approach was based on learning about the minority class, while the second approach was based on according the same importance to both classes. Based on reviews conducted in our previous works, there are opportunities for improvement. This study proposes modifying a classification algorithm based on the human immune system, demonstrating promising results.
An interesting proposal based on bio-inspired algorithms is put forward by González-Patiño et al. [15], yielding promising results for breast cancer classification. Recently, deep learning has been analyzed, and has been reported as a useful tool for this task [16][17][18]. In addition, there has been an increase over the past year in the use of bio-inspired techniques for automatic breast cancer detection [19][20][21].
However, the above-mentioned proposals only deal with numeric and complete data. Therefore, these methods need to take the additional step of data pre-processing to impute (or even delete) missing records, and to change categorical values into numeric ones. Such procedures alter the nature of the data and can lead to poor performance. This study aims to overcome these drawbacks by designing a novel algorithm that is able to natively deal with mixed and missing data.

Materials and Methods
This section describes the datasets, performance measures, and algorithms that were compared. Nine algorithms were tested for the classification of ten datasets.

Datasets
In this study, we used ten datasets related to different types of cancer. It is important to note that the datasets contained missing and mixed values, which is quite common in medical datasets.

1. Breast Cancer Digital Repository (BCDR) [22]. This dataset is composed of data extracted from Portuguese women after being tested with biopsies to identify breast lesions. As stated in [22], "BCDR-F01 has a total of 362 segmentations from which 187 are from benign findings and the remainder 175 from malignant findings. In addition to the patient age and breast density, the data set includes a set of selected binary attributes for indicating abnormalities observed by radiologists, namely masses, microcalcifications, calcifications (other than microcalcifications), axillary adenopathies, architectural distortions, and stroma distortions. Thus, the clinical data for each instance of the BCDR-F01 data set include a total of eight attributes per instance: six binary attributes related to observed abnormalities, an ordinal attribute for breast density, and a numerical attribute that contains the patient age at the time of the study."
2. Breast Cancer Wisconsin (Original) Data Set (BCWO) [23]. This dataset was provided by the UCI repository [24] and is available at http://archive.
3. Breast Cancer SEER (BCSEER) [25]. The National Cancer Institute provides this dataset, which consists of real patients from 1973 to 2013 who underwent breast cancer-related studies. The institute provides the surveillance, epidemiology, and end results (SEER) database. The SEER database classifies cancer histology and topography information based on the third edition of the International Classifications of Diseases for Oncology (ICD-O-3). In our study, we used the version of the dataset available on the Kaggle website (https://www.kaggle.com/code/jnegrini/breast-cancer-dataset, accessed on 11 January 2021).
4. Breast Cancer Wisconsin (Diagnostic) Data Set (BCWD) [26]. This binary dataset, provided by Dr. Wolberg in 1995, consists of data obtained from breast analysis and subsequently confirmed by biopsy. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image. These features include radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter^2/area − 1.0), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry, and fractal dimension ("coastline approximation" − 1). The dataset is available at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29, accessed on 11 January 2021.
5. Breast Cancer Wisconsin (Prognostic) Data Set (BCWP) [27]. This dataset was provided by Dr. Wolberg and contained data on breast cancer patients with invasive breast cancer. This dataset was donated in the same year as the BCWD. Each record represents follow-up data on one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984 and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis. The dataset has 32 predictive attributes, with the first 30 computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image. The other two attributes are recurrence time (in case of recurrence) and disease-free time (in case of non-recurrence). This dataset is available at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Prognostic%29, accessed on 11 January 2021.
6. Lung Cancer Data Set (LCDS) [18]. This dataset was chosen as it contains information on patients who had surgeries. The dataset, which was donated in 1999, focuses on the survival of these patients after surgery. It is an interesting dataset due to the scarcity of the data (only 32 subjects) and the large amount of predictive features (55). It is available at http://archive.ics.uci.edu/ml/datasets/Lung+Cancer, accessed on 11 January 2021.
7. Mammographic Mass Data Set (MMDS) [28].

We considered the existence of missing values, the number of instances and attributes, and the imbalance ratio (IR). A dataset is considered imbalanced if the IR measure exceeds 1.5 [32]. All datasets had two classes except for the LCDS dataset, which had three.
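The imbalance criterion can be illustrated with a short sketch. The text does not define the IR explicitly, so the code below assumes the common definition (number of instances in the majority class divided by the number in the minority class); the function names `imbalance_ratio` and `is_imbalanced` are hypothetical and are not part of the original implementation.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: majority-class count divided by minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


def is_imbalanced(labels, threshold=1.5):
    """Apply the IR > 1.5 criterion mentioned in the text."""
    return imbalance_ratio(labels) > threshold


# Class counts reported for BCDR-F01: 187 benign and 175 malignant findings.
bcdr_labels = ["benign"] * 187 + ["malignant"] * 175
print(imbalance_ratio(bcdr_labels), is_imbalanced(bcdr_labels))
```

Under this assumed definition, the BCDR-F01 class counts quoted above give an IR of 187/175, which is below the 1.5 threshold.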

Algorithms
Eight algorithms were selected for comparison with the proposed model. The first five algorithms were chosen since they work with mixed and missing data, which is one of the main contributions of the proposed model in this study. The remaining three algorithms follow the same principle as the proposed model; that is, they are based on an Artificial Immune System. This is why they were selected for comparison against other algorithms of the same type.

1. K-Nearest Neighbors (NN) was proposed by Cover and Hart in 1967 [33]. This algorithm assigns a class according to the k nearest patterns. If these patterns belong to different classes, a majority voting process is carried out to obtain a single class.
2. C4.5 [34] was developed as a modification of ID3 [35]. It is a decision tree that makes decisions based on the relevant information provided by each attribute.
3. Naïve Bayes [36] is a classifier based on probability and the independence of each attribute. It is derived from Bayes' theorem.
4. ALVOT is a general purpose classification model that uses different views of information based on a Support Set System [37]. This model uses a voting schema based on aggregation procedures. The model has a high computational cost when using all typical testors, but it can obtain good results with mixed and incomplete data.
5. NAC was proposed in 2017 by Villuendas-Rey et al. [38] as a learning model for classifying mixed and incomplete data. It is based on a similarity operator named MIDSO, and is a particular case of both the ALVOT and NN classifiers. It has low computational complexity and yields good results when applied to financial data.
6. AIRS1 is a classification algorithm based on the Artificial Immune System. The algorithm was proposed in 2001 [39], based on the principle of clonal selection and affinity maturation.
7. Immunos1 is another algorithm that reduces information in one training iteration. It was proposed in 2005 [40].
8. CLONALG is an algorithm based on the principle of clonal selection for classification. Each prototype improves the recognition of patterns in each iteration due to the affinity function. This algorithm was proposed in 2002 [41].
It should be noted that these last three algorithms do not operate with missing or mixed values, which is why an imputation was necessary. Table 2 shows the parameters of the compared algorithms; we used the default parameters, as proposed in the original implementations.

Performance Measure
Due to data imbalances, we used the Balanced Accuracy measure, also known as macro average accuracy [42]. Balanced Accuracy is based on calculating each class's accuracy and subsequently averaging that accuracy.
This measure can be easily calculated from the Confusion matrix, which presents the correctly classified patterns for each class. Figure 1 shows an example of a Confusion matrix for three classes. The general formula for Balanced Accuracy is presented in Equation (1), where S_i is the Recall of class i, and k is the number of classes.

\[ \text{Balanced Accuracy} = \frac{1}{k} \sum_{i=1}^{k} S_i \qquad (1) \]
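The computation of Balanced Accuracy from a Confusion matrix can be sketched as follows. This is a minimal illustration, assuming that rows of the matrix correspond to true classes and columns to predicted classes; the function name `balanced_accuracy` and the example matrix are hypothetical and do not come from the experiments.

```python
def balanced_accuracy(confusion):
    """Average of per-class Recall values.

    Assumption: confusion[i][j] is the number of patterns of true class i
    that were assigned to class j.
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            raise ValueError(f"class {i} has no patterns")
        recalls.append(row[i] / total)  # Recall of class i
    return sum(recalls) / len(recalls)


# Hypothetical three-class Confusion matrix, for illustration only.
example = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
print(balanced_accuracy(example))
```

Because each class contributes equally to the average regardless of its size, a classifier that ignores a minority class is penalized, which is the reason for preferring this measure with imbalanced data.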

Results
Our proposal is based on the recently introduced Artificial Immune System for Associative Classification (AISAC) [15]. Our aim was to address AISAC's main drawback of not working with missing or mixed data (MMD), given that several medical datasets have these characteristics. Based on the AISAC, we proposed modifications that yielded better performance. Thus, we offered a solution to problems associated with the AISAC through a novel algorithm named the Artificial Immune System for Associative Classification in Mixed and Missing Data (AISAC-MMD).
The proposed algorithm incorporates several modifications to handle MMD, as shown in Figure 2. To explain the variants and modifications introduced in the proposed AISAC-MMD, we use the pseudocode presented in Figure 2 to describe the changes in each phase.
In Figure 3, we present the modification of the Adaptive Immune Response, which uses missing and mixed data. With regard to data structures, we store the training set as a list of instances, and consider that each instance has a decision class. We require a dissimilarity function to compare two instances (user-defined), a fitness function to assess the quality of the created prototype set, and the associated performance measure (user-defined).
We start by dividing the training set by Hold-Out. Then, we create several clusters (bags) to initially structure the data (Phase 1). In Phase 2, we merge the instances in the bags, thereby obtaining the initial prototype set to represent the data. After that, the algorithm undergoes an iterative process (Phases 3 and 4). Phase 3 "moves" the instances in such a way that the performance measure is optimized. After that, to avoid overfitting, Phase 4 creates clones and obtains a new set of prototypes. At the end of the iterative process, the algorithm stores the final prototype set in memory.
For the distance calculation, the dissimilarity function is a parameter of the algorithm; in our experiments, we used the HEOM dissimilarity. Similarly, we modified the Adjusting function (Adapt), presented in Figure 4, by changing its dissimilarity function.
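For reference, the HEOM dissimilarity can be sketched as below. This is a minimal illustration following the usual definition of the Heterogeneous Euclidean-Overlap Metric (overlap distance for categorical attributes, range-normalized absolute difference for numeric attributes, and maximal distance when either value is missing); it is not the implementation used in the experiments. Representing missing values as `None`, and the names `heom`, `is_categorical`, and `ranges`, are assumptions made for this example.

```python
import math

MISSING = None  # assumption: missing values are encoded as None


def heom(x, y, is_categorical, ranges):
    """HEOM dissimilarity between two mixed, possibly incomplete instances.

    is_categorical[a] tells whether attribute a is categorical;
    ranges[a] is (max - min) of numeric attribute a in the training set
    (ignored for categorical attributes).
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is MISSING or ya is MISSING:
            d = 1.0  # maximal per-attribute distance when a value is unknown
        elif is_categorical[a]:
            d = 0.0 if xa == ya else 1.0  # overlap distance
        else:
            d = abs(xa - ya) / ranges[a] if ranges[a] > 0 else 0.0
        total += d * d
    return math.sqrt(total)


# Hypothetical instances: (age, breast density, binary finding).
p = (50, "dense", None)
q = (60, "fatty", 1)
print(heom(p, q, [False, True, False], [40, None, 1]))
```

Note that no value is imputed: a missing entry only contributes the maximal per-attribute distance, so the instances themselves remain untouched.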
When a pattern with missing values is selected as one of the closest elements to a specific pattern in any part of the algorithm, its missing values are substituted, only for the computation of the prototype, by the mean value for numeric attributes or by the mode for categorical attributes. This allows us to update the prototypes without modifying the original patterns. To the best of our knowledge, this is the first bio-inspired classifier that works with mixed and missing information without transforming the data. In other words, the AISAC-MMD maintains the missing and mixed values in the dataset without imputing attributes or introducing artificial values. This will be beneficial in the medical field, since most datasets have these characteristics.
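The substitution used during prototype computation can be sketched as follows. This is only an illustration under several assumptions that the text does not specify: missing values are encoded as `None`, the mean and mode are taken over the observed values of the group of patterns being merged, and the prototype itself is the attribute-wise mean (numeric) or mode (categorical) of that group. The names `fill_value` and `build_prototype` are hypothetical.

```python
from statistics import mean, mode


def fill_value(instances, a, is_categorical):
    """Mean (numeric) or mode (categorical) of the observed values of attribute a."""
    observed = [inst[a] for inst in instances if inst[a] is not None]
    if not observed:
        return None  # nothing observed: the attribute stays missing
    return mode(observed) if is_categorical[a] else mean(observed)


def build_prototype(instances, is_categorical):
    """Build one prototype from a group of patterns without altering them."""
    prototype = []
    for a in range(len(is_categorical)):
        fill = fill_value(instances, a, is_categorical)
        if fill is None:
            prototype.append(None)
            continue
        # Substituted values live only in this temporary column.
        column = [fill if inst[a] is None else inst[a] for inst in instances]
        prototype.append(mode(column) if is_categorical[a] else mean(column))
    return prototype


# Hypothetical group of patterns: (age, breast density, binary finding).
group = [(40, "fatty", 1), (60, None, 0), (None, "fatty", 1)]
print(build_prototype(group, [False, True, True]))
print(group)  # the original patterns still contain their missing values
```

The substituted values exist only in a temporary copy of each attribute column, which mirrors the statement that prototypes are updated without modifying the original patterns.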
The following section discusses the comparison between the proposed AISAC-MMD and existing classifiers.

Discussion
We used the 10 datasets described in Section 3.1 to assess the performance of the AISAC-MMD. The experiments were conducted on a desktop computer with a 64-bit Windows 10 Enterprise operating system, an Intel i5-6500 processor at 3.20 GHz, and 16 GB of RAM. As this was a work computer, all experiments were carried out under low priority.
We compared the nine classification algorithms on these datasets for breast cancer-related prediction. First, we compared the AISAC-MMD against classical classification algorithms that work with mixed data and missing values. The results are presented in Table 3. We used the Balanced Accuracy measure (Equation (1)) due to the high degree of imbalance present in the datasets (Table 1). In this way, we managed to avoid bias toward the majority classes. The AISAC-MMD obtained the best performance for seven out of ten datasets, compared with other algorithms that work with missing and mixed values. The best performance for each dataset is highlighted in bold. The second comparison was performed on algorithms based on artificial immune systems (Table 4). Again, the best results for each dataset are indicated in bold. Regarding algorithms based on the same principle, the AISAC-MMD obtained the best performance for nine datasets. With these results, we proceeded to perform a statistical test.
We conducted the Wilcoxon test, which identifies the presence or absence of differences in performance between various algorithms. This test is based on selecting an algorithm and comparing it with another. In this case, we compared the new AISAC-MMD model with other algorithms.
The results of the statistical test (Wilcoxon test) comparing the algorithms on the same datasets are presented next. This test is widely used to identify differences in performance when comparing several algorithms [43].
The Wilcoxon signed-rank test was used in this study. The comparison is presented in Table 5, considering α = 0.05; p-values below this threshold lead to the rejection of the null hypothesis H0. Hypothesis H0 states that there are no differences in the performance of the compared algorithms. We set a confidence level of 95%. We first performed the test to compare the AISAC-MMD against the classical algorithms that work with missing and mixed data (Table 5). Concerning the literature algorithms, the null hypothesis H0 was rejected for all algorithms. Therefore, the AISAC-MMD outperformed these algorithms. The same held for the algorithms based on the Artificial Immune System principle: the AISAC-MMD performed better, as demonstrated by the statistical test.
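The mechanics of the Wilcoxon signed-rank test can be sketched as below. This is a self-contained illustration that computes the test statistic and an exact two-sided p-value by enumerating all sign assignments; it is not the software used in the study, and the function names `signed_ranks` and `wilcoxon_exact` as well as the scores in the example are hypothetical placeholders, not experimental results.

```python
from itertools import product


def signed_ranks(a, b):
    """Non-zero paired differences and the average ranks of their magnitudes."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied magnitudes.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run (ranks start at 1)
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return diffs, ranks


def wilcoxon_exact(a, b):
    """Return (statistic, exact two-sided p-value) for paired samples a and b."""
    diffs, ranks = signed_ranks(a, b)
    if not diffs:
        raise ValueError("all paired differences are zero")
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    stat = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    # Under H0 every sign pattern is equally likely; enumerate all 2^n of them.
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for s, r in zip(signs, ranks) if s)
        if min(plus, total - plus) <= stat:
            hits += 1
    return stat, hits / 2 ** len(ranks)


# Hypothetical Balanced Accuracy values of two classifiers on five datasets.
ours = [0.91, 0.85, 0.78, 0.88, 0.95]
other = [0.90, 0.80, 0.70, 0.78, 0.80]
stat, p = wilcoxon_exact(ours, other)
print(stat, p, p < 0.05)
```

The enumeration grows as 2^n, so this approach is only practical for a small number of datasets. With ten datasets and no zero differences, the smallest attainable two-sided p-value is 2/2^10, which is low enough to reject H0 at α = 0.05.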
In summary, the AISAC-MMD outperformed all eight classification algorithms. Comparing the new modification with its previous version, the AISAC-MMD performed well, in addition to working with mixed data and missing values.

Conclusions
In this study, we introduced the first bio-inspired classification algorithm that is able to natively deal with missing and mixed data. The advantages of this algorithm are:

1. Its ability to handle missing and mixed data without any pre-processing; this is useful since most datasets present missing values and mixed attributes.
2. Its creation of a reduced prototype set; this decreases storage complexity, making it suitable for hardware implementation in devices associated with other medical devices, such as mammographs.
3. Its ease of use and good performance, which allows doctors to make decisions when there is high demand in the analysis of mammographic studies.

The main limitation of the proposal is that, as with most metaheuristics, it has several parameters. On the other hand, varying the values of these parameters can help to improve the algorithm's performance.
In this study, no parameter adjustment was performed nor were different configurations tested. This aspect can be examined in future research to improve the performance of the algorithm. Finally, the use of other strategies can be examined to further explore this research area.

Data Availability Statement: All datasets are publicly available from the Machine Learning Repository of the University of California at Irvine [18] (https://archive.ics.uci.edu/ml/datasets.php, accessed on 11 January 2021) except the ones in [22], which are available upon request.