Article

Impact on Classification Process Generated by Corrupted Features

1 Department of Computer Science and Information Technology, Faculty of Automation, Computers, Electrical Engineering and Electronics, Dunarea de Jos University of Galati, 47 Domneasca Str., 800008 Galati, Romania
2 The Modelling & Simulation Laboratory, Dunarea de Jos University of Galati, 47 Domneasca Str., 800008 Galati, Romania
3 Department of Administration, Dunarea de Jos University of Galati, 47 Domneasca Str., 800008 Galati, Romania
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(2), 45; https://doi.org/10.3390/bdcc9020045
Submission received: 6 October 2024 / Revised: 31 January 2025 / Accepted: 10 February 2025 / Published: 18 February 2025

Abstract

This study tests the robustness of machine learning (ML) and neural network (NN) models with a new idea based on corrupted data. Typically, ML and NN classifiers are trained on real feature data; however, a portion of the features may be false, noisy, or incorrect. The undesired content was analyzed in eight experiments with false data, six with feature noise, and six with label noise, all conducted on the public Breast Cancer Wisconsin Dataset (BCWD). Throughout these experiments, the data were gradually corrupted in a random way, generating new values that replaced raw features belonging to the BCWD. Artificial Intelligence (AI) tools should be properly selected when different diseases are categorized using medical data. The Pearson correlation coefficient (PCC) was applied between features to monitor their correlation in each experiment, and a correlation matrix between true and false features was used. Four machine learning (ML) algorithms—Random Forest (RF), XGBClassifier (XGB), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM)—were used for the analysis of important features (IF) and for the binary classification. The study was completed with three deep neural networks: a simple Deep Neural Network (DNN), a Convolutional Neural Network (CNN), and a Transformer Neural Network (TNN). In the context of binary classification, the accuracy, F1-score, Area Under the Curve (AUC), and Matthews correlation coefficient (MCC) metrics were computed to assess the performance of the malignant versus benign breast cancer (BC) classification. The results demonstrate the robustness of some methods and the sensitivity of other machine learning algorithms in the context of corrupted data, computational cost, and hyperparameter optimization.

1. Introduction

The healthcare system collects data from different sources, sometimes using poorly calibrated acquisition devices. This problem, as well as the doctor's level of attention, can influence the quality of the collected data. An inadequate acquisition system leads to unwanted artifacts such as noisy, incomplete, inaccurate, or unclean data, and these affect the classification of diseases when different AI tools are used. In statistics, the elimination of artifacts, outliers, or noise from data is difficult work [1]. Accurate data play an essential role in medical applications and in statistics because, after their interpretation, a prognosis about population health can be deduced. Frequently, AI tools such as ML or DNN models are trained on medical data in order to perform prediction, multiclass classification, or regression. The problem that arises, in most cases, is whether the analyzed data are clean and reflect reality. Therefore, testing the sensitivity of ML classifiers to different data types is imperative in the classification process.
Corrupted data can occur at different levels: labels, patterns, or features. One cause can be human error in manual data labeling. Another source of errors could be the manual delineation of the ground truth of lesions in an image made by a physician. Sometimes, the methods for pattern detection contain errors in the processing step, meaning that not all features that form a dataset are relevant. In many cases, feature selection is a complex task carried out with dedicated algorithms so that irrelevant features can be removed, because irrelevant or redundant information affects the quality of feature classification.
In the following, contemporary references are highlighted in which various methods of data corruption and analysis are covered. Two training-free methods based on neighborhood information for detecting corrupted labels were proposed by Zhu et al. [2]: the first method relied on "local voting" by examining the noisy-label consensus of neighboring features, and the second method used a ranking-based methodology to score each instance and eliminate a predetermined number of instances that were likely to be corrupted. Rankin et al. [3] verified the ability of a supervised machine learning model trained on synthetic data to categorize real data accurately, which is a crucial indicator of the utility of a synthetic dataset for machine learning purposes. This establishes that if supervised machine learning models are trained exclusively on synthetic data, they can still be sufficiently resilient to categorize real data samples. Torfi et al. [4] investigated the effect on classification of feeding a DNN with synthetic data created with GANs, evaluating techniques for synthetic data generation on seven datasets. A synthetic cancer dataset created from the publicly available cancer registry data of the Surveillance Epidemiology and End Results program was classified with DT, Logistic Regression (LR), and a Neural Network (NN). False data affect classification with DNN [2]; Stochastic Gradient Descent (SGD), decision tree (DT), K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), and DNN [4]; or LR, NN, and DT [5].
The main purpose of this study is to confirm the sensitivity of AI tools when they are trained with false data, and it is presented across the following sections. Section 2 contains a summary of up-to-date papers. In Section 3, the objectives, methodology, and proposals of this study are described. In Section 4, information about the dataset used is provided. A brief analysis of the ML and DNN models is conducted in Section 5. The mathematical approaches to the Pearson correlation coefficient, confusion matrix, and other metrics are provided in Section 6. Section 7, "Results and Discussions", describes all experiments and their results in detail, as well as preliminary solutions, future directions, and the limitations of the study. Finally, in Section 8, we conclude our study with a summary and future research.

2. Related Work

Corrupted data can be applied to either the raw features or the labels. In the studies reviewed below, the supplied data were corrupted with different noise types or replaced with placeholders in order to verify the reliability of neural networks and machine learning algorithms.
To handle public electroencephalography data, Banville et al. [6] proposed a dynamic spatial filter inserted before the first layer of a neural network or machine learning model. With this procedure, they evaluated corrupted data in terms of signal quality.
Budach et al. [7] used an empirical method to examine the connection between fifteen ML models and six dimensions of data quality, studying the behavior of ML models in the presence of corrupted data. In the corruption process, missing values were replaced with different types of placeholders.
Wu et al. [8] corrupted chemical data with two types of noise: Gaussian and non-Gaussian noise. ML techniques show significant potential in modeling large datasets, but the challenge of utilizing corrupted data has limited their application to chemical plant data, as most machine learning applications in the literature remain confined to deterministic scenarios.
Alhajeri et al. [9] approximated a class of multi-input–multi-output nonlinear systems using recurrent neural networks and the same data corruption method as before.
Most supervised classification and regression tasks assume that the provided labels accurately represent the ground truth. This premise is frequently contravened. For instance, physicians interpreting the same medical image may have divergent subjective assessments of the diagnosis, resulting in variability in the ground truth label itself. In different contexts, such heterogeneity may stem from sensor noise, data input errors, human annotator subjectivity, or several other factors. The labels utilized for training machine learning (ML) models may frequently be imprecise, as they may not always represent the ground truth [10,11].
Hendrycks et al. [12] showed that label noise makes machine learning systems much less effective when they use labeled data. Because large datasets are becoming more important in deep learning, being able to handle label noise is an important trait for classifiers.

3. Proposed Methodology

The main contribution of this paper is to test the behavior of various AI algorithms on the well-established BCWD when the features are contaminated with false data and noise. Four ML models and three neural networks (DNN, CNN, and Transformer architectures) were fed with the following datasets: (i) raw data and corrupted data; (ii) raw data and feature noise; and (iii) label noise. In the case of false data and noise, the selected features from the BCWD were corrupted step by step: the records of all features with subunit values were replaced with random subunit values. At every step, the correlation between features was expressed by a correlation matrix, followed by classification with RF, XGB, SVM, and KNN, as well as DNN, CNN, and the Transformer. As listed above, the same features were also affected by noise, and then label noise was added. For each experiment, the accuracy, F1-score, MCC, and AUC were computed. The proposed framework with all stages is shown in Figure 1.
In the Experiments/Sets stage, the names of the corrupted features are marked in green. Based on the proposed experiments, we tried to answer the following questions:
RQ1: Can corrupted data be correlated with raw data?
RQ2: Can false data influence important feature selection in a binary classification?
RQ3: Can corrupted data influence the classification process with DNN, CNN, and transformers?
RQ4: Can false data influence the classification with RF and XGB, KNN, and SVM?
RQ5: Can noise affect the binary classification?
RQ6: What metrics can quantify the corruption of the data?

4. Information Sources and Search Strategy

The literature widely uses the public Breast Cancer Wisconsin Dataset for testing performance algorithms, comparing breast cancer classification models [13,14], employing ensemble classifiers and feature selection [15], and comparing ML algorithms [16,17], intelligent fuzzy systems [18], deep learning paradigms for BC diagnosis [19], or principal component analysis [20].
Street et al. [21] published the BCWD in 1993. Considering only the custom date range 1993–2024, 8792 scientific papers have used this database during this period. A graphical representation of the occurrence of the BCWD in academic publishing platforms is shown in Figure 2.
Because the dataset has wide applicability, the number of publications that used the BCWD was counted. The following online bibliographic databases were searched on 5 August 2024, with the number of occurrences of the keywords "Breast Cancer Wisconsin Dataset" given in parentheses: IEEE Xplore (347), ACM Digital Library (767), SpringerLink (3940), ScienceDirect (3087), Nature (494), PubMed (36), and MDPI (26).
Data that belonged to the BCWD corresponded to 569 patients, with benign (357 cases) and malignant (212 cases) tumor diagnoses [21]. There are 32 quantitative features in the dataset: id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, and fractal_dimension_worst.
All experiments were conducted on a PC with the following configuration: CPU AMD Ryzen 5 5500, 6 cores, 12 Threads 3.6 GHz/4.2 GHz, GPU AMD R7 200 2 GB VRAM, RAM 32 GB, OS Windows 10, and SSD 1 TB.
The software environment utilized Python (3.12.7) and the following libraries: scikit-learn (1.5.2), TensorFlow (2.18.0), Keras (3.6.0), Keras Tuner (1.4.7), and Tokenizers (0.21.0).
The computational cost associated with the experiments, the hardware and software environment, and the effectiveness of the AI tools is expressed as time consumption. This metric records the training time and the testing/inference time for each method and each dataset.
To evaluate the performance of the selected ML classification algorithms when trained with partially false data, the BCWD was transformed through two selection processes. First, the features with the suffixes _se and _worst were removed (because these features are highly correlated with the ones carrying the _mean suffix), and only the diagnosis (as target feature) and the _mean features—radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension (FD)—were kept. The second step consisted of selecting the features whose records have values between 0 and 1 and replacing those records with random subunit values. In this sense, only the following features were manipulated: smoothness, compactness, concavity, concave points, symmetry, and FD. Therefore, the eight datasets formed are listed in Table 1, where the symbol "*" marks a manipulated feature and "-" a raw feature. The diagnosis attribute was the target feature for both ML and DNN.
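For illustration, a minimal Python sketch of this corruption step is given below; the column names follow the usual BCWD "_mean" naming, and the function name, random seed, and usage lines are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
import pandas as pd

# The six "_mean" features whose records lie in (0, 1) and are therefore eligible for corruption.
SUBUNIT_FEATURES = ["smoothness_mean", "compactness_mean", "concavity_mean",
                    "concave points_mean", "symmetry_mean", "fractal_dimension_mean"]

def corrupt_features(df: pd.DataFrame, features_to_corrupt, seed=0) -> pd.DataFrame:
    """Replace every record of the selected subunit features with a random value in (0, 1)."""
    rng = np.random.default_rng(seed)
    corrupted = df.copy()
    for col in features_to_corrupt:
        corrupted[col] = rng.random(len(corrupted))
    return corrupted

# Building the gradually corrupted sets of Table 1, e.g., set 2 (only FD) and set 7 (all six):
# set2 = corrupt_features(data, SUBUNIT_FEATURES[-1:])
# set7 = corrupt_features(data, SUBUNIT_FEATURES)
```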
Label noise and feature inaccuracies frequently cause problems in real-world datasets, arising from human error, inadequate quality control, or the constraints of automated labeling systems. Noisy labels can severely impact the robustness of AI algorithms; therefore, the present study quantitatively assesses label noise and feature noise by manipulating the medical BCWD with Gaussian noise and measuring the classification performance of four ML and three NN models.
The same sets presented in Table 1 were corrupted with Gaussian noise, resulting in additional datasets designated in the flowchart as set-ni, where i = 2, 3, …, 7.
The diagnostic target feature in the BCWD was also influenced by noise; these new sets in the flowchart are designated as set-ti, where i = 2, 3, …, 7.
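As an illustration, both types of noise can be injected as in the following minimal sketch; the noise level sigma, the flipped-label fraction, and the function names are assumptions, since the exact amounts are not restated here.

```python
import numpy as np
import pandas as pd

def add_feature_noise(df: pd.DataFrame, features, sigma=0.1, seed=0) -> pd.DataFrame:
    """Add zero-mean Gaussian noise to the selected feature columns (set-ni datasets)."""
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    for col in features:
        noisy[col] = noisy[col] + rng.normal(0.0, sigma, size=len(noisy))
    return noisy

def add_label_noise(df: pd.DataFrame, target="diagnosis", flip_fraction=0.3, seed=0) -> pd.DataFrame:
    """Randomly flip a fraction of the binary labels (set-ti datasets); labels are assumed 0/1."""
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    flip_idx = rng.choice(noisy.index, size=int(flip_fraction * len(noisy)), replace=False)
    noisy.loc[flip_idx, target] = 1 - noisy.loc[flip_idx, target]
    return noisy
```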
The datasets of this work are uploaded in a repository available at https://github.com/simonamoldovanu/corrupted_data/issues/1 (accessed on 12 January 2025). In all experiments, the datasets were split into 90% for training and 10% for testing. For the DNN, CNN, and Transformer, the training part (90% of the original dataset) was split internally into training and validation subsets with the parameter validation_split = 0.3.
As seen in Table 1, the radius, texture, perimeter, and area remain unchanged because their values are greater than one (superunit values).

5. Short Description of Artificial Intelligence Algorithms Used in Experiments

5.1. Machine Learning Algorithms

The RF, XGB, KNN, and SVM are ML classifiers suitable for analyzing high-dimensional and complex data and developing valuable predictions. The Standard RF method uses a bootstrapped version of the training dataset to build each decision tree in the ensemble. The repetitive partition develops each tree by repeatedly applying the same node-splitting technique, starting at the root node and continuing until specific stopping conditions are satisfied. Decision trees, which are collections of numerous, weaker learners, are the source of their predictive strength [22,23].
XGB is a natural extension of the decision tree that combines multiple decision trees to determine the final result, rather than relying on just one. It can be used for supervised learning problems such as ranking, classification, and regression. Additionally, it uses a variety of weak estimators to produce an estimated model. Similarly to previous boosting procedures, a stage-wise method permits the optimization of an arbitrary differentiable loss function, thereby generalizing the model. "Boosting" is a tree-generation technique that builds robust new trees from pre-existing ones by using gradient descent, which directs the optimization of the objective function along the quickest feasible route [24,25].
In the KNN algorithm, the “K” stands for the number of the new data point’s neighbors. The first step in this algorithm is to choose an appropriate value for K. Selecting the correct value for K is essential for increased accuracy, known as parameter tuning. Depending on the dataset, a very high value of K can occasionally confuse the results, whereas a very low value, such as 1 or 2, can produce noisy results [26].
The SVM algorithm, one of the most widely used supervised learning algorithms, is designed to solve regression and classification issues. The goal of the SVM method is to create the best possible decision limit or hyperplane that divides n-dimensional space into different classes and makes it simple to assign a different point to the appropriate category. The SVM algorithm selects extreme vector points known as support vectors, which aid in the creation of a suitable hyperplane [26].
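For orientation, the four classifiers can be instantiated with scikit-learn and the xgboost package as in the minimal sketch below; the parameter values shown are defaults or placeholders, and the tuned settings are discussed in Section 7.9.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Baseline instantiations of the four ML classifiers used for the binary BCWD classification.
models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "XGB": XGBClassifier(eval_metric="logloss", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
}

# for name, model in models.items():
#     model.fit(X_train, y_train)
#     y_pred = model.predict(X_test)
```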

5.2. Deep Neural Network (DNN)

A DNN is a feed-forward neural network with numerous layers of nonlinearity and transformations. Every layer's output provides the input for the layer after it. This distinguishes a DNN from an ordinary neural network, which consists of a small number of neurons. The numerous layers in DNNs facilitate more sophisticated feature learning and more demanding computational tasks, such as concurrently performing numerous complex processes. DNNs perform better than ML in machine perception tasks involving unstructured datasets because DL algorithms can gradually learn from their own errors: a DNN may assess the precision of its forecasts and outputs and adjust as needed [27,28]. The proposed DNN had three dense layers; the first two used ReLU as the activation function and the third used Sigmoid. In each experiment, the accuracy and loss are shown. The accuracy score is the proportion of accurate predictions obtained, and the loss values quantify the deviation from the targeted state.
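A minimal Keras sketch of such a DNN is shown below; the layer widths are assumptions, since only the number of dense layers and their activation functions are specified above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_features: int) -> keras.Model:
    """Three dense layers: ReLU on the first two, Sigmoid on the output (binary classification)."""
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),   # width is illustrative
        layers.Dense(32, activation="relu"),   # width is illustrative
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# history = build_dnn(X_train.shape[1]).fit(X_train, y_train,
#                                           epochs=200, validation_split=0.3)
```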

5.3. Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNN) are deep neural networks usually used for processing data with a grid-like topology, such as images. They use convolutional layers to extract spatial hierarchies of features, enabling the efficient learning of patterns and relationships in data. CNNs are widely used in medical imaging, computer vision, and other applications due to their ability to reduce the number of parameters while retaining high representational power. The CNN structure for tabular data is a 1D Convolutional Neural Network designed to classify binary data, with key preprocessing and architectural details. The input to the CNN is the tabular data, where each sample (row of the dataset) is normalized and reshaped to have an additional dimension (for the single feature channel). This ensures compatibility with the Conv1D layer. The convolutional layer used is the Conv1D layer, which applies 1D convolutions to learn feature interactions across adjacent columns in the data. We have also used a Flatten Layer, which flattens the 2D feature map output from the convolutional layer into a 1D vector. This prepares the data for dense layers [29].
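A minimal Keras sketch of this 1D CNN for tabular data is shown below; the number of filters and the dense layer width are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(n_features: int) -> keras.Model:
    """1D CNN for tabular rows reshaped to (n_features, 1); filter count and dense width are illustrative."""
    model = keras.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(filters=32, kernel_size=3, activation="relu"),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Each normalized row gains a channel dimension before training:
# X_cnn = X_train.reshape(-1, X_train.shape[1], 1)
```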

5.4. Transformer Neural Network (TNN)

Transformer Neural Networks (TNNs) use attention mechanisms to process sequences of data, capturing dependencies across long distances. They are extensively applied in natural language processing tasks, image analysis, and classification problems, including medical data analysis. Transformers can be adapted for tabular data by treating each row as a sequence of features. Instead of traditional methods for tabular data, the Transformer uses an attention mechanism to model the relationships between features explicitly, enabling it to learn both global and local dependencies effectively. In the data preparation phase, each row of tabular data is transformed into a sentence by concatenating feature values into a string. This allows for the use of a pre-trained tokenizer and Transformer model. In the embedding generation phase, the Transformer (BERT) tokenizes the input data and generates embeddings for each sequence. These embeddings capture the semantic relationships between feature values. In the classification phase, the embeddings generated by the Transformer are fed into a dense neural network for classification or regression tasks [30].
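A simplified sketch of this row-as-sentence pipeline using the Hugging Face transformers library is shown below; the model checkpoint ("bert-base-uncased"), the numeric formatting, and the use of the [CLS] embedding are assumptions rather than details of the original implementation.

```python
import numpy as np
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("bert-base-uncased")

def embed_rows(X: np.ndarray) -> np.ndarray:
    """Turn each row into a 'sentence' of feature values and return BERT [CLS] embeddings."""
    sentences = [" ".join(f"{value:.4f}" for value in row) for row in X]
    tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="tf")
    outputs = bert(**tokens)
    return outputs.last_hidden_state[:, 0, :].numpy()  # one embedding vector per row

# embeddings = embed_rows(X_train)  # then fed into a small dense classifier with a sigmoid output
```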

6. Mathematical Approaches

6.1. Correlation Matrix

A matrix that displays the correlation between variables is called a CM. In matrix format, it provides the correlation between each potential pair of variables.
A CM can be used to condense a sizable amount of information, identify patterns, and inform decisions. Additionally, we are able to display our results and determine which variable has a stronger correlation with the other [31]. Each component of the CM contains a Pearson correlation coefficient (PCC) between two variables. For a sample of size N with variables x and y, and with $\bar{x}$ and $\bar{y}$ denoting the averages over the N samples, the PCC is defined as [32] the following:
$$ r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^{2}}} \quad (1) $$
If $r = 1$, the relationship is a perfect positive linear relationship.
If $r = 0$, the relationship is neutral (no linear relationship).
If $r = -1$, the relationship is a perfect negative linear relationship [33].
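In practice, the correlation matrices shown in Figures 3a–10a can be obtained directly from the feature columns with pandas; a minimal sketch (function name and example columns are illustrative) is:

```python
import pandas as pd

def correlation_matrix(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Pairwise Pearson correlation coefficients between the selected (raw or corrupted) features."""
    return df[columns].corr(method="pearson")

# Example with two raw features and one corrupted feature:
# cm = correlation_matrix(set2, ["radius_mean", "area_mean", "fractal_dimension_mean"])
```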

6.2. Confusion Matrix and Metrics

Numerous metrics are available in the literature to assess a machine learning technique’s success. Several diverse areas, including engineering and medicine, to name a few, use performance evaluation. When dealing with binary classification problems, such as RF or XGB, a confusion matrix is frequently utilized to illustrate the algorithm’s predictions in comparison to the actual values. These values can be used to create various metrics that make it feasible to assess a model’s quality. The accuracy (Acc), F1-score, area under the curve (AUC) metrics, Matthews correlation coefficient (MCC) [34], and confusion matrix are expressed by Equations (2)–(8). The MCC is especially advantageous in scenarios of class imbalance, being used with success in the domains of machine learning.
$$ \text{Confusion matrix} = \begin{pmatrix} TP & FP \\ FN & TN \end{pmatrix} \quad (2) $$
$$ Acc = \frac{TP + TN}{TP + TN + FN + FP} \quad (3) $$
$$ precision = \frac{TP}{TP + FP} \quad (4) $$
$$ recall = \frac{TP}{TP + FN} \quad (5) $$
$$ F1\text{-}score = \frac{2 \cdot precision \cdot recall}{precision + recall} \quad (6) $$
$$ AUC = 1 - \frac{1}{2}\left(\frac{FP}{FP + TN} + \frac{FN}{FN + TP}\right) \quad (7) $$
$$ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (8) $$
where TP, FP, TN, and FN denote the true positive, false positive, true negative, and false negative samples, respectively [23,24,27].
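All of these quantities are available in scikit-learn; a minimal sketch of the per-experiment evaluation (the function name and dictionary keys are ours) is:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Confusion matrix and the four metrics of Equations (2)-(8) for one classifier and one dataset."""
    return {
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "Acc": accuracy_score(y_true, y_pred),
        "F1-score": f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),   # y_score: predicted probabilities of the positive class
        "MCC": matthews_corrcoef(y_true, y_pred),
    }
```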

7. Results and Discussions

The proposed research is based on eight experiments. For every experiment, the original and corrupted features of the dataset were specified (see Table 1). In all experiments, the values of the diagnosis, radius, texture, perimeter, and area features remained unchanged. For the results of each experiment, the following graphs are presented: the correlation matrix showing the pairwise correlation coefficients; the loss and accuracy curves for training and validation of the DNN; and the important features selected by the RF and XGB ML algorithms. In these graphs, the computed important features are highlighted in red; a feature is considered important if its importance score exceeds the overall average. Table 2 outlines a synthesis of the important features, enabling one to track the trend of their selection regardless of whether the data are raw or false.
The accuracy and loss on the training and validation sets are shown in Figure 3b, Figure 4b, Figure 5b, Figure 6b, Figure 7b, Figure 8b, Figure 9b and Figure 10b for 200 epochs of the DNN. The results, measured by the Acc, F1-score, and AUC metrics for the RF and XGB classifiers and the DNN, are displayed for each experiment in Table 3. After the results of each experiment—confusion matrix, loss and accuracy graphs, and feature importance graph—are presented, the correlation between features, the training and validation datasets, and the important features are discussed in the following paragraphs.
The ML, DNN, CNN, and Transformer models are defined by their parameters. Throughout the experiments, the important parameters of the ML and DNN architectures listed in Table 2 were used.

7.1. Experiment #1

In the first experiment, the ML and DNN models were fed with all raw features: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and FD, as in Table 1, set 1. As Figure 3a shows, there are differences in correlation between various pairwise features. There is a very strong positive correlation between radius, texture, perimeter, and area, but the important features are concavity and concave points (RF) and concavity (XGB).

7.2. Experiment #2

In Experiment #2, we explored the following raw features: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, and symmetry, while only one feature—FD—is corrupted, as in Table 1, set 2. Figure 4a shows a good correlation between the raw pairwise features and a much weaker correlation between FD and the raw features. The trend of important features detected by RF and XGB is similar to Experiment #1, and there is still a very strong positive correlation between radius, texture, perimeter, and area, but the selected important features change: concave points (RF); radius, perimeter, area, concavity, and concave points (XGB).

7.3. Experiment #3

In Experiment #3, the raw features are radius, texture, perimeter, area, smoothness, compactness, concavity, and concave points, and two features—symmetry and FD—are corrupted, as in Table 1, set 3. Figure 5a shows a very strong positive correlation between raw pairwise features and a very weak negative correlation between FD, symmetry, and raw features. The important features remained the same as in Experiment #2.

7.4. Experiment #4

Experiment #4 incorporates the raw features radius, texture, perimeter, area, smoothness, compactness, and concavity, while three features—concave points, symmetry, and FD—are corrupted, as in Table 1, set 4. This experiment indicates a good correlation between the raw features and a very weak correlation between the corrupted features (FD, symmetry, and concave points) and the raw features. In comparison with the previous experiments, the important features detected by RF are radius, perimeter, area, and concavity, and those detected by XGB are perimeter, area, and concavity. The false data generated for FD, symmetry, and concave points make the relationships between these features and the raw features very weak positive or very weak negative correlations.

7.5. Experiment #5

This experiment uses the raw features radius, texture, perimeter, area, smoothness, and compactness, while concavity, concave points, symmetry, and FD are represented by false data, as in Table 1, set 5. The important features detected by RF are perimeter and area, and those detected by XGB are radius, perimeter, and area. The trend of correlation between features is kept, as seen in Figure 7a, with a very weak positive or very weak negative correlation between the raw and corrupted features.

7.6. Experiment #6

This experiment was analyzed with the same criteria. The raw features radius, texture, perimeter, area, and smoothness keep their original values, while compactness, concavity, concave points, symmetry, and FD contain false data, as in Table 1, set 6. Following the same criteria as the previous experiments, Experiment #6 reveals that RF detects perimeter and area as important features, while XGB detects radius, texture, perimeter, area, and smoothness. As observed in Figure 8a, although smoothness is not replaced with false data, it becomes uncorrelated with the raw features.

7.7. Experiment #7

In Experiment #7, the raw data are radius, texture, perimeter, and area, while smoothness, compactness, concavity, concave points, symmetry, and FD are false data, as in Table 1, set 7. In this case, the important features detected by RF are perimeter and area, and those detected by XGB are radius, texture, perimeter, and area. This experiment has the highest number of uncorrelated features; only the raw features keep a very strong positive correlation.

7.8. Experiment #8

In the last experiment, all features that had been replaced with false data in the previous experiments were excluded; as a result, only radius, texture, perimeter, and area were kept. The correlation matrices point out a strong correlation between all features selected in Experiment #8, and area and perimeter are the important features. After 75 epochs, the DNN validation curves become approximately linear.
Each experiment involved the monitoring of important features, with the AI algorithms in Table 3 marked with the symbols “*” for important features and “-” for weak features.
We have described eight experiments that provide insights into gradual training ML and DNN models with false data. A general observation concerning all experiments is related to the correlation between pairwise features. Figure 3a, Figure 4a, Figure 5a, Figure 6a, Figure 7a, Figure 8a, Figure 9a and Figure 10a show the correlation matrices. Starting with Experiment #2, the data were corrupted, and the false features remained uncorrelated with the original features.
A summary of the important features is shown in Table 3, where perimeter and area predominate among the important features. The corrupted data influenced the selection of important features; throughout the experiments, the false features never became important features.
The training of DNN for a sufficient number of epochs was performed, and the trained DNN made predictions on eight datasets. By analyzing Figure 3b, Figure 4b, Figure 5b, Figure 6b, Figure 7b, Figure 8b, Figure 9b and Figure 10b, it can be noticed that in Experiment #7, the loss increases a lot, by over 0.37 (see Figure 10b), and the accuracy decreases considerably. After this experiment, it was futile to continue with others, because the corrupted data very much affected the training of DNN. The study continues with an analysis of the results provided by ML, and these are shown in Table 4.
In parallel with the empirical results obtained in the eight experiments, only sets 2 to 7 were affected by the two types of noise, in keeping with the trend from Table 1. Set 1 and set 8 were kept unchanged because they contain only raw data that, in Experiments #1 to #8, were not replaced with false data. Table 4 stores the values of the four metrics, the confusion matrices, and the computation times obtained in the binary classification with the four machine learning algorithms and three neural networks; Table 5 and Table 6 report the corresponding results when the features and the target feature, respectively, are affected by noise.
For an easy interpretation of the experimental data and comparison of the experiments, the accuracy, AUC, MCC, and F1-score values are shown in Figure 11, Figure 12 and Figure 13. The metrics for each AI algorithm and experiment are shown. Each figure highlights the number of experiments and computed metrics.
Upon examining the graphs in Figure 11 and the values in Table 4, CNN ranks highest among the AI algorithms used in the suggested experiments, achieving an accuracy of 0.982 in Experiments #1, #2, and #3. The second classifier, DNN, attained an accuracy of 0.965 in Experiments #4, #6, and #7. DNN and KNN produced favorable results, with an accuracy of 0.965 in Experiments #4 and #7. The Transformer exhibited the lowest performance across all experiments, with an accuracy of 0.912. Both CNN and DNN showed deficiencies in Experiment #8, where the feature count was lower than in the other experiments. RF, CNN, and XGB required the longest training times, while RF, CNN, and SVM showed the shortest inference times on the test sets.
The outcomes acquired in this instance are documented in Table 4 and Figure 12. Upon analyzing the metric values presented by all proposed AI algorithms, the following findings emerged: The CNN yielded the highest results for all experiments (accuracy of 0.982). At the same time, the SVM exhibited equivalent performance in Experiment #n7. Subsequently, DNN and RF achieved an accuracy of 0.965 for Experiments #t2, #t4, and #t4 and for Experiments #t3 and #t4, respectively. The previously analyzed datasets and these datasets indicate that the transformer and XGB yield suboptimal outcomes.
Table 5 and Figure 13 show the results obtained for the set-ti datasets and Experiments #ti, where i = 2, 3, …, 7, in which noise affected the labels. The results show that the classification of these data failed; the highest accuracy is around 0.5. This is clear proof that this type of noise leads to unsatisfactory results. The MCC metric, which is negative for the majority of classifiers, indicates near-total disagreement between predictions and observations.
Regarding computational time, the traditional ML methods have lower training and inference times, while the Deep Learning methods (DNN, CNN, and Transformer) have a high computational cost. CNN and DNN have good accuracy across the experiments, with DNN having a higher computational cost. RF, SVM, and KNN have low computational times and good accuracy under feature noise. Among the Deep Learning models, the Transformer has the best results for the noise-on-target experiments, but at a high computational cost; among the traditional ML models, KNN also has very good results.

7.9. Hyperparameters Optimization

Hyperparameter optimization is an important step in building efficient machine learning models. Model parameters are learned during training, whereas hyperparameters are set before training and influence model performance. The goal of hyperparameter optimization is to find the combination of settings that maximizes the model's accuracy, minimizes its error, and ensures robustness. In this study, hyperparameter optimization was applied to all presented algorithms: DNN, RF, XGB, SVM, KNN, CNN, and Transformer. Each algorithm has its own set of hyperparameters that directly impact its learning process, generalization ability, and computational efficiency. This step aims to improve the models' ability to classify data accurately, even in scenarios in which corrupted data and noise on features and targets are present. In the optimization process, various values were explored for each hyperparameter.

7.9.1. DNN Hyperparameters Optimization

DNN hyperparameters control the architecture and learning process of the neural network, influencing its ability to generalize and adapt to different datasets. The parameters tested included the learning rate, which determines the step size in the gradient descent optimization, with values of [0.0001, 0.001, 0.01], affecting how quickly or smoothly the model converges. Additionally, dropout_rate_1 and dropout_rate_2, representing the fractions of neurons randomly deactivated during training to prevent overfitting, were tested with values of [0.1, 0.2, 0.3, 0.4, 0.5] to evaluate their impact on model robustness and generalization. Experiments #1 to #8 tested the model on various dataset versions, showing strong performance with 0.947–0.982 accuracy, where a higher learning rate (0.01) improved results in some cases. Feature noise experiments (#n2–#n7) caused performance drops (0.702–0.982), with higher dropout not significantly mitigating the noise, although certain datasets still achieved high accuracy. Label noise experiments (#t2–#t7) severely impacted accuracy (0.316–0.596), with most results around 0.333, indicating that the model struggled with incorrect labels regardless of hyperparameter tuning. We can see that lower dropout rates and moderate learning rates generally performed well, and high learning rates were effective for noisy feature datasets. Noise on the labels significantly impacted model accuracy and showed the limitations of optimal hyperparameters when dealing with noise on the targets.
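A sketch of this search with Keras Tuner (listed in the software environment above) is given below; the layer widths, the fixed input size of ten features, and the choice of RandomSearch as the tuner are assumptions.

```python
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    """Search space from the text: learning rate and two dropout rates."""
    model = keras.Sequential([
        layers.Input(shape=(10,)),  # the ten "_mean" features; adjust for set 8
        layers.Dense(64, activation="relu"),
        layers.Dropout(hp.Choice("dropout_rate_1", [0.1, 0.2, 0.3, 0.4, 0.5])),
        layers.Dense(32, activation="relu"),
        layers.Dropout(hp.Choice("dropout_rate_2", [0.1, 0.2, 0.3, 0.4, 0.5])),
        layers.Dense(1, activation="sigmoid"),
    ])
    lr = hp.Choice("learning_rate", [0.0001, 0.001, 0.01])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20)
# tuner.search(X_train, y_train, epochs=200, validation_split=0.3)
```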

7.9.2. RF Hyperparameters Optimization

RF hyperparameters optimization focused on controlling tree construction and behavior in the ensemble. The tested parameters included n_estimators (number of trees: 50, 100, 200), criterion (split quality: "gini" for impurity reduction, "entropy" for maximizing information gain), and max_depth (tree depth: None, 10, 20). For Experiments #1 to #8, accuracy remained high (0.947–0.982), with entropy-based models and deeper trees performing slightly better in some cases. Notably, Experiment #6 achieved the highest accuracy (0.982) using entropy, max_depth = 20, and n_estimators = 50, while Experiments #5 and #7 also performed well at 0.965. For feature noise experiments (#n2–#n7), accuracy remained stable (0.947–0.965), indicating RF's robustness to feature noise. However, label noise experiments (#t2–#t7) showed significant performance degradation (0.351–0.439), with the best result at 0.439 in Experiment #t5 using entropy, no depth restriction, and n_estimators = 50, highlighting the model's sensitivity to mislabeled data. The best hyperparameters varied depending on the type of data corruption. Higher n_estimators and a maximum depth of 20 performed well in most scenarios, and tailored tuning for specific datasets further improved results. This shows that RF is adaptable to noisy datasets.
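A grid search over these values with scikit-learn could look as follows; the cross-validation setting (cv = 5) and the accuracy scoring are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid of the RF hyperparameter values listed above.
param_grid = {
    "n_estimators": [50, 100, 200],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 10, 20],
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                         param_grid, scoring="accuracy", cv=5)
# rf_search.fit(X_train, y_train)
# print(rf_search.best_params_, rf_search.best_score_)
```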

7.9.3. XGB Hyperparameters Optimization

XGB hyperparameters optimization focused on fine-tuning max_depth (3, 5, 7), learning_rate (0.01, 0.1, 0.2), and n_estimators (50, 100, 200) to enhance gradient boosting performance. For Experiments #1 to #8, accuracy remained consistently high (0.930–0.965), with lower max_depth (3–5) and moderate learning rates (0.1–0.2) performing best. Notably, Experiments #4, #5, and #6 achieved 0.965 accuracy, indicating that max_depth = 5 with a learning rate of 0.1 or 0.2 is optimal. For feature noise experiments (#n2–#n7), accuracy stabilized at 0.965, suggesting XGB's strong resilience to corrupted features. However, label noise experiments (#t2–#t7) suffered a drastic drop to 0.368 across all cases, highlighting XGB's vulnerability to incorrect labels, even with deeper trees and adjusted learning rates. XGB handled corrupted features well in most cases, but did not perform well in the experiments with label noise. The optimal combination of a moderate max_depth, learning_rate, and sufficient n_estimators provided strong results in most scenarios, while the model's performance remained low in the label noise experiments. These results show that XGB is adaptable to noisy tabular data when appropriately optimized.

7.9.4. SVM Hyperparameters Optimization

SVM hyperparameters optimization focused on tuning C (0.1, 1, 10) for regularization, gamma (0.1, 0.5, 1) for the kernel coefficient, and the kernel type ("linear", "rbf") to optimize decision boundaries. For Experiments #1 to #8, C = 10, gamma = 0.1, and a linear kernel consistently delivered high accuracy (0.947–0.965), with Experiments #2, #3, and #7 reaching 0.965, indicating that a higher C value provided better generalization. For feature noise experiments (#n2–#n7), SVM remained stable, achieving up to 0.982 in Experiment #n7, showing robustness against feature corruption. However, in label noise experiments (#t2–#t7), accuracy dropped significantly to 0.509, with the rbf kernel performing equally across all cases, demonstrating SVM's sensitivity to mislabeled data despite different hyperparameter settings. SVM showed strong performance across most experiments with false data and feature noise, where the combination of C = 10, gamma = 0.1, and a linear kernel provided high accuracy. SVM is sensitive to label noise, with low performance when the target labels are corrupted.

7.9.5. KNN Hyperparameters Optimization

KNN hyperparameters optimization focused on tuning n_neighbors (3, 5, 7), weights (“uniform”, “distance”), and p (1, 2) for distance metric power to enhance classification performance. For Experiments #1 to #8, n_neighbors = 5 or 7, p = 1, and “distance” weights consistently produced high accuracy (0.947–0.965), with Experiments #3 to #7 reaching 0.965, indicating that reducing neighbors slightly improved performance. For feature noise experiments (#n2–#n7), accuracy remained stable (0.947–0.965), demonstrating KNN’s robustness against feature corruption. However, in label noise experiments (#t2–#t7), accuracy dropped to 0.561 across all cases, showing some resilience, but still a significant decline due to mislabeled data. KNN performed well in experiments with feature noise with the combination of n_neighbors = 5, distance weight function, and p = 1. Performance was not good in experiments with label noise and showed sensitivity to noisy target data.

7.9.6. CNN Hyperparameters Optimization

CNN hyperparameters optimization focused on kernel_size (3, 5, 7), dropout_rate (0.1–0.5), and learning_rate (0.0001, 0.001, 0.01) to enhance spatial and hierarchical pattern learning in tabular data. For Experiments #1 to #8, CNN achieved high accuracy (0.877–1.000), with kernel_size = 3 or 5, dropout_rate = 0.3–0.5, and learning_rate = 0.01 yielding optimal results. Experiment #1 reached 1.000 accuracy, showing that a kernel_size of 5 and a dropout_rate of 0.4 were particularly effective. For feature noise experiments (#n2–#n7), CNN remained robust (0.965–1.000 accuracy), indicating strong resilience to corrupted features. However, label noise experiments (#t2–#t7) significantly reduced accuracy (0.404–0.491), demonstrating CNN's sensitivity to mislabeled data despite dropout tuning and learning rate adjustments. CNN showed very good performance in the experiments with corrupted features, with kernel_size = 3 and dropout_rate = 0.2; it is robust to feature noise, but its performance on label noise is low.

7.9.7. Transformer Hyperparameters Optimization

Transformer hyperparameters optimization focused on dropout_rate (0.1–0.5) and learning_rate (0.001, 0.01, 0.1) for both the embedding extraction and dense network stages. For Experiments #1 to #8, accuracy remained between 0.895 and 0.947, with dropout_rate = 0.1 and learning_rate = 0.01 yielding the best results (0.947 in Experiment #6). Overall, a lower dropout rate (0.1) performed more consistently across different dataset versions. For feature noise experiments (#n2–#n7), accuracy remained stable at 0.895–0.912, indicating that the Transformer was resilient to feature corruption but did not show major improvements with different dropout settings. However, in label noise experiments (#t2–#t7), accuracy dropped drastically to 0.474–0.509, proving that Transformers are highly sensitive to mislabeled data, with no significant improvements from hyperparameter tuning. The Transformer model demonstrated good performance across the experiments with corrupted features and high accuracy with the optimal settings of dropout_rate = 0.1 and learning_rate = 0.01; it is robust to feature noise, but its performance on label noise is low.

7.10. Discussion

In Experiment #1, the ML and DNN models were fed with the original data; the expectation was that the metrics in Table 3 would decrease as the content of the features became affected by corrupted data. Paradoxically, the results obtained by RF, XGB, and DNN for Experiments #4, #5, #6, and #7 were very good, although the datasets were affected considerably by false data. Notable results were given by Experiment #8, when the RF, XGB, and DNN were supplied with only original data and the corrupted features were eliminated (set 8). In this case, the DNN and XGB give the worst classification values for Acc (0.895), F1-score (0.923), and AUC (0.868). However, the classification depends more on the important features, and as shown in Table 3, these are area and perimeter, with the exception of Experiments #1, #2, and #3 for the XGB classifier. In all experiments, the RF has the best stability in the classification process, an aspect confirmed by the MCC metric. The poorest results were obtained by DNN and XGB in Experiment #8.
Based on the above eight experiments, we can answer the questions RQ1–RQ5, and the answers are marked as RA1-RA5.
RA1: Regarding RQ1, we notice that the corrupted data are not correlated with the original data; in Figure 3a, Figure 4a, Figure 5a, Figure 6a, Figure 7a, Figure 8a and Figure 9a, where the correlation matrices are shown, the correlations between these features are very weak positive or very weak negative correlations.
RA2: The answer to RQ2 is that false data can influence the classifiers in choosing important features, as seen in Table 2. When the concave points feature is replaced with false data in Experiment #4, it is no longer selected as an important feature in the subsequent experiments. In Experiment #5, the concavity feature is replaced with corrupted data and is likewise no longer selected in the subsequent experiments. In Experiment #8, when all false data are eliminated, area and perimeter remain the important features. In other words, a feature loses its importance when it is replaced with false data; this observation is extracted from Experiments #1, #2, #3, and #4 for the XGB classifier.
RA3: The classification with DNN is not affected by false data; in Experiments #6 and #7, the accuracy increased, and the method is validated by the F1-score and AUC. Thus, a DNN can tolerate 60% (six out of ten features) of the data being corrupted. Experiment #8 gave the worst result, when the false features were removed; therefore, as a conclusion, the DNN learned better when the input set contained false data.
RA4: Table 2 and Table 3 show that false data have the same effect as in the DNN case; XGB is less vulnerable, and its accuracy decreases when the amount of corrupted data reaches 60%, as in Experiment #7. The accuracy of this classifier drops as the percentage of false data increases, with Experiment #3 being an exception.
RA5: The noise applied to data has a significant impact on the overall performance of ML algorithms. If we compare the results obtained for feature noise with target noise, the latter category influences the classification more than the first category. Thus, it is shown in Table 5 that the corruption of target features leads to low accuracy for all ML classifiers and neural networks.
RA6: The metrics used are Acc, F1-score, MCC, and AUC, which are derived from the confusion matrices. These are sufficient because they measure a model's performance and its ability to distinguish between classes.
Preliminary solutions: The study employed a comprehensive methodology to illustrate the sensitivity of AI systems by incrementally introducing false data, feature noise, and label noise. The behaviors of the AI algorithms in the planned experiments are discussed below.
An overall observation indicates that false data significantly impact classification performance, yielding an average accuracy of 0.936 across experiments and classifiers. Compared to the feature noise data (average accuracy of 0.942), label noise leads to misclassification (average accuracy of 0.432). In the last experiments, #8 (when all manipulated features are eliminated) and #7t (which contained noisy features), all proposed AI algorithms provided low accuracy.
The Transformer Neural Network is an underperforming classifier for all dataset categories. It performed the worst when fed with set-n6 and set 8 (accuracy of 0.877). The XGB classifier also performs poorly for set 8 (accuracy of 0.895) and set-n7 (accuracy of 0.912). KNN had a uniform behavior; this classifier gives the same accuracy for all set-ni, where i = 2, 3, …, 7, and for the sets with false data (accuracy of 0.930). The RF had the same behavior (accuracy of 0.947), except for set-n3 and set-n4 (accuracy of 0.945).
CNN is the most robust classifier for feature noise data. For all set-ni, i = 2, 3, …, 7, the CNN provides the best accuracy of 0.982, not being affected by the quantity of noise. When the CNN was fed with false data, the accuracy decreased proportionally with the quantity of false data; the sets and accuracies are as follows: (sets 1, 2, 3; 0.985), (set 4; 0.965), (set 5; 0.947), and (set 6; 0.93). DNN follows the performance of the CNN; it provides an accuracy of 0.965 for set-n2, set-n4, and set-n7, as well as for set 4, set 6, and set 7.
Future directions: One future study may corroborate the models’ robustness using alternative datasets to evaluate the consistency of the findings across different fields.
Other future directions could be focused on refining explanations computed using existing XAI algorithms, such as the Local Interpretable Model Agnostic Explanation (LIME) and SHapley Additive exPlanations (SHAP).
Limitations: Our study is built on an original idea, which we could not compare with others from the scientific literature. While many ML models are applied in various fields, they can fail and/or be unreliable, which is a limitation of our study. Another limitation is related to the number of ML models used; other ML models may be more vulnerable to synthetic data, so in future research, we propose the use of AutoML and neural networks.

8. Conclusions

We conducted this study because medical datasets need to be properly and thoroughly reviewed. In this study, we have presented a detailed description of false and noise data and their impacts on diagnostic classification in the BCWD.
The results show that, in contrast to what we would have expected, the behavior of the ML models and neural networks is sometimes better when the amount of false data grows or when noise is added to the labels and features. A notable observation is that the CNN is the most robust AI algorithm, since it consistently yields superior results in the presence of false data and feature noise in the majority of experiments, according to the provided criteria. The proposed AI algorithms fail to classify data with label noise. Our contributions offer new perspectives on the migration of important features and on corrupted data, providing novel explanations for the behavior of ML models and DNNs within the setting of high-quality data.

Author Contributions

Conceptualization, S.M. and D.M.; methodology, S.M. and D.M.; software, S.M. and D.M.; validation, S.M., D.M. and C.S.; formal analysis, S.M.; investigation, S.M.; resources, C.S.; data curation, C.S.; writing—original draft preparation, S.M.; writing—review and editing, D.M.; visualization, S.M.; supervision, S.M.; project administration, S.M.; funding acquisition, C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by scientific research contract 825/30.09.2024 “Research Study on the Use of AI Techniques for Sensitivity Testing with Synthetic Data” from Dunarea de Jos University of Galati. This work was supported by scientific research contract 826/30.09.2024 “Research Study on the Applicability of Explainable Artificial Intelligence and Synthetic Data in the Fields of Medicine or Agriculture” from Dunarea de Jos University of Galati.

Data Availability Statement

Datasets are available at https://github.com/simonamoldovanu/corrupted_data/issues/1 (accessed on 12 January 2025).

Acknowledgments

This work was supported by scientific research contract 825/30.09.2024 “Research Study on the Use of AI Techniques for Sensitivity Testing with Synthetic Data” from Dunarea de Jos University of Galati. This work was supported by scientific research contract 826/30.09.2024 “Research Study on the Applicability of Explainable Artificial Intelligence and Synthetic Data in the Fields of Medicine or Agriculture” from Dunarea de Jos University of Galati.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pichugin, Y.A.; Malafeyev, O.A.; Rylow, D.; Zaitseva, I. A statistical method for corrupt agents detection. In AIP Conference Proceedings; AIP Publishing: Melville, NY, USA, 2018. [Google Scholar]
  2. Zhu, Z.; Dong, Z.; Liu, Y. Detecting corrupted labels without training a model to predict. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27412–27427. [Google Scholar]
  3. Rankin, D.; Black, M.; Bond, R.; Wallace, J.; Mulvenna, M.; Epelde, G. Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing. JMIR Med. Inform. 2020, 8, e18910. [Google Scholar] [CrossRef]
  4. Torfi, A.; Fox, E.A.; Reddy, C.K. Differentially private synthetic medical data generation using convolutional GANs. Inf. Sci. 2022, 586, 485–500. [Google Scholar] [CrossRef]
  5. Goncalves, A.; Ray, P.; Soper, B.; Stevens, J.; Coyle, L.; Sales, A.P. Generation and evaluation of synthetic patient data. BMC Med. Res. Methodol. 2020, 20, 108. [Google Scholar] [CrossRef]
  6. Banville, H.; Wood, S.U.; Aimone, C.; Engemann, D.A.; Gramfort, A. Robust learning from corrupted EEG with dynamic spatial filtering. NeuroImage 2022, 251, 118994. [Google Scholar] [CrossRef] [PubMed]
  7. Budach, L.; Feuerpfeil, M.; Ihde, N.; Nathansen, A.; Noack, N.; Patzlaff, H.; Harmouch, H.; Naumann, F. The Effects of Data Quality on Machine Learning Performance. arXiv 2022, arXiv:2207.14529. [Google Scholar]
  8. Wu, Z.; Rincon, D.; Luo, J.; Christofides, P.D. Machine learning modeling and predictive control of nonlinear processes using noisy data. AIChE J. 2021, 67, e17164. [Google Scholar] [CrossRef]
  9. Alhajeri, M.S.; Abdullah, F.; Wu, Z.; Christofides, P.D. Physics-informed machine learning modeling for predictive control using noisy data. Chem. Eng. Res. Des. 2022, 186, 34–49. [Google Scholar] [CrossRef]
  10. Lee, Y.; Barber, R.F. Binary classification with corrupted labels. Electron. J. Stat. 2022, 16, 1367–1392. [Google Scholar] [CrossRef]
  11. Feldman, S.; Einbinder, B.S.; Bates, S.; Angelopoulos, A.N.; Gendler, A.; Romano, Y. Conformal prediction is robust to dispersive label noise. In Proceedings of the Conformal and Probabilistic Prediction with Applications, Limassol, Cyprus, 13–15 September 2023; Volume 186, pp. 34–49. [Google Scholar]
  12. Hendrycks, D.; Mazeika, M.; Wilson, D.; Gimpel, K. Using trusted data to train deep networks on labels corrupted by severe noise. Adv. Neural Inf. Process. Syst. 2018, 31, 10456–11046. [Google Scholar]
  13. Kadhim, R.R.; Kamil, M.Y. Comparison of breast cancer classification models on Wisconsin dataset. Int. J. Reconfigurable Embed. Syst. 2022, 2089, 4864. [Google Scholar] [CrossRef]
  14. Mohammad, W.; Teete, R.; Al-Aaraj, H.; Rubbai, Y.; Arabyat, M. Diagnosis of breast cancer pathology on the Wisconsin dataset with the help of data mining classification and clustering techniques. Appl. Bionics Biomech. 2022, 9, 6187275. [Google Scholar] [CrossRef] [PubMed]
  15. Abdulkareem, S.A.; Abdulkareem, Z.O. An evaluation of the Wisconsin breast cancer dataset using ensemble classifiers and RFE feature selection. Int. J. Sci. Basic Appl. Res. 2021, 55, 67–80. [Google Scholar]
  16. El-Shair, Z.A.; Sánchez-Pérez, L.A.; Rawashdeh, S.A. Comparative Study of Machine Learning Algorithms Using a Breast Cancer Dataset. In Proceedings of the 2020 IEEE International Conference on Electro Information Technology (EIT), Chicago, IL, USA, 31 July–1 August 2020; pp. 500–508. [Google Scholar]
  17. Sujon, M.A.H.; Mustafa, H. Comparative Study of Machine Learning Models on Multiple Breast Cancer Datasets. Int. J. Adv. Sci. Comput. Eng. 2023, 5, 15–24. [Google Scholar] [CrossRef]
  18. Hernández-Julio, Y.F.; Díaz-Pertuz, L.A.; Prieto-Guevara, M.J.; Barrios-Barrios, M.A.; Nieto-Bernal, W. Intelligent fuzzy system to predict the Wisconsin breast cancer dataset. Int. J. Environ. Res. Public Health 2023, 20, 5103. [Google Scholar] [CrossRef]
  19. Jony, A.; Arnob, A.K. Deep Learning Paradigms for Breast Cancer Diagnosis: A Comparative Study on Wisconsin Diagnostic Dataset. Malays. J. Sci. Adv. Technol. 2024, 4, 109–117. [Google Scholar] [CrossRef]
  20. Mushtaq, Z.; Qureshi, M.F.; Abbass, M.J.; Al-Fakih, S.M.Q. Effective kernel-principal component analysis based approach for Wisconsin breast cancer diagnosis. Electron. Lett. 2023, 59, e212706. [Google Scholar] [CrossRef]
  21. Street, W.N.; Wolberg, W.H.; Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. In Proceedings of the IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, San Jose, CA, USA, 31 January–5 February 1993; Volume 1905, pp. 861–870. [Google Scholar]
  22. Hu, J.; Szymczak, S. A review on longitudinal data analysis with random forest. Briefings Bioinform. 2023, 24, bbad002. [Google Scholar] [CrossRef] [PubMed]
  23. Tabacaru, G.; Moldovanu, S.; Răducan, E.; Barbu, M. A Robust Machine Learning Model for Diabetic Retinopathy Classification. J. Imaging 2024, 10, 8. [Google Scholar] [CrossRef] [PubMed]
  24. Damian, F.A.; Moldovanu, S.; Moraru, L. Melanoma detection using a random forest algorithm. In Proceedings of the 2022 E-Health and Bioengineering Conference (EHB), Iasi, Romania, 17–18 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–4. [Google Scholar]
  25. Khan, M.S.; Nath, T.D.; Hossain, M.M.; Mukherjee, A.; Hasnath, H.B.; Meem, T.M.; Khan, U. Comparison of multiclass classification techniques using dry bean dataset. Int. J. Cogn. Comput. Eng. 2023, 4, 6–20. [Google Scholar]
  26. Bansal, M.; Goyal, A.; Choudhary, A. A comparative analysis of K-Nearest Neighbour, Genetic, Support Vector Machine, Decision Tree, and Long Short Term Memory algorithms in machine learning. Decis. Anal. J. 2022, 3, 10007. [Google Scholar]
  27. Oluleye, B.I.; Chan, D.W.; Antwi-Afari, P. Adopting Artificial Intelligence for enhancing the implementation of systemic circularity in the construction industry: A critical review. Sustain. Prod. Consum. 2022, 35, 509–524. [Google Scholar] [CrossRef]
  28. Trifan, L.S.; Moldovanu, S. Analyzing deep learning algorithms with statistical methods. Syst. Theory Control Comput. J. 2024, 4, 9–14. [Google Scholar] [CrossRef]
  29. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  31. Starmans, M.P.; van der Voort, S.R.; Tovar, J.M.C.; Veenland, J.F.; Klein, S.; Niessen, W.J. Radiomics: Data mining using quantitative medical image features. In Handbook of Medical Image Computing and Computer Assisted Intervention; Elsevier: Amsterdam, The Netherlands, 2020; pp. 429–456. [Google Scholar]
  32. Baak, M.; Koopman, R.; Snoek, H.; Klous, S. A new correlation coefficient between categorical, ordinal and interval variables with pearson characteristics. Comput. Stat. Data Anal. 2020, 152, 107043. [Google Scholar] [CrossRef]
  33. Ratner, B. The correlation coefficient: Its values range between +1/−1, or do they? J. Target. Meas. Anal. Mark. 2009, 17, 139–142. [Google Scholar] [CrossRef]
  34. Chicco, D.; Tötsch, N.; Jurman, G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min. 2021, 14, 13. [Google Scholar] [CrossRef]
Figure 1. The flowchart of the proposed method.
Figure 2. BCWD in academic publishing platforms.
Figure 3. Experiment #1: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 4. Experiment #2: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 5. Experiment #3: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 6. Experiment #4: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 7. Experiment #5: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 8. Experiment #6: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 9. Experiment #7: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 10. Experiment #8: (a) correlation matrix; (b) loss and accuracy curves for training the DNN; and (c) feature importance.
Figure 11. Graph of ACC, F1-score, and AUC across the eight experiments with false data for KNN, XGB, RF, SVM, DNN, CNN, and Transformer.
Figure 12. Graph of ACC, F1-score, and AUC across the six experiments with noise on feature data for KNN, XGB, RF, SVM, DNN, CNN, and Transformer.
Figure 13. Graph of ACC, F1-score, and AUC across the six experiments with noise on target data for KNN, XGB, RF, SVM, DNN, CNN, and Transformer.
Table 1. The datasets for testing ML and DNN (- denotes an original BCWD feature; * denotes a feature replaced with corrupted values).
Dataset | Diagnosis | Radius | Texture | Perimeter | Area | Smoothness | Compactness | Concavity | Concave Points | Symmetry | FD
Set 1 | - | - | - | - | - | - | - | - | - | - | -
Set 2 | - | - | - | - | - | - | - | - | - | - | *
Set 3 | - | - | - | - | - | - | - | - | - | * | *
Set 4 | - | - | - | - | - | - | - | - | * | * | *
Set 5 | - | - | - | - | - | - | - | * | * | * | *
Set 6 | - | - | - | - | - | - | * | * | * | * | *
Set 7 | - | - | - | - | - | * | * | * | * | * | *
Set 8 | - | - | - | - | - | these features are excluded
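The corrupted sets in Table 1 can be generated by replacing the selected feature columns with random values. The snippet below is only a minimal sketch under two assumptions not stated in the table: that the features are min-max scaled to [0, 1) (so the random replacements are "subunitary", as in the title of Table 4) and that the replacement values are drawn from a uniform distribution; the column names in the usage comment follow the common BCWD naming and are illustrative only.

import numpy as np
import pandas as pd

def corrupt_features(df: pd.DataFrame, columns, seed=42) -> pd.DataFrame:
    """Return a copy of df in which the given columns are replaced by random values in [0, 1)."""
    rng = np.random.default_rng(seed)
    corrupted = df.copy()
    for col in columns:
        corrupted[col] = rng.random(len(df))  # false values overwrite the raw feature
    return corrupted

# Illustrative usage for Set 3 of Table 1 (Symmetry and FD corrupted); column names are hypothetical.
# bcwd_set3 = corrupt_features(bcwd, ["symmetry_mean", "fractal_dimension_mean"])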
Table 2. The parameters of the ML and NN architectures.
AI Algorithm | Parameters
XGB | max_depth = 3, learning_rate = 0.1, n_estimators = 100, gamma = 0
RF | n_estimators = 100, criterion = 'gini'
SVM | kernel = "linear", gamma = 0.5
KNN | n_neighbors = 3
DNN | Dense (16, activation = 'relu', input_dim = 4, 5, …, 10, depending on experiment); Dense (8, activation = 'relu'); Dense (1, activation = 'sigmoid'); optimization method: Adam
CNN | Conv1D: 32 filters, kernel size 3, activation ReLU. Dense layers: 128, Dropout (0.3), 64, Output (1, sigmoid). Optimizer: Adam (LR = 0.001). Loss: binary cross-entropy. Epochs/batch size: 20/32.
Transformer | Tokenizer/model: bert-base-uncased, max length 32 tokens. Dense layers: 128, Dropout (0.3), 64, Output (1, sigmoid). Optimizer: Adam (LR = 0.001). Loss: binary cross-entropy. Epochs/batch size: 20/32.
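For readers who want to reproduce the configurations in Table 2, the hyperparameters map directly onto standard library calls. The sketch below assumes scikit-learn, XGBoost, and Keras (TensorFlow); the loss and metric passed to compile() are assumptions, since Table 2 only specifies the Adam optimizer for the DNN.

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

xgb = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, gamma=0)
rf = RandomForestClassifier(n_estimators=100, criterion="gini")
svm = SVC(kernel="linear", gamma=0.5)
knn = KNeighborsClassifier(n_neighbors=3)

def build_dnn(input_dim):
    """DNN from Table 2: Dense(16) -> Dense(8) -> Dense(1, sigmoid), trained with Adam."""
    model = Sequential([
        Dense(16, activation="relu", input_dim=input_dim),
        Dense(8, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    # Loss and metric are assumptions; Table 2 only lists the Adam optimizer.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model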
Table 3. Important features throughout the experiments (* marks a feature identified as important).
Experiment | Model | Radius | Texture | Perimeter | Area | Smoothness | Compactness | Concavity | Concave Points | Symmetry | FD
Experiment #1 | RF | - | * | * | - | - | - | - | - | - | -
Experiment #1 | XGB | - | - | - | - | - | - | * | - | - | -
Experiment #2 | RF | * | - | * | * | - | - | * | * | - | -
Experiment #2 | XGB | - | - | - | - | - | - | * | - | - | -
Experiment #3 | RF | * | - | * | * | - | - | * | * | - | -
Experiment #3 | XGB | - | - | - | - | - | - | * | - | - | -
Experiment #4 | RF | * | - | * | * | - | * | - | - | - | *
Experiment #4 | XGB | - | - | * | * | - | * | - | - | - | -
Experiment #5 | RF | * | * | * | * | * | - | - | - | - | -
Experiment #5 | XGB | - | - | * | * | - | - | - | - | - | *
Experiment #6 | RF | * | * | * | * | * | - | - | - | - | *
Experiment #6 | XGB | - | - | * | * | - | - | - | - | - | -
Experiment #7 | RF | * | * | * | * | - | - | - | - | - | *
Experiment #7 | XGB | - | * | * | - | - | - | - | - | - | -
Experiment #8 | RF | - | - | * | * | these features are excluded
Experiment #8 | XGB | - | - | * | * | these features are excluded
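The stars in Table 3 summarize which features the tree-based models rank as important. A minimal sketch of how such a ranking is typically read out of fitted RF and XGB models is given below; the importance threshold of 0.10 is an illustrative assumption, not the criterion used in the study.

import pandas as pd

def top_features(model, feature_names, threshold=0.10):
    """Return features whose importance exceeds the threshold, sorted in descending order."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances[importances > threshold].sort_values(ascending=False)

# Illustrative usage after fitting on one of the sets from Table 1:
# rf.fit(X_train, y_train)
# xgb.fit(X_train, y_train)
# print(top_features(rf, X_train.columns))
# print(top_features(xgb, X_train.columns))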
Table 4. Results for corrupted features (random values for subunitary features).
Experiment | Classifier | Accuracy | F1-Score | AUC | MCC | Confusion Matrix | Training Time (s) | Test Time (s)
Experiment #1 | DNN | 0.93 | 0.95 | 0.916 | 0.877 | [[38 2][2 15]] | 5.674 | 0.104
Experiment #1 | XGB | 0.93 | 0.949 | 0.908 | 0.873 | [[37 3][1 16]] | 0.032 | 0.003
Experiment #1 | RF | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.134 | 0.004
Experiment #1 | SVM | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 1.223 | 0.001
Experiment #1 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #1 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 1.988 | 0.104
Experiment #1 | Transformer | 0.912 | 0.865 | 0.881 | 0.818 | [[36 0][5 16]] | 1.747 | 0.070
Experiment #2 | DNN | 0.947 | 0.962 | 0.932 | 0.841 | [[38 2][1 16]] | 5.608 | 0.104
Experiment #2 | XGB | 0.947 | 0.962 | 0.944 | 0.832 | [[39 1][2 15]] | 0.030 | 0.004
Experiment #2 | RF | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.134 | 0.004
Experiment #2 | SVM | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.774 | 0.001
Experiment #2 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #2 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 1.991 | 0.100
Experiment #2 | Transformer | 0.912 | 0.878 | 0.901 | 0.810 | [[34 2][3 18]] | 1.777 | 0.082
Experiment #3 | DNN | 0.93 | 0.949 | 0.908 | 0.916 | [[37 3][1 16]] | 5.425 | 0.096
Experiment #3 | XGB | 0.93 | 0.95 | 0.916 | 0.795 | [[38 2][2 15]] | 0.029 | 0.003
Experiment #3 | RF | 0.947 | 0.962 | 0.932 | 0.916 | [[38 2][1 16]] | 0.136 | 0.004
Experiment #3 | SVM | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.938 | 0.001
Experiment #3 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #3 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 1.936 | 0.113
Experiment #3 | Transformer | 0.912 | 0.865 | 0.881 | 0.818 | [[36 0][5 16]] | 1.774 | 0.080
Experiment #4 | DNN | 0.965 | 0.975 | 0.958 | 0.877 | [[39 1][1 16]] | 5.476 | 0.097
Experiment #4 | XGB | 0.912 | 0.937 | 0.891 | 0.806 | [[37 3][2 15]] | 0.029 | 0.003
Experiment #4 | RF | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.137 | 0.004
Experiment #4 | KNN | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 1.175 | 0.001
Experiment #4 | SVM | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #4 | CNN | 0.965 | 0.955 | 0.972 | 0.929 | [[34 2][0 21]] | 2.014 | 0.118
Experiment #4 | Transformer | 0.912 | 0.865 | 0.881 | 0.818 | [[36 0][5 16]] | 1.799 | 0.083
Experiment #5 | DNN | 0.947 | 0.962 | 0.932 | 0.916 | [[38 2][1 16]] | 5.682 | 0.097
Experiment #5 | XGB | 0.912 | 0.935 | 0.886 | 0.806 | [[36 4][1 16]] | 0.031 | 0.003
Experiment #5 | RF | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.139 | 0.004
Experiment #5 | KNN | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 1.315 | 0.001
Experiment #5 | SVM | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #5 | CNN | 0.947 | 0.930 | 0.948 | 0.889 | [[34 2][1 20]] | 2.012 | 0.096
Experiment #5 | Transformer | 0.930 | 0.900 | 0.915 | 0.849 | [[35 1][3 18]] | 1.765 | 0.088
Experiment #6 | DNN | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 5.402 | 0.096
Experiment #6 | XGB | 0.912 | 0.935 | 0.886 | 0.759 | [[36 4][1 16]] | 0.032 | 0.004
Experiment #6 | RF | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.142 | 0.004
Experiment #6 | KNN | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 1.599 | 0.001
Experiment #6 | SVM | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #6 | CNN | 0.930 | 0.905 | 0.925 | 0.849 | [[34 2][2 19]] | 1.981 | 0.115
Experiment #6 | Transformer | 0.912 | 0.865 | 0.881 | 0.818 | [[36 0][5 16]] | 1.800 | 0.074
Experiment #7 | DNN | 0.965 | 0.975 | 0.958 | 0.759 | [[39 1][1 16]] | 5.337 | 0.094
Experiment #7 | XGB | 0.895 | 0.932 | 0.868 | 0.759 | [[36 4][2 15]] | 0.034 | 0.003
Experiment #7 | RF | 0.965 | 0.975 | 0.958 | 0.877 | [[39 1][1 16]] | 0.148 | 0.004
Experiment #7 | KNN | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 1.312 | 0.001
Experiment #7 | SVM | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #7 | CNN | 0.895 | 0.842 | 0.867 | 0.774 | [[35 1][5 16]] | 2.012 | 0.115
Experiment #7 | Transformer | 0.895 | 0.833 | 0.857 | 0.782 | [[36 0][6 15]] | 1.837 | 0.084
Experiment #8 | DNN | 0.895 | 0.923 | 0.868 | 0.877 | [[36 4][2 15]] | 5.444 | 0.095
Experiment #8 | XGB | 0.895 | 0.923 | 0.868 | 0.873 | [[36 4][2 15]] | 0.028 | 0.002
Experiment #8 | RF | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.124 | 0.004
Experiment #8 | KNN | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.876 | 0.001
Experiment #8 | SVM | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #8 | CNN | 0.912 | 0.872 | 0.891 | 0.811 | [[35 1][4 17]] | 1.982 | 0.115
Experiment #8 | Transformer | 0.877 | 0.837 | 0.873 | 0.739 | [[32 4][3 18]] | 1.768 | 0.080
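Every entry in Tables 4–6 is a standard binary-classification quantity. The helper below is a minimal sketch, assuming a scikit-learn-style estimator with fit() and predict(); note that AUC is computed here from hard predictions, whereas predicted probabilities could equally be used.

import time
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             matthews_corrcoef, confusion_matrix)

def evaluate(clf, X_train, y_train, X_test, y_test):
    """Fit a classifier and report the metric set used in Tables 4-6."""
    t0 = time.perf_counter()
    clf.fit(X_train, y_train)
    train_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    y_pred = clf.predict(X_test)
    test_time = time.perf_counter() - t0

    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_pred),  # AUC from hard labels; probabilities are an alternative
        "mcc": matthews_corrcoef(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "train_time_s": train_time,
        "test_time_s": test_time,
    }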
Table 5. Results for feature noise and seven AI algorithms.
Experiment | Classifier | Accuracy | F1-Score | AUC | MCC | Confusion Matrix | Training Time (s) | Test Time (s)
Experiment #n2 | DNN | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 5.528 | 0.093
Experiment #n2 | XGB | 0.912 | 0.937 | 0.891 | 0.795 | [[37 3][2 15]] | 0.030 | 0.004
Experiment #n2 | RF | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.133 | 0.004
Experiment #n2 | SVM | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 1.427 | 0.001
Experiment #n2 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #n2 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 2.061 | 0.116
Experiment #n2 | Transformer | 0.930 | 0.900 | 0.915 | 0.849 | [[35 1][3 18]] | 1.758 | 0.085
Experiment #n3 | DNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 5.648 | 0.093
Experiment #n3 | XGB | 0.912 | 0.937 | 0.891 | 0.795 | [[37 3][2 15]] | 0.029 | 0.003
Experiment #n3 | RF | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.133 | 0.004
Experiment #n3 | SVM | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.779 | 0.001
Experiment #n3 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.001 | 0.003
Experiment #n3 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 1.992 | 0.129
Experiment #n3 | Transformer | 0.895 | 0.833 | 0.857 | 0.782 | [[36 0][6 15]] | 1.797 | 0.085
Experiment #n4 | DNN | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 5.449 | 0.099
Experiment #n4 | XGB | 0.912 | 0.937 | 0.891 | 0.795 | [[37 3][2 15]] | 0.029 | 0.003
Experiment #n4 | RF | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.132 | 0.004
Experiment #n4 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.550 | 0.001
Experiment #n4 | SVM | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.001 | 0.003
Experiment #n4 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 2.020 | 0.112
Experiment #n4 | Transformer | 0.912 | 0.865 | 0.881 | 0.818 | [[36 0][5 16]] | 1.807 | 0.085
Experiment #n5 | DNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 5.425 | 0.093
Experiment #n5 | XGB | 0.912 | 0.937 | 0.891 | 0.795 | [[37 3][2 15]] | 0.030 | 0.003
Experiment #n5 | RF | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.131 | 0.004
Experiment #n5 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.822 | 0.001
Experiment #n5 | SVM | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.001 | 0.003
Experiment #n5 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 1.996 | 0.114
Experiment #n5 | Transformer | 0.912 | 0.872 | 0.891 | 0.811 | [[35 1][4 17]] | 1.766 | 0.084
Experiment #n6 | DNN | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 5.379 | 0.095
Experiment #n6 | XGB | 0.912 | 0.937 | 0.891 | 0.795 | [[37 3][2 15]] | 0.029 | 0.003
Experiment #n6 | RF | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.133 | 0.004
Experiment #n6 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 0.560 | 0.001
Experiment #n6 | SVM | 0.965 | 0.975 | 0.958 | 0.916 | [[39 1][1 16]] | 0.001 | 0.003
Experiment #n6 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 1.996 | 0.113
Experiment #n6 | Transformer | 0.877 | 0.837 | 0.873 | 0.739 | [[32 4][3 18]] | 1.809 | 0.083
Experiment #n7 | DNN | 0.947 | 0.963 | 0.944 | 0.873 | [[39 1][2 15]] | 5.403 | 0.095
Experiment #n7 | XGB | 0.912 | 0.937 | 0.891 | 0.795 | [[37 3][2 15]] | 0.036 | 0.004
Experiment #n7 | RF | 0.947 | 0.962 | 0.932 | 0.877 | [[38 2][1 16]] | 0.132 | 0.004
Experiment #n7 | KNN | 0.930 | 0.949 | 0.908 | 0.841 | [[37 3][1 16]] | 1.097 | 0.001
Experiment #n7 | SVM | 0.982 | 0.988 | 0.988 | 0.958 | [[40 0][1 16]] | 0.001 | 0.003
Experiment #n7 | CNN | 0.982 | 0.977 | 0.986 | 0.963 | [[35 1][0 21]] | 2.020 | 0.122
Experiment #n7 | Transformer | 0.912 | 0.865 | 0.881 | 0.818 | [[36 0][5 16]] | 1.793 | 0.084
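The #n2–#n7 experiments in Table 5 perturb an increasing number of feature columns with noise. The routine below is only an illustrative sketch: it assumes additive zero-mean Gaussian noise whose magnitude is a fraction (noise_level) of each column's standard deviation, which may differ from the exact noise model used in the study.

import numpy as np

def add_feature_noise(X, columns, noise_level=0.1, seed=42):
    """Add zero-mean Gaussian noise, scaled by each column's std, to the selected columns of a DataFrame."""
    rng = np.random.default_rng(seed)
    noisy = X.copy()
    for col in columns:
        sigma = noise_level * noisy[col].std()
        noisy[col] = noisy[col] + rng.normal(0.0, sigma, size=len(noisy))
    return noisy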
Table 6. Results for label noise and seven AI algorithms.
Experiment | Classifier | Accuracy | F1-Score | AUC | MCC | Confusion Matrix | Training Time (s) | Test Time (s)
Experiment #t2 | DNN | 0.316 | 0.316 | 0.355 | −0.289 | [[9 29][10 9]] | 5.445 | 0.097
Experiment #t2 | XGB | 0.421 | 0.522 | 0.406 | −0.199 | [[18 20][13 6]] | 0.044 | 0.003
Experiment #t2 | RF | 0.368 | 0.471 | 0.359 | −0.298 | [[16 22][14 5]] | 0.211 | 0.004
Experiment #t2 | SVM | 0.351 | 0.373 | 0.387 | −0.231 | [[11 27][10 9]] | 4.397 | 0.001
Experiment #t2 | KNN | 0.564 | 0.613 | 0.572 | 0.152 | [[19 19][7 12]] | 0.001 | 0.003
Experiment #t2 | CNN | 0.439 | 0.429 | 0.439 | −0.122 | [[13 17][15 12]] | 2.026 | 0.114
Experiment #t2 | Transformer | 0.526 | 0.270 | 0.509 | 0.024 | [[25 5][22 5]] | 1.826 | 0.085
Experiment #t3 | DNN | 0.298 | 0.333 | 0.327 | −0.357 | [[10 28][12 7]] | 5.661 | 0.094
Experiment #t3 | XGB | 0.421 | 0.522 | 0.406 | −0.199 | [[18 20][13 6]] | 0.045 | 0.003
Experiment #t3 | RF | 0.439 | 0.556 | 0.403 | −0.202 | [[20 18][14 5]] | 0.214 | 0.004
Experiment #t3 | SVM | 0.351 | 0.351 | 0.395 | −0.211 | [[10 28][9 10]] | 5.694 | 0.001
Experiment #t3 | KNN | 0.564 | 0.613 | 0.572 | 0.152 | [[19 19][7 12]] | 0.001 | 0.003
Experiment #t3 | CNN | 0.439 | 0.238 | 0.426 | −0.168 | [[20 10][22 5]] | 1.981 | 0.113
Experiment #t3 | Transformer | 0.509 | 0.622 | 0.526 | 0.068 | [[6 24][4 23]] | 1.760 | 0.084
Experiment #t4 | DNN | 0.351 | 0.351 | 0.395 | −0.211 | [[10 28][11 8]] | 5.484 | 0.097
Experiment #t4 | XGB | 0.421 | 0.522 | 0.406 | −0.199 | [[18 20][13 6]] | 0.047 | 0.005
Experiment #t4 | RF | 0.421 | 0.522 | 0.406 | −0.199 | [[18 20][13 6]] | 0.213 | 0.004
Experiment #t4 | KNN | 0.564 | 0.613 | 0.572 | 0.152 | [[19 19][7 12]] | 3.752 | 0.001
Experiment #t4 | SVM | 0.351 | 0.393 | 0.379 | −0.253 | [[12 26][11 8]] | 0.001 | 0.003
Experiment #t4 | CNN | 0.421 | 0.377 | 0.419 | −0.163 | [[14 16][17 10]] | 1.985 | 0.109
Experiment #t4 | Transformer | 0.509 | 0.517 | 0.511 | 0.022 | [[14 16][12 15]] | 1.777 | 0.085
Experiment #t5 | DNN | 0.368 | 0.419 | 0.392 | −0.226 | [[13 25][11 8]] | 5.454 | 0.096
Experiment #t5 | XGB | 0.421 | 0.522 | 0.406 | −0.199 | [[18 20][13 6]] | 0.044 | 0.003
Experiment #t5 | RF | 0.386 | 0.493 | 0.370 | −0.274 | [[17 21][14 5]] | 0.217 | 0.004
Experiment #t5 | KNN | 0.564 | 0.613 | 0.572 | 0.152 | [[19 19][7 12]] | 7.574 | 0.001
Experiment #t5 | SVM | 0.368 | 0.400 | 0.401 | −0.204 | [[12 26][10 9]] | 0.001 | 0.003
Experiment #t5 | CNN | 0.351 | 0.327 | 0.350 | −0.300 | [[11 19][18 9]] | 2.020 | 0.108
Experiment #t5 | Transformer | 0.509 | 0.533 | 0.513 | 0.026 | [[13 17][11 16]] | 1.793 | 0.087
Experiment #t6 | DNN | 0.263 | 0.276 | 0.295 | −0.416 | [[8 30][12 7]] | 5.410 | 0.094
Experiment #t6 | XGB | 0.421 | 0.522 | 0.406 | −0.199 | [[18 20][13 6]] | 0.045 | 0.003
Experiment #t6 | RF | 0.351 | 0.448 | 0.348 | −0.323 | [[15 23][14 5]] | 0.215 | 0.004
Experiment #t6 | KNN | 0.564 | 0.613 | 0.572 | 0.152 | [[19 19][7 12]] | 10.461 | 0.001
Experiment #t6 | SVM | 0.351 | 0.302 | 0.410 | −0.169 | [[8 30][7 12]] | 0.001 | 0.003
Experiment #t6 | CNN | 0.404 | 0.370 | 0.402 | −0.196 | [[13 17][17 10]] | 2.042 | 0.113
Experiment #t6 | Transformer | 0.456 | 0.436 | 0.456 | −0.089 | [[14 16][15 12]] | 1.803 | 0.085
Experiment #t7 | DNN | 0.298 | 0.259 | 0.341 | −0.304 | [[7 31][9 10]] | 5.413 | 0.094
Experiment #t7 | XGB | 0.421 | 0.522 | 0.406 | −0.199 | [[18 20][13 6]] | 0.047 | 0.004
Experiment #t7 | RF | 0.368 | 0.486 | 0.346 | −0.325 | [[17 21][15 4]] | 0.216 | 0.004
Experiment #t7 | KNN | 0.564 | 0.613 | 0.572 | 0.152 | [[19 19][7 12]] | 5.225 | 0.001
Experiment #t7 | SVM | 0.386 | 0.444 | 0.405 | −0.200 | [[14 24][11 8]] | 0.001 | 0.003
Experiment #t7 | CNN | 0.439 | 0.448 | 0.441 | −0.119 | [[12 18][14 13]] | 2.009 | 0.108
Experiment #t7 | Transformer | 0.491 | 0.592 | 0.506 | 0.013 | [[7 23][6 21]] | 1.767 | 0.097
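The #t2–#t7 experiments in Table 6 corrupt the target (diagnosis) labels rather than the features. A minimal sketch of symmetric label noise, in which a fraction noise_rate of the binary labels is flipped at random, is shown below; the exact corruption scheme and the per-experiment rates are those described in the methodology, not fixed here.

import numpy as np

def flip_labels(y, noise_rate=0.2, seed=42):
    """Flip a random fraction of binary labels (0 <-> 1) to simulate label noise."""
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    n_flip = int(noise_rate * len(y_noisy))
    idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy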
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
