1. Introduction
The rapid development of artificial intelligence tools, the widespread use of Internet of Things technologies, and the rapid growth of the computing power of modern hardware satisfy all the prerequisites for the use of intellectual analysis in various applications. It is also facilitated by the collection and preservation of large arrays of different types of data for research [
1].
The data mining methodology includes three main steps: preprocessing the collected data, selecting and applying the optimal machine learning model for their analysis, and evaluating the result [
2].
Data preprocessing is the first, and perhaps the most critical, step in the further analysis of such data. Effectively performing preprocessing tasks is essential to improving the accuracy of classifiers and regressors based on such data [
3,
4,
5]. Numerical data preprocessing tasks include data consolidation, deduplication, data imputation, detection and removal of anomalies and outliers, feature selection, and data normalization.
In this article, we investigate the last of these data processing stages. Data normalization transforms the values of a feature in the initial dataset into a given range. The need for such a step is determined by the possible sensitivity of the selected machine learning model to the magnitude of the feature values. A non-normalized dataset can cause the chosen machine learning model to find false dependencies in the data and, as a result, reduce its efficiency in performing the stated task [
2,
6].
Numerical data normalization is not a new problem, and many approaches to it exist. Several general-purpose methods are widely used and have been implemented in data mining application packages, whereas other, more specialized methods are applied only in particular cases. However, the general problem of selecting the optimal normalization strategy for a specific task, dataset, or machine learning model in order to obtain the highest accuracy remains unsolved.
The modern development of medical diagnostics is primarily based on data mining. There are many reasons for this, such as [
7]:
The existence of historical data of different volumes intended for analysis;
The need to analyze both enormous and tiny datasets that are difficult for humans to handle;
A large number of features that may affect the patient’s diagnosis and are difficult or impossible for doctors to take into account during diagnosis;
Complex, usually hidden, nonlinear interdependencies between the features of a particular dataset, which are very difficult to identify at first glance but are easily identified and taken into account by a specific machine learning model;
The high classification or prediction accuracy of machine learning models, which exclude human factors and subjectivism and can serve as a source of additional information to the doctor.
At the same time, the size, dimensionality, and complexity of such data greatly complicate the application of data mining in various fields of medicine.
Despite this, the number of studies developing new and effective diagnostic technologies based on different types and volumes of information about the patient is growing every day [
8,
9]. All of them use a particular procedure for the normalization of the studied dataset. Selection of the optimal algorithm for or approach to data normalization can increase the performance and classification accuracy of machine learning models [
10,
11]. Such a simple procedure can provide a better machine learning model for medical data mining [
12].
The vast majority of existing data normalization methods transform the columns of the tabular dataset. Such transformations aim to map the values of each feature in the studied dataset into a specific interval while maintaining the overall data distribution. This approach reduces the sensitivity and, as a result, increases the generalizability of the chosen machine learning model. It can also reduce the duration of learning procedures, for example, when the values of significant features are mapped into a small interval (e.g., [0, 1] or [−1, 1]).
However, as noted above, attributes with complex, hidden, and nonlinear interdependencies characterize medical data processing tasks. These should be taken into account in the machine learning model in order to improve the accuracy of intelligent diagnostic systems. However, most of the existing methods do not yield a dataset that considers these features of medical data processing tasks.
This paper aims to develop a new data normalization method that considers the interdependencies between features in a given dataset and their absolute values. The proposed method should increase the classification accuracy of machine learning methods in the case of medical data processing tasks.
The main contributions of this paper can be summarized as follows:
We develop a new two-step method for tabular data normalization that considers the interdependencies between the features of each observation and the absolute values of each of these features. The proposed method reduces the number of extrapolation problems for vectors at a distance from the training sample;
We demonstrate the high efficiency of Decision Tree and Extra Trees classifiers based on the developed data normalization method for both binary and multiclass classification tasks using different medical datasets;
We experimentally establish an increase in the classification accuracy based on several machine learning methods that use the developed two-step data normalization method compared with other existing methods.
The remainder of the paper is structured as follows. Section 2 presents a review and critical analysis of existing work on the normalization of tabular datasets. Section 3 introduces the mathematical basis of five existing data normalization methods and describes the developed two-step data normalization method for the medical domain, together with the algorithmic procedure for its realization and a visualization of the results. Section 4 presents the numerical results of the developed method based on six different classifiers using different sets of medical data to perform binary and multiclass medical diagnostics tasks. Section 5 compares the accuracy of the developed method with that of the existing ones. Our conclusions are presented in Section 6.
2. The State-of-the-Art
Data normalization is one of the primary tasks of data processing. The performance of machine learning algorithms largely depends on how effectively the data are normalized. In particular, ref. [
13] investigated the influence of different normalization methods on the accuracy of classification techniques. Based on numerous experimental studies, the author identified many techniques that provide high accuracy in classification tasks and those that should not be used to perform such tasks.
This section summarizes research on the use of several normalization methods for numerical sets of medical data and their impact on the accuracy of medical data mining techniques.
In [
14], the efficiency of the k-NN classifier was investigated using different normalization methods. In particular, the authors performed experiments on the use of the Min-Max Scaler and the Standard Scaler in the selected algorithm to perform a multiclass classification task. The simulation was performed on one well-known dataset. It was experimentally established that the Min-Max Scaler provided the k-NN classifier with the highest accuracy when performing the classification task on the Iris dataset.
In [
15], the results obtained in the above-mentioned study were extended. In this case, in addition to the two above-mentioned normalization methods, the authors used Decimal normalization. Moreover, the experimental part of the work analyzed the effectiveness of the application of nine machine learning methods. However, empirical studies on several datasets did not allow the authors to single out a data normalization method that would increase the accuracy of all classifiers. The authors found that the classification accuracy when using the three normalization methods varied depending on the selected classifier. The disadvantage of this study is the small number of classifiers used, which made it impossible to summarize the results on the effectiveness of a particular data normalization method.
The authors of [
16] conducted experimental studies on the influence of four data normalization methods on the accuracy of an adaptive neural fuzzy inference system in performing classification and regression tasks. In addition to the three methods mentioned above, the authors used the Robust Scaler and the Max Abs Scaler. The simulation was performed using just one medical dataset. The results demonstrate that the Min-Max Scaler provided the proposed classifier with the highest accuracy when performing the medical diagnostics task. However, experiments on only one dataset do not provide us with the possibility of generalizing the results obtained.
In [
17], the authors investigated the effectiveness of performing a heart disease classification task based on different methods for normalization using nine machine learning algorithms. In particular, the authors used such normalization methods as the Robust Scaler, the Max Abs Scaler, Normalization, the Min-Max Scaler, the Standard Scaler, and the Quantile Transformer. It was experimentally established that none of the normalization methods significantly affected the effectiveness of each of the nine machine learning algorithms. There were two reasons for this. The first one is that the authors used only one dataset in their modeling. The second one is more important. The methods studied in the paper only transform data in columns. Thus, the interdependencies between the features in the studied medical dataset were not taken into account.
The authors in [
18] considered five data normalization methods, including four from a previous study and the Vector Scaler. The basis of this method is that it takes into account the norm of each vector in order to normalize the dataset by rows to overcome the above-mentioned shortcoming. The authors investigated the influence of normalization methods on multi-criteria decision-making tasks. The effectiveness of each of the studied methods was evaluated using the Pearson’s correlation coefficient. The authors found that the Max Abs Scaler was the most acceptable for the stated task.
In [
19], the authors considered the problem of improving the classification accuracy in medical diagnostics tasks by applying an effective data normalization method. In addition to the commonly used techniques employed in the above-mentioned study, the authors drew attention to the accuracy of classifiers that use the Vector Scaler. This was due to the specific characteristics of medical diagnostics tasks, which are significantly different from those of the task performed in the above-mentioned study. Experimental results on two different datasets using three machine learning methods based on decision trees showed a significant increase in the classification accuracy in the case of using the Vector Scaler compared with the other methods. Despite this fact, such an approach does not consider the absolute values of the features in the normalized dataset. This can lead to some ambiguities that, in turn, will reduce the effectiveness of further medical data mining.
In general, most of the published scientific papers on the effect of data normalization on classification accuracy did not use methods that take into account the interdependencies between the attributes of each vector and their absolute values. However, the importance of this problem has been confirmed by many studies in various fields of biology and medicine [
20,
21,
22].
In this paper, we present a new method for the normalization of numerical sets of medical data that has the advantages of the above-mentioned techniques and, at the same time, eliminates the shortcomings of these techniques in order to improve the classification accuracy of classifiers that perform medical diagnostics tasks.
3. Materials and Methods
In this paper, we present a new two-step data normalization method. It is based on the combined use of the Max Abs Scaler and the Vector Scaler, taking into account some significant differences. Therefore, we consider the principles of operation of the most common data normalization methods for numerical datasets when performing medical data mining tasks (
Table 1).
The first and fourth methods are sensitive to outliers, which are typical of medical datasets. Additionally, these are not the best scalers to use if the data are not normally distributed. The Robust Scaler’s centering and scaling statistics are based on percentiles and are therefore not influenced by a few large marginal outliers. The Standard Scaler assumes that the data are normally distributed within each feature, an assumption that real-world medical datasets rarely satisfy. Unit Vector Scaling rescales the whole feature vector to unit length. This usually means dividing each component by the Euclidean length of the vector (i.e., using the L2 norm).
In addition, the first four methods listed in
Table 1 perform only column operations. Accordingly, interdependencies between the features of each vector, which are quite common in medical data, are not considered. The fifth normalization method takes into account this shortcoming. It performs normalization for each vector separately based on the norm of the corresponding vector. However, this method does not consider the absolute values of the normalized dataset.
The two-step data normalization method presented in this paper overcomes these disadvantages.
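As a rough illustration of the five existing methods discussed above, the following NumPy sketch implements each one directly. The function names are illustrative, the Robust Scaler is assumed to use the median and interquartile range (the text only states that it is based on percentiles), and the guards against division by zero are an addition that is not part of the description above.

```python
import numpy as np

def min_max_scale(X):
    """Column-wise: map each feature linearly onto [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard for constant columns
    return (X - lo) / span

def max_abs_scale(X):
    """Column-wise: divide each feature by its largest absolute value."""
    m = np.abs(X).max(axis=0)
    return X / np.where(m > 0, m, 1.0)

def standard_scale(X):
    """Column-wise: zero mean and unit (population) standard deviation."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sd > 0, sd, 1.0)

def robust_scale(X):
    """Column-wise: center on the median, scale by the interquartile range (assumed)."""
    med = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    return (X - med) / np.where(q3 > q1, q3 - q1, 1.0)

def unit_vector_scale(X):
    """Row-wise: divide each observation by its Euclidean (L2) norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms > 0, norms, 1.0)

# Toy data: the last row contains an outlier in the second feature.
X = np.array([[5.0, 6.0], [10.0, 12.0], [1.0, 100.0]])
for fn in (min_max_scale, max_abs_scale, standard_scale,
           robust_scale, unit_vector_scale):
    print(fn.__name__)
    print(np.round(fn(X), 3))
```

Only the last function operates on rows; the other four use column statistics, which is the distinction the text draws between the first four methods and the fifth.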
The Proposed Two-Step Data Normalization Method
The proposed data normalization method considers both the interdependencies between the features of each vector and the absolute values of each of the features in a given medical dataset. The need for this can be explained by the peculiarities of medical diagnostics tasks [
23,
24]. They are characterized by datasets of different volumes, with an asymmetrically represented number of vectors in each problem class. In addition, such datasets are characterized by many additional attributes (e.g., laboratory tests, physician observations) that also have complex, nonlinear, and seemingly unknown interdependencies [
25]. However, considering such interdependencies is essential in diagnosis and therapy or supporting the treatment process [
26]. Existing methods for normalization mainly involve the conversion of data by columns. However, this is insufficient when it is necessary to consider the interdependence between them [
19]. That is why the developed method takes into account the above-mentioned features of medical datasets.
Now, we consider the developed method in more detail. Assume that a medical dataset can be represented as a matrix of features $X$, where each $i$-th vector (row, or observation) can be represented as follows:
$$x_i = (x_{i,1}, x_{i,2}, \dots, x_{i,n}), \qquad (1)$$
where $i = 1, \dots, N$, $N$ is the number of vectors (the number of observations in the matrix $X$), and $n$ is the number of features.
The algorithmic implementation of the proposed two-step data normalization method involves the sequential execution of the following procedures.
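1. Initial normalization of each $j$-th column ($j = 1, \dots, n$) of a given set of tabular data by the maximal absolute value of the elements in that column, according to the following formula:
$$x'_{i,j} = \frac{x_{i,j}}{\max_{1 \le k \le N} |x_{k,j}|}, \qquad (2)$$
where $x_{i,j}$ is the value of the $j$-th feature of the $i$-th vector and $N$ is the number of vectors.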
This step of the proposed method corresponds to normalization according to the second method listed in
Table 1. It can be omitted or replaced by another method that normalizes the data by columns.
Accordingly, as a result of this step, we normalize the entire dataset (if it is one matrix, represented as $X$). If the dataset was divided before normalization into two datasets (a training dataset and a test dataset), then the first step of the algorithm is performed on the training dataset. The test/validation dataset is then normalized using the maximal absolute values of each column that were obtained for the training dataset. The same approach is used for all further steps of the proposed method in the case where the separate normalization of the training and test/validation datasets is needed.
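The train/test handling described above can be sketched as follows. The helper names `fit_max_abs` and `apply_max_abs` are hypothetical, and the zero-column guard is an added assumption.

```python
import numpy as np

def fit_max_abs(X_train):
    """Learn the maximal absolute value of each column from the training data only."""
    col_max = np.abs(X_train).max(axis=0)
    return np.where(col_max > 0, col_max, 1.0)  # avoid dividing by zero

def apply_max_abs(X, col_max):
    """Scale any dataset (training, test, or validation) with the stored maxima."""
    return X / col_max

X_train = np.array([[2.0, -40.0], [4.0, 10.0]])
X_test = np.array([[8.0, 20.0]])

col_max = fit_max_abs(X_train)                # column maxima: 4 and 40
X_train_n = apply_max_abs(X_train, col_max)   # all values fall in [-1, 1]
X_test_n = apply_max_abs(X_test, col_max)     # may leave [-1, 1]: here 8 / 4 = 2
print(X_train_n)
print(X_test_n)
```

Because the statistics come from the training data only, test values can fall outside [−1, 1], as the first test feature does here.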
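The first step of the developed method for normalization by rows involves:
2. Calculation of the norm of each vector using the column-normalized values $x'_{i,j}$ from (2) according to the following expression:
$$\|x'_i\| = \sqrt{\sum_{j=1}^{n} \left(x'_{i,j}\right)^2}; \qquad (3)$$
3. Normalization of each separate vector $x'_i$ from the dataset, taking into account its norm according to the expression:
$$x''_{i,j} = \frac{x'_{i,j}}{\|x'_i\|}. \qquad (4)$$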
As a result, we obtain the normalization of the dataset according to Method 5 from
Table 1. A visualization of the results of the proposed method for the case of a two-dimensional dataset is presented in
Figure 1a.
The main idea is to normalize each vector (row, observation) of a given dataset separately from each other vector. The main advantage is that the normalized dataset considers the interconnections between the attributes of each observation. It is essential that this condition be satisfied in order to improve the efficiency of data mining when performing classification tasks in various fields of medicine.
However, the main disadvantage of this method is that it does not consider the absolute values of each feature. As a result, ambiguities may arise that will significantly affect the performance of the classifiers or regressors that process the dataset in this way.
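The ambiguity can be seen in a minimal NumPy sketch (the two example vectors are chosen for illustration): observations that differ only in magnitude become indistinguishable after row normalization.

```python
import numpy as np

# Two observations with the same feature ratio but different magnitudes.
X = np.array([[5.0, 6.0], [10.0, 12.0]])

# Row-wise normalization by the Euclidean norm (the Vector Scaler idea).
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms

# Both rows collapse onto the same point of the unit circle.
same = np.allclose(X_unit[0], X_unit[1])
print(X_unit)
print("identical after normalization:", same)
```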
We propose a second step of transformation that eliminates the above shortcoming. The second step of the proposed data normalization method transforms the data by rows.
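Then, we expand each row-normalized vector using its corresponding norm:
$$\tilde{x}_i = \left(x''_{i,1}, \dots, x''_{i,n}, \|x'_i\|\right), \qquad (5)$$
where $x''_{i,j}$ are the components of the row-normalized vector and $\|x'_i\|$ is the norm of the column-normalized vector.
As a result, we obtain a new vector with an additional input component $\|x'_i\|$.
We perform for each extended vector (5) transformations similar to the row normalization above:
$$\hat{x}_i = \frac{\tilde{x}_i}{\|\tilde{x}_i\|}, \qquad \|\tilde{x}_i\| = \sqrt{\sum_{j=1}^{n} \left(x''_{i,j}\right)^2 + \|x'_i\|^2}. \qquad (6)$$
In this case, we calculate the norm of each extended vector from (5) and normalize each vector for the second time taking into account its new norm.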
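A short derivation may help show why the added component preserves magnitude information. The shorthand here is introduced only for this note and is not the paper's notation: $r$ denotes the norm of a vector after the column step, $u$ the unit-length vector obtained from the first row normalization, and $\hat{x}$ the final result. Since $\|u\| = 1$, the norm of the extended vector $(u, r)$ depends on $r$ alone:

$$\|(u, r)\| = \sqrt{\|u\|^2 + r^2} = \sqrt{1 + r^2}, \qquad \hat{x} = \left(\frac{u}{\sqrt{1 + r^2}},\; \frac{r}{\sqrt{1 + r^2}}\right).$$

The last component, $r / \sqrt{1 + r^2}$, increases strictly with $r$, so two observations with the same direction $u$ but different magnitudes receive different final coordinates, while $\|\hat{x}\| = 1$ still holds for every observation.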
A visualization of the results of the proposed two-step data normalization method for the case of an initial two-dimensional dataset ($n = 2$) is presented in
Figure 1b.
As a result of this step, we obtain:
A normalized dataset for each column and each row;
A dataset that has been extended by one additional feature compared with the original, non-normalized dataset;
A dataset that considers both the interdependencies between the features of each separate vector and their absolute values.
If we analyze the results of both normalization methods for the case of the initial two-dimensional dataset, we can obtain the following conclusions. The result of the Vector Scaler normalization method (
Figure 1a) is a set of vectors that lie on a circle of unit radius. This method allows for the interdependence between the attributes of a given dataset to be considered but not their absolute value. If we use the proposed two-step data normalization method on rows (
Figure 1b), the obtained set of normalized vectors will lie on a sphere. This is due to the introduction of an additional component in each vector of the two-dimensional data array; therefore, the visualization occurs in three-dimensional space. In this case, the third component reflects the absolute values of the vectors. Consider, for example, two vectors with components (5, 6) and (10, 12). Under the Vector Scaler, the normalized components of both vectors are the same, which reduces the informativeness of the whole dataset and can be a particular problem when the dataset is small. The proposed two-step data normalization method increases the dimensionality of the input data space by adding a third component that reflects the absolute values of the vector components, which allows the selected classifier to separate these two vectors.
Among the apparent consequences of implementing the proposed approach is that the projection on a sphere will reduce the number of extrapolation problems for vectors at a distance from the training sample. Therefore, applying the proposed two-step data normalization method should increase the classifier’s accuracy when performing various medical diagnostics tasks.
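The whole procedure can be sketched in NumPy as follows. This is a minimal reading of the steps described in this section, not the authors' implementation: the function names are hypothetical, the Euclidean norm is assumed throughout, and the guards for all-zero columns and rows are an addition.

```python
import numpy as np

def fit_two_step(X_train):
    """Learn the column maxima used by the initial Max Abs step."""
    col_max = np.abs(X_train).max(axis=0)
    return np.where(col_max > 0, col_max, 1.0)

def transform_two_step(X, col_max):
    """Apply column scaling, row normalization, extension, and re-normalization."""
    Xc = X / col_max                                    # initial column-wise Max Abs scaling
    r = np.linalg.norm(Xc, axis=1, keepdims=True)       # norm of each row
    U = Xc / np.where(r > 0, r, 1.0)                    # first row normalization (unit rows)
    ext = np.hstack([U, r])                             # append the norm as an extra feature
    ext_norm = np.linalg.norm(ext, axis=1, keepdims=True)
    return ext / np.where(ext_norm > 0, ext_norm, 1.0)  # second row normalization

# The two vectors from the text: same direction, different magnitude.
X = np.array([[5.0, 6.0], [10.0, 12.0]])
col_max = fit_two_step(X)
Z = transform_two_step(X, col_max)
print(Z.shape)         # one more column than the input
print(np.round(Z, 4))  # the rows differ, and each has unit length
```

In this toy case the column maxima come from the same two rows, so the first step alone maps them to (0.5, 0.5) and (1, 1). The appended norm is what keeps them apart after both row normalizations.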
4. Modeling and Results
We developed a software solution to implement the two-step data normalization method using Python [
27]. The simulation of the proposed method was performed using several machine learning methods based on decision trees. We used two boosting machine learning methods, bagging and feature bagging methods, and the Decision Tree and Extra Trees methods [
28]. This choice was due to their high accuracy, the possibility of straightforwardly interpreting some of their results, and the widespread use of such methods to perform various technical and medical diagnostics tasks [
29,
30].
The modeling used fixed parameters for each of the machine learning methods used for each of the studied data normalization methods. To easily reproduce the results of this study, we chose the implementation of each of the machine learning methods in the Python library, namely Scikit-learn [
16]. The parameters of the methods used during the modeling are summarized in
Table A1.
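For orientation, the six classifiers named in this paper could be instantiated in Scikit-learn roughly as follows. This is a sketch under assumptions: the hyperparameters are library defaults rather than the values in Table A1, the seed is arbitrary, and mapping "Extra Trees" to `ExtraTreesClassifier` (rather than the single-tree `sklearn.tree.ExtraTreeClassifier`) is a guess.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # arbitrary; fixed only so that repeated runs are comparable

# Default hyperparameters are used here; they are NOT the settings from Table A1.
classifiers = {
    "AdaBoost": AdaBoostClassifier(random_state=SEED),
    "Gradient Boosting": GradientBoostingClassifier(random_state=SEED),
    "Bagging": BaggingClassifier(random_state=SEED),
    "Random Forest": RandomForestClassifier(random_state=SEED),
    "Decision Tree": DecisionTreeClassifier(random_state=SEED),
    "Extra Trees": ExtraTreesClassifier(random_state=SEED),
}

for name, clf in classifiers.items():
    print(name, "->", type(clf).__name__)
```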
The evaluation of the accuracy of the machine learning methods with the proposed data normalization method and the existing data normalization methods was carried out using standard performance indicators. In particular, Accuracy, Precision, Recall, and F1-score were used to assess the effectiveness of the classifiers in performing the tasks [
31,
32].
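The four indicators can be computed from the confusion counts of a binary task, as in the plain-Python sketch below. The function name and the toy labels are illustrative, and the averaging scheme used for the multiclass tasks is not stated in the text, so only the binary case is shown.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, Precision, Recall, and F1-score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # every indicator equals 0.75 for these labels
```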
4.1. Datasets Used for the Modeling
We investigated whether the proposed data normalization method increases the accuracy of classifiers that perform medical diagnostics tasks [
33]. Most of these tasks are formulated as classification tasks with two or more defined classes [
34,
35]. If there are only two classes, which is very common in medical diagnostics tasks, we consider a binary classification task [
36]. If the problem has more than two defined classes, it is a multiclass classification task [
37].
Since both formulations are typical of applied medical diagnostics tasks, we modeled the proposed two-step data normalization method on different medical datasets designed to perform binary and multiclass classification tasks. To do this, we selected three well-known, real-world datasets for the binary classification task and three real-world, well-known datasets for the multiclass classification task. It should be noted that the number of data vectors in each dataset and the number of features in each dataset are different. In addition, the datasets for the multiclass classification task had between three and six classes.
A summary of the datasets used for the modeling and references to the freely available repository where they are located are given in
Table 2.
Each dataset was divided into two datasets: a training dataset (80% of the samples, randomly selected) and a test dataset (the remaining 20% of the samples).
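A random 80/20 split of this kind can be sketched with NumPy as follows. The function name, the seed, and the toy arrays are assumptions; the text does not say whether the split was stratified by class, so a plain random permutation is used.

```python
import numpy as np

def split_80_20(X, y, seed=0):
    """Randomly assign 80% of the samples to training and the rest to testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(round(0.8 * len(X)))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 samples with 2 features each.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = split_80_20(X, y)
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows
```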
4.2. Results
Table 3 summarizes the results of modeling the proposed two-step data normalization method based on the:
Accuracy score;
Precision score;
Recall score; and
F1-score
using the six different machine learning methods for the six datasets.
As shown in
Table 3, the Extra Trees classifier had the highest accuracy among all the methods considered in most cases. Additionally, in some cases, the Bagging classifier and the Decision Tree classifier also performed well. In contrast, the ensemble techniques, the AdaBoost Classifier and the Random Forest Classifier, had low classification accuracy in the stated tasks.
5. Comparison and Discussion
The effectiveness of the proposed data normalization method was evaluated by comparing its accuracy with that of five existing data normalization methods:
Vector Scaler;
Max Abs Scaler;
Min-Max Scaler;
Standard Scaler;
Robust Scaler.
The results (the Accuracy score and the F1-score) of the six machine learning methods on the six datasets using the six data normalization methods are summarized in
Table A2 and
Table A3.
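A comparison of this kind can be organized as in the sketch below. It is not the authors' experimental code: the Scikit-learn breast cancer dataset is only a stand-in for the datasets in Table 2, the classifier uses default parameters, the seed is arbitrary, and `two_step` is a compact, hypothetical rendering of the proposed method. Results from this sketch should not be read as reproducing the reported figures.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)
from sklearn.tree import DecisionTreeClassifier

def two_step(X_fit, X):
    """Hypothetical sketch: Max Abs by columns, then norm-extended row scaling."""
    col_max = np.abs(X_fit).max(axis=0)
    Xc = X / np.where(col_max > 0, col_max, 1.0)
    r = np.linalg.norm(Xc, axis=1, keepdims=True)
    ext = np.hstack([Xc / np.where(r > 0, r, 1.0), r])
    n = np.linalg.norm(ext, axis=1, keepdims=True)
    return ext / np.where(n > 0, n, 1.0)

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset, not from Table 2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

existing = {
    "Vector Scaler": Normalizer(norm="l2"),
    "Max Abs Scaler": MaxAbsScaler(),
    "Min-Max Scaler": MinMaxScaler(),
    "Standard Scaler": StandardScaler(),
    "Robust Scaler": RobustScaler(),
}
# Fit every scaler on the training part only, then transform both parts.
transformed = {name: (s.fit_transform(X_tr), s.transform(X_te))
               for name, s in existing.items()}
transformed["Two-step (sketch)"] = (two_step(X_tr, X_tr), two_step(X_tr, X_te))

scores = {}
for name, (Z_tr, Z_te) in transformed.items():
    clf = DecisionTreeClassifier(random_state=0).fit(Z_tr, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(Z_te))

for name, acc in scores.items():
    print(f"{name:>18}: {acc:.3f}")
```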
Since we considered binary and multiclass classification tasks, the analysis of the results for each was carried out separately.
In performing the binary classification task, some of the machine learning methods, namely the AdaBoost classifier, the Bagging classifier, the Gradient Boosting classifier, and the Random Forest classifier, demonstrated a deterioration in accuracy when using the proposed two-step data normalization method.
The best effect when using the proposed method to perform the binary classification task was obtained by using the two most straightforward machine learning methods (the Decision Tree classifier and the Extra Trees classifier), the results of which are easy to interpret. This advantage is essential when performing medical data mining, where the latest trend is the use of Explainable Artificial Intelligence.
Figure 2 summarizes the results of the use of both these methods (based on the Accuracy score) to perform binary classification tasks using three different datasets and six different data normalization methods.
It should be noted that the numerical values of the Accuracy score accompany the graphical information in two of the six columns. These are the values that were obtained for the proposed method and the most similar method. For the other methods, these numbers are not given so as to not overload the histogram. All accuracy indicators are presented in
Table A2.
Figure 2 summarizes the accuracy of both methods during the completion of the binary classification task using the three different datasets and the six data normalization methods.
As can be seen from
Figure 2, when the dataset from [
38] was analyzed using the Extra Trees classifier, the proposed method showed a 1% increase in accuracy compared with the most similar method (the Vector Scaler) and a 2% increase in accuracy compared with all other normalization methods. Using the Decision Tree classifier, we obtained a 5% increase in accuracy compared with the most similar method and a 3% increase in accuracy compared with all other normalization methods.
In the case of the analysis of the dataset from [
39], the Extra Trees classifier based on the proposed method demonstrated a 5% increase in accuracy compared with all other methods. The Decision Tree classifier, in this case, showed an increase in accuracy of more than 1%.
In the third case, when completing a binary classification task based on the dataset from [
40], the Extra Trees classifier based on the proposed method experienced a 2% reduction in accuracy. The Decision Tree classifier showed a 2% increase in accuracy compared with all other methods.
Taken together, the results of the Decision Tree classifier and the Extra Trees classifier based on the six different data normalization methods for the binary classification task indicate that:
The Max Abs Scaler, Min-Max Scaler, Standard Scaler, and Robust Scaler do not provide a significant difference in the accuracy of the investigated classifiers;
The proposed data normalization method provides an increase in the classification accuracy of 1 to 5% compared with the existing methods;
The proposed data normalization method increases the classification accuracy from 1% to 3% compared with the most similar data normalization method (the Vector Scaler).
Let us now consider the results of the comparison between the proposed method and the existing normalization methods in the case of completing multiclass classification tasks based on the three studied datasets and the six different machine learning methods. The results are presented in
Table A3.
In performing multiclass classification tasks, some machine learning methods, namely the AdaBoost classifier, the Gradient Boosting classifier, and the Random Forest classifier, demonstrated a deterioration in accuracy when using the proposed two-step data normalization method. An increase in accuracy with the proposed method, in this case, was achieved using the Bagging classifier, the Decision Tree classifier, and the Extra Trees classifier.
Figure 3 summarizes the accuracy of these methods based on the F1-score during the completion of multiclass classification tasks using the three different datasets and the six different data normalization methods.
As shown in
Figure 3, when completing a multiclass classification task based on the dataset from [
41], four well-known data normalization methods had almost no effect on the accuracy of the classifiers. Here again, the proposed method and the most similar method stand out. For this dataset (the dataset from [
41]), all three machine learning methods experienced a 1% to 6% increase in accuracy due to the proposed data normalization method. Similar results were obtained for the dataset from [
42]. Here, the Bagging classifier and the Extra Trees classifier demonstrated a significant increase in classification accuracy. This can be explained by the fact that the proposed method increases the number of features in the dataset by one, which results in the jump in the accuracy of the classifiers. However, in the case of using the dataset from [
43], the accuracy of the classifiers using all six data normalization methods is almost the same. Only the Bagging classifier showed an increase in accuracy (1%) when using the proposed method.
It should be noted that all variables in the third dataset are categorical, which may explain the generally low accuracy of the machine learning methods applied for its analysis.
Taken together, the results of the Bagging classifier, the Decision Tree classifier, and the Extra Trees classifier based on six different data normalization methods for multiclass classification tasks indicate that:
The Max Abs Scaler, Min-Max Scaler, Standard Scaler, and Robust Scaler do not significantly affect the accuracy of the investigated classifiers;
The proposed data normalization method provides both a significant (1% to 6%) increase in the accuracy of the classifiers compared with the above-mentioned methods for normalization and the same level of accuracy as the Vector Scaler;
The proposed data normalization method improves the accuracy of the classifiers compared with the most similar data normalization method (the Vector Scaler).
In general, an increase in the accuracy of a classifier of 1% based on only the data normalization method, which is perhaps the first step in data mining, would justify its use in practice. However, increasing the accuracy by 5% in binary classification tasks only by normalizing the data satisfies many prerequisites for using the proposed method in Decision Tree and Extra Trees classifiers that perform various medical diagnostics tasks, particularly in automated robotic systems [
44,
45,
46]. Such a significant increase in the accuracy of the Bagging classifier, the Decision Tree classifier, and the Extra Trees classifier based on the proposed data normalization method in performing multiclass classification tasks also encourages the use of the Extra Trees classifier in practice.
6. Conclusions
In this study, we focused on the problem of effectively preprocessing data to increase the accuracy of intellectual analysis in the case of completing medical diagnostics tasks. We developed a new two-step numerical data normalization method. It is based on the possibility of considering the interdependencies between the features of each observation and the absolute values of each of these features to improve the accuracy of medical data mining techniques.
The proposed approach was modeled using six different classifiers based on machine learning methods for the two cases of binary classification tasks and multiclass classification tasks. Experiments were performed on six real-world, freely available datasets for performing medical diagnostics tasks with different numbers of vectors, attributes, and classes.
We compared the accuracy of the proposed data normalization method with that of five existing methods. It was established that the proposed data normalization method increased the classification accuracy of the Decision Tree classifier and the Extra Trees classifier by 1–5% in the case of performing the binary classification task. In addition, it provided a 1–6% increase in the accuracy of the Bagging classifier, the Decision Tree classifier, and the Extra Trees classifier in the case of performing the multiclass classification task. At the same time, we observed a decrease in the classification accuracy of the AdaBoost classifier, the Gradient Boosting classifier, and the Random Forest classifier when using the proposed normalization method compared with the existing ones in both classification tasks.
Nevertheless, the increase in the accuracy of the Decision Tree classifier and the Extra Trees classifier based only on the proposed data normalization method satisfies all the prerequisites for its use in practice when performing a variety of medical data mining tasks.
Further research will be conducted to assess the accuracy of artificial neural networks [
46,
47,
48], particularly PNN and GRNN, based on the developed two-step data normalization method for the analysis of small datasets. In addition, using the method proposed in this paper, a new data classification method will be developed for imbalanced datasets, including the case in which only the vectors of one class are represented in the dataset and many of the vectors to be recognized have not previously been described. This method will be based on the new committee model of a hypercylinder’s surfaces based on nonlinear SGTM neural-like structures [
49].