Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain

Dataset size is a major concern in the medical domain, where lack of data is a common occurrence. This study investigates the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six widely-used models in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), adaboost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, f-score, specificity, and area under the ROC curve (AUC). Our results indicate that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model that is robust to limited data does not necessarily provide the best performance compared to other models.


Introduction
The success of modern healthcare services, such as automated diagnosis and personalized medicine, is eminently dependent on the availability of datasets. Dataset size is considered a critical property in determining the performance of a machine learning model. Typically, large datasets lead to better classification performance, while small datasets may trigger over-fitting [1][2][3]. In practice, however, collecting medical data faces many challenges due to patients' privacy, lack of cases for rare conditions [4], and organizational and legal obstacles [5,6]. Moreover, even when large datasets are available, training a model on such data requires additional time and computing resources, which may not be available.
Despite continuous debate and effort, there is still no agreed definition of what constitutes a small dataset. For instance, Shawe-Taylor et al. [7] proposed a measure called Probably Approximately Correct (PAC) for identifying the minimum number of samples necessary to meet a desired accuracy. Some research [8] has defined small datasets based on algorithmic information theory. The authors in [9] followed a different approach: they examined previous studies concerned with small datasets, tracked the dataset sizes used, and accordingly defined a size range for small datasets.
Establishing a method to find the trend in small datasets is not only of scientific interest but also of practical importance, and it requires special care when developing machine learning models. Unfortunately, classification algorithms may perform worse when trained with limited-size datasets [2]. This is because small datasets typically contain fewer details, so the classification model cannot generalize patterns in the training data. In addition, over-fitting becomes much harder to avoid, as it sometimes goes beyond the training data to affect the validation set as well [3].
Classification is a challenging task by itself, and it becomes more challenging when dealing with small datasets. The central cause of this challenge is the limited size of the training data, which leads to an unreliable and biased classification model [3]. While previous studies focus on increasing the accuracy of classification algorithms on limited-size datasets, less effort has been made to study the impact of dataset size on the performance of classification algorithms, which makes it an open problem that needs more investigation.
Several studies have recently addressed the issue of small datasets from different perspectives, including enhancing the performance of classification models on limited datasets [8][9][10][11] and proposing various approaches to augment the training set [12][13][14][15][16]. For example, in the former category, the authors in [8] proposed two methods for neural network (NN) training on small datasets using Fuzzy ARTMAP neural networks [10]. In [11], a novel particle swarm optimization-based virtual sample generation (PSOVSG) approach was proposed to iteratively produce the most suitable virtual samples in the search space. The performance of PSOVSG was tested against three other methods and achieved superior results.
In the latter category, Li et al. [12] proposed a non-parametric method for learning trend similarities between attributes and then using them to predict the ranges in which attribute values can fall when other attribute values are provided. Another study [13] generated data based on the Gaussian distribution by exploiting the smoothness assumption, which states that, if two inputs are close to each other, their outputs will be close as well. In [14], the authors learned the relationships between dataset features to generate new data attributes using fuzzy rules. Other studies [15][16][17] have proposed the extending attribute information (EAI) method to investigate the applicability of extracting features from small datasets by applying similarity-based algorithms with a fuzzy membership function on seven different datasets. The authors in [18] proposed the sample extending attribute (SEA) method to extend a suitable quantity of attributes for improving the learning performance on small datasets and preventing the data from becoming sparse.
Research on the subject has mostly been restricted to increasing the accuracy of classification algorithms on limited-size datasets, while little attention has been paid to the impact of dataset size on the performance of those algorithms. Moreover, the proposed solutions suffer from multiple issues, such as data replicates [13], unscalability [8,10], and noise [13,19]. Studies similar to our work exist in the literature, where the main aim is to investigate the extent to which dataset size can impact classification performance in different domains, such as sentiment classification [2,20], object detection [21], plant disease classification [22], and information retrieval [23]. Table 1 summarizes the most relevant related works.

Ref.   Purpose/Goal                                            No. of Datasets   Dataset Size Range
[8]    Enhance the performance of models on limited datasets   1                 176
[11]   Enhance the performance of models on limited datasets   2                 NA
[12]   Augment training set instances                          2                 19-30
[13]   Augment training set instances                          3                 66-90
[14]   Extend training set features

This work aims to investigate the impact of dataset size on the performance of six widely-used supervised machine learning models in the medical domain. For this purpose, we carried out extensive experiments on six classification models, including support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), adaboost (AB), and naïve Bayes (NB), using twenty medical UCI datasets [24]. We further implemented three dataset size reduction scenarios on two large datasets, resulting in three small subsets each. We then analyzed the change in performance of the models in response to the reduction of dataset size with respect to accuracy, precision, recall, f-score, specificity, and area under the ROC curve (AUC). Statistical tests were used to assess the statistical significance of the differences in performance across scenarios.
The rest of the paper is organized as follows. In Section 2, we describe the methodology, including the datasets, the classification models, and the performance evaluation. In Section 3, we present and discuss the results. Finally, Section 4 concludes our work.

Methodology
As mentioned earlier, this study aims to investigate the impact of dataset size on classification performance and to recommend the appropriate classifier(s) for limited-size datasets. To achieve this goal, we followed an experimental methodology in which we selected datasets of varying sizes and grouped them into two groups: small datasets and large datasets. We extracted three small datasets randomly, using sampling without replacement, from each large dataset. The partitioning protocol is described in Section 2.1 below. The goal is to examine the impact of reducing the size of the same dataset on classification performance. After preprocessing the datasets, a total of six widely-used classification models were trained on all datasets. The performance of the classifiers was evaluated with respect to accuracy, precision, recall, specificity, f-score, and AUC. In the following subsections, we discuss the dataset selection and partitioning algorithm, the classification models, and the performance evaluation metrics.

Dataset
We selected twenty datasets from the UCI data repository [24]. The datasets were selected from medical fields where limited data are common. Table 2 shows details about the selected datasets, arranged by size, along with their number of attributes and data type. There is no explicit definition of a small dataset in the literature. Therefore, to determine the size range for selecting small datasets in this work, we reviewed existing studies of small datasets and kept track of the sizes of their datasets. As shown in Table 1, the size of small datasets used in existing works ranges from 18 to 1030 across studies [8,[11][12][13][14][15]17,18]. Accordingly, the selected twenty datasets were categorized as eighteen small datasets and two large datasets.
The small datasets (DS1-DS18) consist of eighteen medical datasets. The number of instances in these small datasets ranges from 80 to 1040, and the number of features ranges from 3 to 49. All small datasets are numerical or numerical with text. The category of large datasets contains two datasets: the Skin Segmentation dataset (DS19 in Table 2) and the Diabetes 130-US hospitals dataset (DS20 in Table 2). The former consists of 245,057 instances and four features of numeric datatype, while the latter has 9871 instances and 55 features of mixed numeric and text datatypes.
To study the impact of dataset size on the performance of classifiers, we constructed three small sub-datasets of increasing sizes from each large dataset using sampling without replacement, as shown in Table 3. Figure 1 presents the dataset partitioning algorithm. As shown in the figure, the algorithm receives the two large datasets and returns three small sub-datasets SS1, SS2, and SS3 for each large dataset. It first defines the sizes of the three small sub-datasets (980, 490, and 98). These were selected from the three equal intervals (highest, middle, and lowest) of the size range of small datasets (18-1030), respectively. Next, the algorithm iterates over the large datasets. For each dataset, the algorithm creates a copy of the dataset (SL) to avoid modifying the original dataset.
The algorithm then iterates over the array of small sizes to create the corresponding small sub-dataset SSi, where X tuples are extracted randomly without replacement to avoid overlap between the sub-datasets. This is achieved by removing the sub-dataset SSi from the large dataset SL after extraction. The iterations continue until all three sub-datasets are created for both large datasets. Data preprocessing was carried out for all datasets as necessary to deal with missing values.
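As a concrete illustration, the partitioning protocol described above can be sketched in plain Python; the function name and the use of row indices (rather than actual dataset tuples) are our own assumptions for the sketch:

```python
import random

def partition_large_dataset(n_rows, sizes=(980, 490, 98), seed=42):
    """Draw disjoint sub-dataset index sets from a large dataset.

    Mirrors the protocol above: sample without replacement, then remove
    each extracted sub-dataset from the working copy (SL) so that the
    three sub-datasets cannot overlap.
    """
    rng = random.Random(seed)
    pool = list(range(n_rows))  # working copy; the original stays intact
    subsets = []
    for size in sizes:
        sample = rng.sample(pool, size)  # sampling without replacement
        subsets.append(sample)
        taken = set(sample)
        pool = [i for i in pool if i not in taken]  # remove extracted rows
    return subsets
```

For the skin segmentation dataset, for instance, this would be called with n_rows=245057, yielding disjoint index sets of 980, 490, and 98 rows.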

Classification Models
We used six widely-used classifiers: probabilistic classification using naïve Bayes (NB), decision function classification using support vector machine (SVM), neural network (NN), decision tree induction with C4.5 (DT), the tree ensemble random forest (RF), and ensemble adaptive boosting (AB). Below, we shed light on these classification models:
• SVM: The objective of the SVM algorithm is to find the hyperplane that gives the largest separation margin between data instances and classifies them into two classes. It can be explained through four basic concepts: the separating hyperplane, the maximum margin hyperplane, the soft margin, and the kernel function [25,26].
• NB: A supervised learning method based on Bayes' theorem, and therefore considered a statistical method for classification. It works by calculating explicit probabilities for hypotheses, and NB models use maximum likelihood for parameter estimation. The literature shows that it often performs well in many complex real-world applications. Notable features of this method are that it is robust to noise in the data and that it can estimate its parameters from a small training set [25][26][27].
• AB: Boosting is one of the most important families of ensemble methods, and adaptive boosting (AB) is one of the most important boosting algorithms. The adaptiveness of AB comes from training successive weak learners and fine-tuning them in favor of the instances misclassified by previous classifiers. AB is sensitive to noisy data and outliers but, in some cases, can be less susceptible to overfitting than other learning algorithms [28].
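To make the NB description concrete, the following is a minimal Gaussian naïve Bayes sketch in plain Python (the function names are ours, and WEKA's implementation differs in its handling of discretization and smoothing): it estimates per-class priors, feature means, and variances by maximum likelihood and classifies by the largest log-posterior.

```python
import math

def fit_gaussian_nb(X, y):
    # Maximum-likelihood estimates per class: prior, feature means, variances.
    params = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        varis = [sum((v - m) ** 2 for v in col) / n + 1e-9  # avoid zero variance
                 for col, m in zip(cols, means)]
        params[c] = (n / len(y), means, varis)
    return params

def predict_gaussian_nb(params, x):
    # Choose the class maximizing log prior + sum of log Gaussian densities.
    best_class, best_logp = None, float("-inf")
    for c, (prior, means, varis) in params.items():
        logp = math.log(prior)
        for v, m, s2 in zip(x, means, varis):
            logp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class
```

The conditional-independence assumption appears in the inner loop: the per-feature log-densities are simply summed, which is why so few parameters (and hence so little data) are needed.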

Performance Evaluation
In contrast to most existing efforts in the literature, which use accuracy as the sole performance measure, we evaluate the performance of the classification models with respect to six metrics that are important in the medical domain, namely accuracy, precision, recall, f-score, specificity, and AUC. Furthermore, the Mann-Whitney U test is applied to assess the statistical significance of the differences between the performance of the models in different scenarios.
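All six metrics can be derived from the binary confusion counts plus, for AUC, the ranking of prediction scores. A small self-contained sketch (the helper names are our own):

```python
def confusion_metrics(y_true, y_pred):
    # Confusion counts for the positive class (label 1).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # a.k.a. sensitivity
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f-score": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

def auc(y_true, scores):
    # AUC = probability that a random positive outscores a random negative.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that AUC is computed from the classifier's scores rather than its hard predictions, which is why it can disagree with the threshold-based metrics.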

Results
In the following sections, the experimental results are presented for the classification models with both the small datasets and the large datasets with their subsets. The experiments were carried out in the Waikato Environment for Knowledge Analysis (WEKA) version 3.8 [29] on a Windows 10 personal computer with a 2.70 GHz Core i7 processor and 8.0 GB of memory (RAM). For all classification models, we used the WEKA default parameter values, which are shown in Table 4. Each reported result is the average of 10-fold cross-validation.
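The 10-fold protocol can be sketched as follows; this is a plain (non-stratified) variant of the fold construction, with names of our own choosing, not WEKA's internal implementation:

```python
import random

def k_fold_splits(n_instances, k=10, seed=1):
    # Shuffle the indices once, deal them into k near-equal folds, and
    # let each fold serve once as the held-out test set.
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

The reported score for a model is then the mean of its k per-fold scores.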

Small Datasets
The performance of the six classification models, namely AB, RF, NN, DT, NB, and SVM, when trained on the eighteen small datasets is presented in Table 5 with respect to accuracy. The performance of the classification models with respect to precision, recall, specificity, f-score, and AUC is shown in Tables A1-A5 in Appendix A. Several observations can be made from Table 5. First, the average accuracy of classifiers trained on the small datasets ranges from 62% on DS18 to 99% on DS1 and DS8. Second, the average accuracy of classifiers across the small datasets ranges from 79.28%, achieved by AB, to 82.78%, achieved by DT. Third, the standard deviations across classifiers (Std. Dev. for each dataset, last column) are smaller than the standard deviations across datasets (Std. Dev. for each classifier, last row).
Similar trends are observed in the performance of the classifiers with respect to precision, recall, specificity, f-score, and AUC in Tables A1-A5 in Appendix A. For instance, the average precision of classifiers in Table A1 ranges from 62.43% on DS18 to 99% on DS1 and DS8, and the average recall ranges from 61.68% on DS18 to 99.12% on DS8 (see Table A2). In addition, the average precision of classifiers across the small datasets ranges from 78.07%, achieved by AB, to 82.21%, achieved by NB. For recall, the average performance of classifiers across the small datasets ranges from 79.22% by AB to 82.73% by DT. Furthermore, as with accuracy in Table 5, the standard deviations across classifiers in Tables A1-A5 are smaller than the standard deviations across datasets.
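The observation that per-dataset spread is smaller than per-classifier spread is easy to reproduce from any results matrix. A minimal sketch with purely illustrative accuracy values (not the paper's actual results):

```python
import statistics

# Rows = datasets, columns = classifiers; values are illustrative only.
accuracy = [
    [0.99, 0.98, 0.99, 0.97, 0.98, 0.99],  # an easy dataset (DS1-like)
    [0.85, 0.83, 0.86, 0.82, 0.80, 0.84],  # a mid-range dataset
    [0.64, 0.62, 0.66, 0.63, 0.61, 0.65],  # a hard dataset (DS18-like)
]

# Std. dev. across classifiers: one value per dataset (per row).
across_classifiers = [statistics.stdev(row) for row in accuracy]

# Std. dev. across datasets: one value per classifier (per column).
across_datasets = [statistics.stdev(col) for col in zip(*accuracy)]

# Here the per-dataset spread is clearly smaller than the per-classifier
# spread, matching the qualitative pattern reported in Tables 5 and A1-A5.
```

The design point is that the two groups of standard deviations answer different questions: the first measures how much the choice of classifier matters on a fixed dataset, the second how much the dataset matters for a fixed classifier.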

Large Datasets
Figures 2 and 3 show the performance of the six classification models with respect to accuracy, precision, recall, f-score, specificity, and AUC when trained on the large datasets, namely diabetes and skin segmentation, respectively, across decreasing sizes of the training set. The x-axis in the figures shows the size of the dataset, namely the large dataset (LD), the small dataset of size 980 (SD980), the small dataset of size 490 (SD490), and the small dataset of size 98 (SD98). LD indicates that the full large dataset, as shown in Table 3, is used for training, for both the diabetes and skin segmentation datasets.

In all figures, each line chart has three segments reflecting the results in the three dataset size reduction scenarios. The first segment ranges from LD to SD980 and shows the result of the first size reduction scenario, which we refer to as the LD-SD980 scenario. This segment presents a key result in the chart, as it depicts the change in performance of a classifier trained on a large dataset (LD) when trained instead on a small dataset of size 980 (SD980). The second segment stretches from SD980 to SD490. It illustrates the change in performance of a classifier in the second size reduction scenario, SD980-SD490, where the size of the dataset is reduced from 980 (SD980) to an even smaller dataset of size 490 (SD490). In a similar manner, the third segment extends from SD490 to SD98. It shows the change in performance of a classifier when the size of the dataset is reduced from 490 (SD490) to 98 (SD98), which we refer to as the third scenario, SD490-SD98.
Several observations can be made from these figures. First, most classifiers exhibit relatively similar performance trends over decreasing training set size with respect to all six performance metrics. This can be seen by comparing the performance of one classifier across the metrics. Second, there is a clear general trend of decreasing performance with respect to all metrics for almost all classifiers in all size reduction scenarios on both datasets, although the classifiers showed varying reactions to the different scenarios. The most striking observation is that the performance of the AB model increases as the diabetes dataset size decreases. Third, the best performing classifiers may vary across datasets. For instance, on the diabetes dataset (Figure 2), the best performing classifiers are SVM and NN, while, on the skin segmentation dataset (Figure 3), RF, DT, and NN perform best. However, on both datasets, AB is the worst performing classifier with respect to most performance metrics.

Small Datasets
The results presented in Section 3.1 are quite revealing in several ways. First, they reveal that, depending on the problem domain, dataset size is not necessarily an obstacle to a high performing model, since the average performance of classifiers reached 99% on some small datasets. Second, since the standard deviations across classifiers are smaller than the standard deviations across datasets, the results indicate that, given a small dataset, the classifiers perform relatively similarly, while each classifier's performance varies across the small datasets. On assessing the statistical significance of the difference between the two groups of standard deviations, we found that the difference is significant (p = 0.00076) at p < 0.05. The null hypothesis for this test asserts that the medians of the two groups are identical. Taken together, these results reveal that constructing a dataset that is well representative of the original distribution, regardless of its size, is more important than choosing a particular classification model.

Large Datasets
Interestingly, the classifiers exhibited varying reactions to the different size reduction scenarios. We used a Mann-Whitney U test at p < 0.05 to assess the statistical significance of the differences in performance between scenarios. In each test, we compared two groups of values representing the performance of one model on a dataset at two sizes, where each group contains the performance of the model in the ten folds. Tables A6-A11 in Appendix A show the resulting p-values for all classification models in each reduction scenario, indicating whether the scenario caused a significant decrease in model performance with respect to each performance metric.
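The per-scenario test can be sketched as below. This is a deliberately simplified implementation using the normal approximation, without the tie-variance or continuity corrections that a statistics library would apply, so the function name and p-values are illustrative:

```python
import math

def mann_whitney_p(group_a, group_b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    group_a and group_b would each hold one model's per-fold scores at
    one dataset size (ten values per group in the setup above).
    """
    combined = sorted(group_a + group_b)
    # Assign ranks, averaging the ranks of tied values.
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n1, n2 = len(group_a), len(group_b)
    r1 = sum(ranks[v] for v in group_a)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
```

Because the test is rank-based, it makes no normality assumption about the per-fold scores, which is why it suits small groups such as ten cross-validation folds.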
Statistical tests revealed that DT is the most sensitive model to dataset size, since its performance decreases significantly in the majority of the scenarios (~70% of the scenarios in Table A8). RF and NN showed a relatively similar response to the decrease in dataset size, as they show significant performance degradation in 44% and 42% of the scenarios in Tables A9 and A7, respectively. Tree-based models are trained by splitting the data on predictor variables to find pure subsets (i.e., instances that belong to the same class), which are then used to compute the conditional probabilities. Therefore, the model's predictions are based on considerably less data than the original dataset. NN, in turn, learns by adjusting a large number of weights using backpropagation; thus, more data allows further adjustment and hence better performance. The next model is SVM, whose performance decreases significantly in 36% of the scenarios in Table A6. As is well known, the position of the SVM hyperplane depends only on the support vectors. Consequently, the size of the dataset is irrelevant as long as the data include the support vectors. AB and NB exhibited robust performance, decreasing significantly in only 13% and 19% of the scenarios in Tables A10 and A11, respectively. Since NB is a simple algorithm that assumes conditional independence between variables, it needs less data to train. This makes it a high-bias model, but one that is immune to the most common issue with small training sets: overfitting.
Together, these results provide important insights into dataset size and classifier performance. First, in support of our previous observation, the overall performance of classifiers depends on the extent to which the dataset represents the original distribution rather than on its size. Second, it is clear from our experiments and statistical tests that the most robust models for small medical datasets are AB and NB, followed by SVM, then NN and RF, while the least robust model is DT. Third, on comparing the classifiers' performance on the small datasets (Tables 5 and A1-A5) with their performance in the three dataset size reduction scenarios (Tables A6-A11 and Figures 2 and 3), an interesting observation can be made: a machine learning model that is robust to dataset size reduction does not necessarily provide the best performance compared to other models. This is evident from the observation that AB and NB were the most robust models to dataset size reduction, yet they had the lowest average accuracy on the small datasets in Table 5 compared to the other models. In addition, as explained in Section 3.2, AB was the worst performing classifier with respect to most performance metrics on both large datasets.

Conclusions
Recent years have witnessed an increased interest in modern healthcare services, such as automated diagnosis and personalized medicine. However, the success of such services is eminently dependent on the availability of datasets. Collecting medical data may face many challenges, such as patients' privacy and lack of data for rare conditions. This work investigated the impact of dataset size on the performance of six widely-used supervised machine learning models in the medical domain. For this purpose, we carried out extensive experiments on six classification models, including SVM, NN, DT, RF, AB, and NB, using twenty medical UCI datasets [24]. We further implemented three dataset size reduction scenarios on two large datasets, resulting in three small subsets each. We then analyzed the change in performance of the models in response to the reduction of dataset size with respect to accuracy, precision, recall, f-score, specificity, and AUC. Statistical tests were used to assess the statistical significance of the differences in performance across scenarios.
Several interesting conclusions can be made. First, the overall performance of classifiers depends on the extent to which a dataset represents the original distribution rather than on its size. Second, the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Third, a machine learning model that is robust to limited data does not necessarily provide the best performance compared to other models. Our results are in agreement with previous studies [2]. A natural progression of this research would be to investigate the minimum dataset size that each classifier needs in order to maximize its performance.

Funding: Project number (RSP-2020/204), King Saud University, Riyadh, Saudi Arabia.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
• DT: A decision tree is a classification tree constructed from the training data. In the tree structure, class labels are represented by leaf nodes, while the internal nodes represent conjunctions of features that determine the class. Notable decision tree algorithms include ID3 (Iterative Dichotomiser 3), C4.5 (the successor of ID3), and CART (Classification And Regression Trees) [25,26]. In this study, the C4.5 algorithm is used to deploy the DT classifier.
• NN: One of the most widely-used classification models, and a good alternative to several traditional classification methods. One of the main advantages of NN is that it is a data-driven, self-adaptive method: it adjusts to the data without the need for an explicit specification of the underlying model. Another feature of NN is that it is a nonlinear, model-free method [25][26][27].
• RF: As the name implies, the RF classifier consists of a number of individual decision trees. Each tree in the forest casts a vote for the output class, and the class with the majority of votes becomes the model's prediction [25].

Table 4. Classification models parameter values.

Figure 2. Performance of classifiers with respect to (a) accuracy, (b) precision, (c) recall, (d) f-score, (e) specificity, and (f) AUC when trained on the diabetes dataset and its small subsets.

Figure 3. Performance of classifiers with respect to (a) accuracy, (b) precision, (c) recall, (d) f-score, (e) specificity, and (f) AUC when trained on the skin segmentation dataset and its small subsets.

Table 1. Comparison of related works.

Table 3. Large datasets and their subsets.

Table 5. Accuracy of classifiers trained on small datasets.

Table A1. Precision of classifiers trained on small datasets.

Table A2. Recall of classifiers trained on small datasets.

Table A3. F-score of classifiers trained on small datasets.

Table A4. Specificity of classifiers trained on small datasets.

Table A5. AUC of classifiers trained on small datasets.

Table A6. p-values for different size reduction scenarios using the SVM model; bold values are significant.

Table A7. p-values for different size reduction scenarios using the NN model; bold values are significant.

Table A8. p-values for different size reduction scenarios using the DT model; bold values are significant.

Table A9. p-values for different size reduction scenarios using the RF model; bold values are significant.

Table A10. p-values for different size reduction scenarios using the AB model; bold values are significant.

Table A11. p-values for different size reduction scenarios using the NB model; bold values are significant.