Article

Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification

1 Department of Plasma Chemistry, Institute of Materials and Environmental Chemistry, ELKH Research Centre for Natural Sciences, Magyar Tudósok krt. 2, H-1117 Budapest, Hungary
2 Medicinal Chemistry Research Group, ELKH Research Centre for Natural Sciences, Magyar Tudósok krt. 2, H-1117 Budapest, Hungary
* Author to whom correspondence should be addressed.
Academic Editors: Kok Hwa Lim and Alla P. Toropova
Molecules 2021, 26(4), 1111; https://doi.org/10.3390/molecules26041111
Received: 23 December 2020 / Revised: 4 February 2021 / Accepted: 16 February 2021 / Published: 19 February 2021
(This article belongs to the Special Issue QSAR and QSPR: Recent Developments and Applications II)
In typical quantitative structure-activity/property relationship (QSAR/QSPR) modeling and classification, datasets range from a few hundred to several thousand samples. However, dataset size and the train/test split ratio can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that models are ranked differently depending on the performance metric(s) used. Here, 25 performance parameters were calculated for each model, and factorial ANOVA was then applied to compare the results. The results clearly show differences not only between the applied machine learning algorithms but also between the dataset sizes and, to a lesser extent, the train/test split ratios. The XGBoost algorithm outperformed the others, even in multiclass modeling. The performance parameters responded differently to changes in sample set size; some were much more sensitive to this factor than others. Moreover, significant differences were detected between train/test split ratios as well, and these had a great effect on the test validation of our models.
Keywords: machine learning; XGBoost; validation; training/test split ratio; multiclass classification; imbalanced
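
The experimental design summarized in the abstract (a grid of dataset sizes and train/test split ratios, with each combination scored by several performance parameters) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the synthetic three-class data, the grid values in `run_grid`, the nearest-centroid classifier standing in for the five algorithms, and the two metrics standing in for the 25 performance parameters. All function names are hypothetical.

```python
import random
from collections import defaultdict


def make_dataset(n, rng, weights=(0.6, 0.3, 0.1)):
    """Synthetic imbalanced 3-class data: one 2-D Gaussian blob per class."""
    centers = [(0.0, 0.0), (2.5, 0.0), (1.0, 2.5)]  # illustrative values
    counts = [round(n * w) for w in weights[:-1]]
    counts.append(n - sum(counts))  # last class takes the remainder
    return [((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label)
            for label, ((cx, cy), count) in enumerate(zip(centers, counts))
            for _ in range(count)]


def stratified_split(data, test_fraction, rng):
    """Split class by class so train and test keep similar class proportions."""
    by_class = defaultdict(list)
    for sample in data:
        by_class[sample[1]].append(sample)
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)
        # Assumes every class has at least two samples, so that each class
        # appears on both sides of the split.
        n_test = min(len(items) - 1, max(1, round(len(items) * test_fraction)))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def fit_centroids(train):
    """Stand-in classifier (not one of the article's algorithms): class means."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in train:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    return {label: (sx / k, sy / k) for label, (sx, sy, k) in sums.items()}


def predict(centroids, point):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda c: (point[0] - centroids[c][0]) ** 2
               + (point[1] - centroids[c][1]) ** 2)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, so minority classes weigh as much as the majority."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def run_grid(sizes=(100, 500, 1000), test_fractions=(0.5, 0.4, 0.3, 0.2), seed=0):
    """Evaluate every (dataset size, split ratio) combination once."""
    rng = random.Random(seed)
    results = []
    for n in sizes:
        data = make_dataset(n, rng)
        for frac in test_fractions:
            train, test = stratified_split(data, frac, rng)
            centroids = fit_centroids(train)
            y_true = [label for _, label in test]
            y_pred = [predict(centroids, point) for point, _ in test]
            results.append({
                "n": n, "test_fraction": frac,
                "n_train": len(train), "n_test": len(test),
                "accuracy": accuracy(y_true, y_pred),
                "balanced_accuracy": balanced_accuracy(y_true, y_pred),
            })
    return results


if __name__ == "__main__":
    for row in run_grid():
        print(row)
```

A single run per grid cell, as above, is not enough for a statistical comparison. A design along the lines described in the abstract would repeat each cell over several random splits and then analyze the resulting table with dataset size, split ratio, and algorithm as factors; that analysis step is not shown here.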

MDPI and ACS Style

Rácz, A.; Bajusz, D.; Héberger, K. Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification. Molecules 2021, 26, 1111. https://doi.org/10.3390/molecules26041111
