Generation of Controlled Synthetic Samples and Impact of Hyper-Tuning Parameters to Effectively Classify the Complex Structure of Overlapping Region

Round 1
Reviewer 1 Report
Ensembles of classifiers are widely used for classification problems: they achieve higher performance but require more time and hardware to implement. For small domains they are a very practical alternative, but for large domains they can lead to very complex problems. This study shows how to choose the parameters to be tuned in the classification algorithms in a simple and practical way.
Author Response
Dear Reviewer,
Please find the attached .docx document.
Regards,
Ghani ur Rehman
Author Response File: Author Response.docx
Reviewer 2 Report
# The paper discusses learning methodologies for multi-class problems; the topic requires expertise in both traditional classifiers and the problem-domain datasets, which is appreciated.
# Add two or more comparative tables of existing work to Section 2 of this paper so that the related-work discussion is more impactful and readable.
# There is no uniformity in how the paper's sections are divided; it seems imbalanced in terms of content. Sections 6 and 7 can be combined, and Sections 8, 9, and 10 can be merged, as they look very short in terms of content.
# However, Section 9 seems very lengthy, so it could be divided into sub-sections.
# Cite the articles below to improve the readability of this paper:
(a) Karthik, S., Bhadoria, R. S., Lee, J. G., Sivaraman, A. K., Samanta, S., Balasundaram, A., Chaurasia, B. K., & Ashokkumar, S. (2022). Prognostic Kalman Filter Based Bayesian Learning Model for Data Accuracy Prediction. Computers, Materials & Continua (CMC), 72(1), 243-259.
(b) Singh, L. K., Garg, H., Khanna, M., & Bhadoria, R. S. (2021). An enhanced deep image model for glaucoma diagnosis using feature-based detection in retinal fundus. Medical & Biological Engineering & Computing, 59(2), 333-353.
Author Response
Dear Reviewer,
Please find the attached .docx file.
Author Response File: Author Response.docx
Reviewer 3 Report
In this paper, the authors analyze the learning behavior of state-of-the-art ensemble and non-ensemble classifiers on imbalanced and overlapping multi-class data. They use grid search techniques to optimize key parameters (by hyper-tuning) of ensemble and non-ensemble classifiers for multi-class imbalanced classification problems. Around 20% of the dataset samples are generated synthetically to augment the majority class. The authors provide a brief description of the tuned parameters and their effects on imbalanced data, and a comparison of ensemble and non-ensemble classifiers with the default and tuned parameters for both the original and synthetically overlapped datasets. The authors claim that the main contribution of the paper is a novel experiment on hyper-tuning of six state-of-the-art ensemble and non-ensemble classifiers on multi-class imbalanced datasets using four evaluation metrics, viz. overall accuracy (ACC), geometric mean (G-mean), F-measure, and area under the curve (AUC). Also, an algorithm is designed to synthetically generate and overlap the existing dataset by 20% of the existing samples to make it more complex.
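The summary above names the full tuning setup; as a point of reference, here is a minimal sketch of a multi-metric grid search of that kind, assuming scikit-learn and imbalanced-learn. The two estimators, their parameter grids, and the choice to refit on G-mean are illustrative placeholders, not the authors' actual configuration.

```python
# Minimal sketch of multi-metric grid-search hyper-tuning (assumed setup,
# not the authors' exact configuration).
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from imblearn.metrics import geometric_mean_score

# Three of the four metrics named in the summary; multi-class AUC needs
# probability estimates and is omitted here for brevity.
scoring = {
    "acc": "accuracy",
    "f1_macro": "f1_macro",
    "gmean": make_scorer(geometric_mean_score, average="macro"),
}

# Illustrative candidate classifiers and parameter grids (placeholders).
candidates = {
    "GB": (GradientBoostingClassifier(),
           {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}),
    "RF": (RandomForestClassifier(),
           {"n_estimators": [100, 300], "max_depth": [None, 10]}),
}

def tune_all(X, y):
    """Run a 5-fold grid search per classifier, refitting on G-mean."""
    results = {}
    for name, (est, grid) in candidates.items():
        search = GridSearchCV(est, grid, scoring=scoring, refit="gmean", cv=5)
        search.fit(X, y)
        results[name] = (search.best_params_, search.best_score_)
    return results
```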
Overall, the observations are:
The problem is relevant and significant.
However, there are limitations of the work which should be addressed.
The writing is poor; there is no clarity, and some sentences seem incomplete. For instance, the opening sentence of the abstract: "Involve multiple classes with uneven distribution of data samples, resulting in the majority 1 and minority classes." There are many typos, such as 'compassion' instead of 'comparison'. Thorough proofreading is required.
The technical and scientific contribution is not too strong as the methodology is quite naïve.
There is no proper justification for selecting these particular methods and datasets for experimental evaluation.
The concepts are not clearly presented, even though numerous equations are given. For instance, the concept of "overlapping samples" is not properly explained: are these duplicate records, why do they occur, where do they occur, etc.? (A sketch of one plausible interpretation follows these observations.)
Most of the references have simply been listed but are not related to the work done by the authors.
There are no significant results that can add to the knowledge base in this field.
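On the "overlapping samples" question raised above: the paper's generation algorithm is not reproduced in this review record, so the following is only a sketch of one plausible interpretation, namely jittered copies of majority-class samples that spill into neighbouring classes' feature regions. The function name, the Gaussian-jitter scheme, and the default noise scale are hypothetical; only the 20% rate is taken from the summary.

```python
import numpy as np

def add_overlapping_samples(X, y, majority_label, rate=0.20,
                            noise_scale=0.5, seed=None):
    """Hypothetical sketch: augment the majority class with jittered copies
    of its own samples so that roughly rate*len(X) new points spread into
    the overlap region. Not the authors' exact procedure."""
    rng = np.random.default_rng(seed)
    n_new = int(rate * len(X))
    pool = X[y == majority_label]
    idx = rng.integers(0, len(pool), size=n_new)
    # Gaussian jitter scaled by the per-feature std widens the class region,
    # so copies land in neighbouring classes' territory.
    jitter = rng.normal(0.0, noise_scale * X.std(axis=0),
                        size=(n_new, X.shape[1]))
    X_new = pool[idx] + jitter
    y_new = np.full(n_new, majority_label)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])
```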
Author Response
Dear Reviewer,
Please find the attached .docx file.
Author Response File: Author Response.docx
Reviewer 4 Report
The manuscript aims to improve performance on multi-class imbalanced problems with overlapping regions through hyper-tuning and the generation of synthetic samples. The research uses grid search techniques to find the optimal key parameters by hyper-tuning, and generates synthetic dataset samples to address the imbalance issue in the dataset. In addition, the authors make a detailed comparison of ensemble and non-ensemble classifiers with the default and tuned parameters for both the original and synthetically overlapped datasets.
I have a few suggestions to further improve the paper. The detailed review comments are as follows:
- In this paper, the example models used include six SOTA classifiers: GB, RF, DT, KNN, R-SVM, and LR; is there a motivation for selecting these models to test the search algorithm? Please explain why these models were used.
- Is extending this work to regression models a next step, or has it already been tested? If so, it may deserve a mention.
- For the results provided in Tables 6-9 and Figures 6-7, how many replications were executed to obtain the results from each model? Are the differences between the before and after results statistically significant? I suggest providing sufficient experimental validation of the results (a sketch of one suitable paired test follows this list).
- This paper states "we believe that the underlying paper is the first kind of effort in this domain"; has it been verified that no hyper-parameter tuning methods (such as BO, etc.) have been applied in this domain? And what is the superiority of this grid search method over other methods like BO in this domain? Please state that (see the grid-search/BO sketch after this list).
- I recommend comparing the before/after results of the same dataset/model in one chart for Figures 6-11; that will significantly improve readers' ability to understand the results.
- Some table/figure formats may need to be improved; for instance, Tables 4-9 approach the page boundary.
- Tables 8-9 should be labeled accuracy/precision after tuning (not "before").
- I suggest changing the y-axis range from 0-100 to 60-100 for Figures 6-11.
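On the replication and significance point above: a standard way to answer it is to repeat each dataset/model experiment over several seeded runs and apply a paired non-parametric test to the before/after scores. A minimal sketch, assuming SciPy; the score arrays are placeholder values.

```python
from scipy.stats import wilcoxon

# Paired per-replication accuracies for one dataset/model (placeholders).
acc_default = [0.81, 0.79, 0.83, 0.80, 0.82]  # default parameters
acc_tuned   = [0.86, 0.84, 0.88, 0.85, 0.87]  # hyper-tuned parameters

# Wilcoxon signed-rank test on the paired differences.
stat, p = wilcoxon(acc_tuned, acc_default)
print(f"Wilcoxon statistic={stat:.3f}, p-value={p:.4f}")
# A small p-value suggests the tuned-vs-default gap is unlikely to be
# explained by run-to-run variation alone.
```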
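On the grid search versus BO point: the sketch below sets up both searches on the same estimator, assuming scikit-learn and scikit-optimize are available; the parameter spaces are illustrative. On the small discrete grids typical of such studies, exhaustive grid search is simple and exactly reproducible; BO's adaptive sampling tends to pay off on larger or continuous spaces.

```python
# Illustrative contrast of exhaustive grid search vs. Bayesian optimisation
# (assumed setup; scikit-learn plus scikit-optimize).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from skopt import BayesSearchCV
from skopt.space import Integer

est = RandomForestClassifier()

# Grid search: evaluates every combination (here 3 x 3 = 9 fits per fold).
grid = GridSearchCV(est, {"n_estimators": [100, 200, 300],
                          "max_depth": [5, 10, 15]}, cv=5)

# BO: samples the space adaptively with a surrogate model, typically
# needing fewer evaluations at the cost of extra modelling overhead.
bo = BayesSearchCV(est, {"n_estimators": Integer(50, 500),
                         "max_depth": Integer(3, 20)}, n_iter=16, cv=5)
```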
Author Response
Dear Reviewer,
Please find the attached .docx file
Author Response File: Author Response.docx
Reviewer 5 Report
Comments
The paper provides a good literature review on multi-class balanced and imbalanced data, and on ensemble and non-ensemble algorithms. The experiments consider a grid of various algorithms, datasets, and conventional and hyper-parameters for fine-tuning.
There is a good collection of references.
Shortcomings
There are many acronyms used for algorithms and data; it is obligatory to have a table of all acronyms in one place for reference.
Poor English expression and sentence construction; there are typographical errors, and grammar checking is required.
The first sentence of the abstract is incomplete.
Line 7: "traditional classifiers and classifiers" is redundant.
Line 37, insert "(DT)" for Decision Tree: "Most of the traditional classifiers [2], like k-Nearest Neighbor (kNN), Naïve Bayes (NB), Artificial Neural Network (ANN), Decision Tree (DT), Support Vector Machine (SVM), and Logistic Regression (LR) designed for the balanced and linear distribution of the instances in the training dataset between the classes."
Line 60, "…of both the majority and rare classes is nearly equal…": explain what is meant by a rare class.
Line 70, correct the grammar: "overlapping problem may results in the model overfitting".
Line 93: rewrite the sentence.
Line 99, correct: "… synthetically generate and overlapped" (inconsistent tense).
Line 163: "In the letter …" (presumably "In the latter …").
Line 164: XGBoost is not defined anywhere in the paper.
Equation (4): the sum is over L weak learners ("week learner" appears to be a typo for "weak learner"; see the reconstructed form after these comments).
Table entries are not uniformly formatted.
Accuracy metrics are mentioned but not described to the reader.
Line 622: "compassion"? (presumably "comparison").
There are too many errors; I gave up checking the accuracy of the development and methods, and any contribution by the authors.
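For reference, the comment on Equation (4) presumably concerns the standard additive form of an ensemble of L weak learners. A hedged reconstruction in LaTeX (the symbols are assumed, since the paper's notation is not reproduced in this record):

```latex
% Assumed standard form of Eq. (4): an additive ensemble of L weak learners
F(x) = \sum_{l=1}^{L} \alpha_l \, h_l(x)
```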
Author Response
Dear Reviewer,
Please find the attached .docx file
Author Response File: Author Response.docx
Round 2
Reviewer 3 Report
Please check for any typographical errors and journal format.
Author Response
Dear Reviewer,
Please find the attached document.
Regards,
Ghani ur Rehman
Author Response File: Author Response.docx
Reviewer 4 Report
Thanks for the quick turnaround; the updated version looks better. Still, there are some minor questions. The cover letter states: "Here in this case we are not concerned with the classifier performance, but rather to show the growing impact of the synthetic overlapping samples, which are created in each iteration and then inserted into the existing dataset." If classifier performance is not critical to representing the results of the proposed model, it may be worth considering additional metrics that capture the impact of the synthetic overlapping samples, instead of focusing on a performance analysis that does not actually demonstrate the superiority of the proposed method; for example, accuracy versus the percentage of data points synthetically injected into the existing data. It is still not quite clear how the synthetic data generation impacts the results, considering the original data sizes, how much the data overlap in the different datasets, and how they respond to the existing method. Among the hundreds of experiments you have done, you may already have this information; it deserves to be sorted out with data-related metrics and analyzed to show an in-depth understanding of the method (a sketch of such a sweep follows).
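A minimal sketch of the suggested accuracy-versus-injection analysis, assuming scikit-learn; inject stands for any synthetic-overlap generator, for instance the hypothetical add_overlapping_samples sketch earlier in this record (wrapped as inject=lambda X, y, r: add_overlapping_samples(X, y, majority_label=0, rate=r)). Plotting the returned curve directly shows how accuracy responds as the synthetic overlap grows.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def accuracy_vs_injection(X, y, inject, rates=(0.0, 0.1, 0.2, 0.3, 0.4)):
    """Sweep the synthetic-injection fraction and record mean 5-fold CV
    accuracy at each level. inject(X, y, rate) must return the augmented
    (X, y); classifier and metric are illustrative choices."""
    curve = []
    for r in rates:
        Xa, ya = inject(X, y, r)
        acc = cross_val_score(RandomForestClassifier(), Xa, ya,
                              cv=5, scoring="accuracy").mean()
        curve.append((r, acc))
    return curve
```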
Author Response
Dear Reviewer,
Please find the attached document.
Regards,
Ghani Ur Rehman
Author Response File: Author Response.docx
Reviewer 5 Report
No further comments.
Author Response
Dear Reviewer,
Thanks for no further comments and suggestions.
Regards,
Ghani Ur Rehman