Article
Peer-Review Record

Efficient Selection of Gaussian Kernel SVM Parameters for Imbalanced Data

by Chen-An Tsai * and Yu-Jing Chang
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 14 January 2023 / Revised: 11 February 2023 / Accepted: 23 February 2023 / Published: 25 February 2023
(This article belongs to the Special Issue Machine Learning Supervised Algorithms in Bioinformatics)

Round 1

Reviewer 1 Report

This manuscript proposes a Gaussian radial basis kernel-based SVM algorithm to address the imbalanced data problem and algorithm parameter selection. The paper is mostly accessible, though parts of it are written in vague terminology (see my comments below). The method is sufficiently novel. The key feature of the proposed method is its speed, though its classification performance is slightly lower than or similar to that of the other methods compared here. I have the following comments.
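For context, a minimal, hedged sketch of the kind of model under discussion (a Gaussian/RBF kernel SVM with simple class weighting on imbalanced data) is shown below. It uses scikit-learn and synthetic data for illustration only; it is not the authors' b-SVM parameter-selection procedure.

```python
# Illustration only: a Gaussian (RBF) kernel SVM with simple class weighting
# on synthetic imbalanced data, using scikit-learn. This is NOT the authors'
# b-SVM parameter-selection procedure; it just shows the model family and the
# parameters (C, gamma) the manuscript is concerned with.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with roughly a 9:1 majority/minority split.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# gamma sets the Gaussian kernel width, K(x, x') = exp(-gamma * ||x - x'||^2);
# class_weight="balanced" upweights errors on the minority class.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```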


Line 7. 'Satisfying prediction power' is a vague term. Do not use vague and unquantifiable terms like 'satisfying'.


Line 132. k is undefined


Line 165. 'Satisfying result' is not a scientific term nor an acceptable way to describe results. Be more specific here.


Lines 160-165. Here the authors argue, based on Figures 3 and 4, that the proposed method, b-SVM, yields similar results to CV-THR SVM and SMOTE SVM. The problem here is again vagueness: there is no proper definition of 'similar'. In several of the subplots of Figures 3 and 4, b-SVM performance is slightly lower than that of the other two methods, yet the authors gloss over it and call it 'similar'. Later, in Lines 176-179, when presenting the real-data results (Figure 7), where b-SVM has slightly higher performance than CV-THR SVM and the others, the authors describe this slight difference as 'high values of G-mean'. This is vague and inconsistent. First, the metric should be quantified: b-SVM is x% better than the other methods, or vice versa. Second, if the authors deem an x% difference in G-mean performance between two methods to be 'similar' in the simulated-data case (Figures 3 and 4), the same definition should apply to the real-data case (Figure 7). The main advantage of the method seems to be its fast computation time; even if the classification performance is only equal to that of the other methods, that is fine and should be clearly stated.
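For concreteness, the G-mean referred to above is the geometric mean of sensitivity and specificity. A minimal sketch of how the requested quantitative comparison could be reported is given below; the labels and predictions are made-up stand-ins, not results from the manuscript.

```python
# Hedged illustration only: compute G-mean (geometric mean of sensitivity and
# specificity) for two hypothetical methods and report their relative difference.
# All numbers below are invented for demonstration.
import numpy as np
from sklearn.metrics import confusion_matrix

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (true positive rate) and specificity."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
pred_a = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])  # hypothetical method A
pred_b = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])  # hypothetical method B

g_a, g_b = g_mean(y_true, pred_a), g_mean(y_true, pred_b)
print(f"G-mean A = {g_a:.3f}, G-mean B = {g_b:.3f}, "
      f"A is {100 * (g_a - g_b) / g_b:.1f}% higher than B")
```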

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This is well written paper about an interesting advancement in the area of Support Vector Machines. The proposed SVM algorithm, the problems that it addresses, and the methods used to test its efficiency are well explained in the manuscript. Hence my recommendations are more a matter of style than about the substance of the paper.

My recommendations are the following:

- Line 1: No need to say "gradually"; developing class prediction models has been an important area of research for a while, so there is no need to be modest.

- Line 34: It is doubtful that SVM is the "most popular classifier". The authors need to provide a reference to back up this claim.

- Lines 38-42: The meaning of the vector w is missing.

- Line 91: The parameter gamma is introduced without a definition (it does not appear in the equations in lines 38-45), nor is its meaning provided, hence it is not clear why we should be concerned with its optimization.

- Table 1: Mention earlier in the paper (possibly in the abstract itself) that these data sets are related to genes and bioinformatics. Otherwise, it is not clear why this paper was submitted to this special issue of the Genes journal.

- Table 1: Although these are unbalanced data sets, it would have been interesting to see how the studied algorithms perform when applied to rare diseases that lead to extremely unbalanced data sets; e.g., multiple sclerosis emerges at 90 patients per 100,000 people, narcolepsy affects 50 patients per 100,000, and cystic fibrosis appears in 25 patients per 100,000. Maybe simulated data that is as unbalanced as these examples could be used instead.

- Figures 1 & 2: It is not clear that two parallelograms are needed at the top and bottom of these figures (they differ by one word and follow the same paths).

- Figures 1 & 2: These are so similar that maybe they could be combined to save considerable real estate in the paper.

- Figures 3-8: Adding whiskers representing the 95% CI of the G-mean to the bars would help to assess whether the performance of b-SVM is truly significantly different from that of the other methods (one way such intervals could be computed is sketched after this list).

- Line 200: Given that classic SVM is as fast as b-SVM (and that in many class prediction problems speed is not really important), and given that "the values of standard errors are small, and less than 0.01" for all methods on the simulated data, it might be better to provide nuanced conclusions with guidelines about when b-SVM is significantly better than the other methods and hence preferable to apply. Please consider following this approach instead of stating broad conclusions such as "our method b-SVM is more efficient", which might hold only for the particular characteristics of the data selected for the experiments and the initial parameter conditions studied in this paper.
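As referenced in the Figures 3-8 item above, here is a minimal sketch of one way the suggested 95% intervals could be obtained: a percentile bootstrap over per-replicate G-mean values. The array of G-mean values is a hypothetical stand-in for one method's results across simulation runs; it is not taken from the manuscript, and the bootstrap is only one option among several.

```python
# Hedged sketch: percentile-bootstrap 95% CI for a method's mean G-mean,
# computed from hypothetical per-replicate values (not data from the paper).
import numpy as np

gmeans = np.array([0.89, 0.91, 0.90, 0.92, 0.88, 0.90, 0.93, 0.89, 0.91, 0.90])

rng = np.random.default_rng(0)
# Resample the replicate G-means with replacement and recompute the mean.
boot_means = [rng.choice(gmeans, size=len(gmeans), replace=True).mean()
              for _ in range(5000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean G-mean = {gmeans.mean():.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```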

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have revised the manuscript in light of my previous comments. I have no further objections to the publication of this manuscript.
