Decision Tree-Based Data Stratification Method for the Minimization of the Masking Effect in Adverse Drug Reaction Signal Detection
Round 1
Reviewer 1 Report
This manuscript presents a nice application of decision trees for detecting adverse drug reactions and avoiding masking effects.
The idea of this work is very nice and not complex. The results are clear and the decision tree method can easily show its advantages over the classical methods.
It is an interesting article that deserves to be published.
Small comments:
1. In the abstract (methods): please define the term "F" (i.e., the harmonic mean of Precision and Recall).
2. In the "Methods" section, you can include a brief explanation of decision trees for readers who are not familiar with them.
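As an illustration of the definition requested in comment 1, the F-measure is conventionally computed as the harmonic mean of Precision and Recall. The sketch below is a minimal one; the function name `f_measure` and the input values are illustrative assumptions and are not taken from the manuscript.

```python
def f_measure(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        # Both measures are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values chosen only to show the behaviour of the formula.
balanced = f_measure(0.5, 0.5)    # equal inputs give the same value back
unbalanced = f_measure(1.0, 0.5)  # the lower of the two inputs dominates
```

Because the harmonic mean is pulled toward the smaller input, a high Recall cannot compensate for a low Precision, and vice versa.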
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
This is the review of the paper titled "Decision Tree-based Data Stratification Method for the Minimization of Masking Effect in Adverse Drug Reaction Signal Detection".
The paper suffers from a lack of novelty and the presentation of the paper is bad. The following comments are the reasons to reject the paper:
1- Lack of novelty, simple methods.
2- The presentation of the paper is bad; many more details are needed on the methods, the contributions of the paper and the research gap; the abstract is too long.
3- Results are low; no comparison with any previous methods.
I believe the paper is not ready. I recommend that the authors work on the issues that they mentioned in the limitations section and put more effort into the proposed classifiers.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
The paper deals with the problem of pharmacovigilance and proposes an interesting method for the minimization of the masking effect in adverse drug reaction signal detection based on machine learning. The main issues with this study are in the Methods section, which must be improved. There are also some minor English issues (I list a few below) which must be addressed before publication.
Abstract
Avoid acronyms. Also, you specified PRR in the introduction (correctly), so do not use it in the abstract.
Introduction
You do not seem to be aware of resampling methods to improve classification in imbalanced medical datasets, also by means of decision trees (J48):
Barbieri, D.; Chawla, N.; Zaccagni, L.; Grgurinović, T.; Šarac, J.; Čoklo, M.; Missoni, S. Predicting Cardiovascular Risk in Athletes: Resampling Improves Classification Performance. Int. J. Environ. Res. Public Health 2020, 17, 7923.
“Which caused by data deviation”: which is caused? Can be caused?
“the ADR combination may be it will cause the strength of the combination of drug A and other adverse reactions or the strength of the combination of other drugs and adverse reactions B to become smaller”: this is not clear, you must re-write the sentence.
“Besides that, there are many causes of the masking effect, such as confounding factors, low data quality”: not here, put it above where you list other causes: “DPA has an inherent defect, the masking effect, which can be caused by data deviation, competition deviation, confounding factors, low data quality etc.”
“They thought that stratification can”: could reveal
“semi-quantitative Assessment”: no new line, no capital
“The results showed”: past, always use it when citing previous studies. Check spacing throughout the text, many blanks are missing.
Methods
This part is not so clear. Say explicitly which class you want to predict, and the predictors. Does the dataset report only cases of ADR? Which are the possible ADRs included in your study dataset? Or are you trying to improve binary prediction of all ADRs by means of stratification? Also, you do not address the problem of imbalance in your dataset. Accuracy is meaningless in such datasets.
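The point about accuracy on imbalanced data can be illustrated with a small sketch. The labels below are hypothetical (95 negatives, 5 positives) and do not come from the study dataset; the "classifier" simply predicts the majority class for every record.

```python
# Hypothetical imbalanced labels: 95 negative records and 5 positive records.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * 100

# Count the cells of the confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Under these assumptions the accuracy is high although no positive record is ever found, which is why Precision, Recall and the F-measure are more informative in this setting.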
2.1 Data source
“contains fields such as”: ideally, specify all the predictive variables you used. I also assume ADR 1/0 was the class attribute to be predicted
“missing information reports”: maybe records with missing information?
“Which including”: Check English
2.2 Stratification strategy
“confounding factor, such as age or gender”: why are they confounding? it may seem obvious, but you must explain.
“data stratification are to select”: delete “are to”
“(for example, metronidazole is mainly used for female gynecological diseases, and quinolone is mainly used for the elderly)”: between brackets, no quotes
“on decision tree”: on a decision tree or on decision trees
“it is deemed to a positive signal”: it is deemed to be? It is considered?
“is used to performance evaluation”: to perform evaluation? For performance evaluation? Again, not clear
Results
“The J48 classification algorithm in WEKA software”: this must be declared in Methods
Conclusions
“Data stratification can effectively reduce the data masking effect”: really? Did you calculate significance? Improvements do not seem to be relevant.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 2 Report
The authors haven't made serious changes as requested and the paper overall has a lack of novelty.
Author Response
Thank you for your comments. We have made major revisions to this article based on your comments and those of the other reviewers. Existing methods to minimize the masking effect in ADR signal detection considered only the factors of gender and age, each separately. The application of these methods to Chinese spontaneous reporting has not achieved good results, because China's ADR monitoring reports contain a large number of antibiotic cases. Our method attempted, for the first time, to take the drug category as a new confounding factor and achieved good results. More confounding factors should be further considered in the data stratification method. Therefore, this study is innovative in theory and application.
Reviewer 3 Report
The main problem which persists is that the Results do not seem to support the conclusions. Calculations in the Discussion are wrong. There are still some mistakes in English, and I list a few below. You still use the present tense, mixed with the past. Stick to past throughout the paper.
Abstract
“Some improved methods used gender and age for data stratification only considered…” not clear. Did you mean "methods which used..."?
“by Proportional Reporting Ratio method” by means of?
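For readers unfamiliar with the measure named in the comment above, the Proportional Reporting Ratio is commonly computed from a 2x2 table of report counts. The sketch below uses that common definition; the function name `prr`, the cell labels `a` to `d`, and the counts are illustrative assumptions, not values from the manuscript.

```python
def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional Reporting Ratio from a 2x2 table of report counts.

    a: reports with the drug of interest and the reaction of interest
    b: reports with the drug of interest and any other reaction
    c: reports with any other drug and the reaction of interest
    d: reports with any other drug and any other reaction
    """
    rate_for_drug = a / (a + b)     # share of the drug's reports with the reaction
    rate_for_others = c / (c + d)   # share of all other reports with the reaction
    return rate_for_drug / rate_for_others


# Hypothetical counts: 20 of 100 reports for the drug mention the reaction,
# against 50 of 500 reports for all other drugs.
example_prr = prr(20, 80, 50, 450)
```

A value above 1 means the reaction is reported proportionally more often with the drug of interest than with the other drugs in the database.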
Introduction
“there are a large number”: there is a large
Methods
Please, specify the class to be predicted: is it ADR name? drug name?
“Because there are many differences...” makes no sense. Subject and verb? Re-write the whole sentence. Did you mean to attach it to the previous?
“to analyze the correlation between age and drug category ("Antibiotics" or "Non-antibiotics"), and gender was the same;” again, makes no sense, re-write the sentence
“Some indicators are used for performance evaluation, such as Precision, Recall and F.” No “some”, list all and only those used. If you use only those three: “Classification performances were assessed by means of Precision, Recall and F-measure.”
“The decision tree is the most widely used” rather: "is a widely used…" I do not think you can give a reference to substantiate that it is the most widely used overall.
“Stop when…” It stops. Besides, it does not necessarily stop when a pure node is reached.
Results
“It can be seen from Table 4 that the F of the three stratification methods has been improved. Among them, the decision tree-stratification is the most obvious, and the F value of the decision tree-stratification is an increase of 6.62% compared with non-stratification.” I honestly do not see these improvements. From the results table I can see F went from 29% (no stratification) to 31% (DT stratification), less than 2%. Is that a significant improvement? The only improvement I can see here is in R (recall).
Discussion
“largest manufacturers and users of antibiotics” manufacturer and user, singular
“52.26% to 68.83%, an increase of 31.7%... And the value of F-measure increased from 29.16% to 31.09%, an increase of 6.62%.” Please check these basic calculations. Have you used the first value (no stratification) as a base rate? This is not correct, since results are already percentages. So, for example, in the first case the improvement is 68.83% - 52.26% = 16.57 percentage points. Same for the rest. If R went from 1% to 2%, that is NOT a 100% improvement, but an improvement of 1 percentage point.
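The two readings of "an increase" at issue in this comment can be checked directly. The sketch below uses only the four values quoted above; the function names `absolute_change` and `relative_change` are illustrative assumptions.

```python
def absolute_change(before: float, after: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return (after - before) / before * 100


# Values quoted from the Discussion: Recall and F-measure, without and with
# decision tree stratification.
recall_abs = absolute_change(52.26, 68.83)
recall_rel = relative_change(52.26, 68.83)
f_abs = absolute_change(29.16, 31.09)
f_rel = relative_change(29.16, 31.09)

print(f"Recall: {recall_abs:.2f} points absolute, {recall_rel:.1f}% relative")
print(f"F-measure: {f_abs:.2f} points absolute, {f_rel:.2f}% relative")
```

The figures of 31.7% and 6.62% in the quoted passage match the relative change, whereas the differences between the percentages themselves are 16.57 and 1.93 percentage points.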
Author Response
Please see the attachment.
Author Response File: Author Response.docx