Enabling Early Prediction of Side Effects of Novel Lead Hypertension Drug Molecules Using Machine Learning
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The reviewed paper addresses a very important and timely issue—the prediction of side effects of novel lead hypertension drug molecules. To solve this problem, the authors appropriately employ effective machine learning methods using well-known databases such as SIDER and ChEMBL. However, my main concerns relate to the methodology used in conducting the QSAR studies:
- Unfortunately, the authors do not indicate the size of the training set used, nor do they mention a test set, which, most likely, was not used at all.
- In evaluating the performance of the classification QSAR models, important metrics such as recall, precision, and the Matthews correlation coefficient are not used. The out-of-bag procedure is not applied to assess the predictive power of the models, and there is no evaluation of the applicability domain for any of the QSAR models.
- The set of functional groups should reasonably be expanded to include the main types of heterocyclic systems.
- For model interpretation, the authors randomly select 50 molecules, then, for some unclear reason, divide them into five groups and select one molecule from each group. In my opinion, these manipulations are meaningless, especially given that the predictive power of the QSAR models has not been demonstrated.
Thus, the studies require substantial improvements. In its current form, the manuscript is not suitable for publication.
Author Response
Comment 1: Unfortunately, the authors do not indicate the size of the training set used, nor do they mention a test set, which, most likely, was not used at all.
Response 1: In response, we have explicitly stated in Section 2, subsection 2.2, that we used a dataset comprising 726 drug molecules, as detailed in the original dataset file. This dataset was split into a 70% training set (508 samples) and a 30% test set (218 samples) using stratified sampling.
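For concreteness, a minimal sketch of such a split is shown below (illustrative only; the file and column names are placeholders, not the study's actual code, and stratification is shown on a single label column):

```python
# Minimal illustrative sketch of the 70/30 stratified split described above.
# File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("hypertension_leads.csv")       # 726 drug molecules (hypothetical file)
X = df.drop(columns=["side_effect_label"])       # molecular descriptors / functional-group features
y = df["side_effect_label"]                      # label column used for stratification

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42   # ~508 train / ~218 test samples
)
print(len(X_train), len(X_test))
```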
Comment 2: In evaluating the performance of the classification QSAR models, important metrics such as recall, precision, and the Matthews correlation coefficient are not used. The out-of-bag procedure is not applied to assess the predictive power of the models, and there is no evaluation of the applicability domain for any of the QSAR models.
Response 2: In response, we clarify that precision and recall are reported in the paper through the F1-score, which is the harmonic mean of precision and recall. We have implemented the AUC-ROC metric in the revised manuscript. The authors acknowledge that MCC is a useful metric that is currently absent and will integrate it in future iterations of this research. We have not applied the out-of-bag (OOB) estimate because of the small size of the dataset; once the dataset is expanded, the OOB procedure can be implemented.
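For reference, the reported score is F1 = 2 · (precision · recall) / (precision + recall), computed per label and then averaged with each label weighted by its number of true instances (the weighted-average scheme described in our response to Reviewer 2).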
Comment 3: The set of functional groups should reasonably be expanded to include the main types of heterocyclic systems.
Response 3: In response, we note that the reviewer raises a valid concern. Our current functional-group features, enumerated in the manuscript (§2.3 "Feature Engineering"), focus mainly on key pharmacologically relevant groups (amines, alcohols, halogens, carbonyl derivatives, etc.), intentionally chosen for interpretability and common relevance in hypertension drugs. That said, expanding the features to include explicit heterocyclic systems would likely improve predictive power. However, given the small sample size of our dataset relative to its dimensionality (726 datapoints and over 25 features), we chose to exclude specific heterocyclic systems as categorical features to mitigate the curse of dimensionality.
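To illustrate how such functional-group features can be derived, the following sketch counts substructure matches with RDKit; the group names and SMARTS patterns are examples only and are not the exact feature set or code used in the study:

```python
# Illustrative functional-group counting with RDKit (example SMARTS, not the study's code).
from rdkit import Chem

FUNCTIONAL_GROUPS = {
    "primary_amine": "[NX3;H2][CX4]",   # -NH2 on an sp3 carbon
    "alcohol":       "[OX2H][CX4]",     # aliphatic hydroxyl
    "halogen":       "[F,Cl,Br,I]",
    "carbonyl":      "[CX3]=[OX1]",
}

def count_groups(smiles: str) -> dict:
    """Return the number of matches for each functional group in the molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {name: 0 for name in FUNCTIONAL_GROUPS}
    return {
        name: len(mol.GetSubstructMatches(Chem.MolFromSmarts(smarts)))
        for name, smarts in FUNCTIONAL_GROUPS.items()
    }

print(count_groups("CC(=O)Nc1ccc(O)cc1"))  # paracetamol-like example molecule
```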
Comment 4: For model interpretation, the authors randomly select 50 molecules, then, for some unclear reason, divide them into five groups and select one molecule from each group. In my opinion, these manipulations are meaningless, especially given that the predictive power of the QSAR models has not been demonstrated.
Response 4: In response, we appreciate the reviewer's concern about the interpretability approach. The subset selection (50 random molecules, grouped arbitrarily) was intended purely as an illustrative example to qualitatively showcase model interpretation techniques. We now clearly state in the revised manuscript (§3.4 "Exploratory Case Study") that this selection was exploratory rather than statistically representative. Furthermore, we clarify explicitly that these visualisations are meant to provide insight into the model's decision-making process rather than rigorous quantitative validation, given the absence of ground truth for AI-generated molecules.
Reviewer 2 Report
Comments and Suggestions for Authors
The problem and evaluation setup are under-specified:
- The multi-label classification task should be more clearly defined.
- Important implementation details—such as how labels are structured, how F1 and AUC are averaged (macro, micro, per-sample), and how success is measured—are not explicitly stated.
- The role of thresholding or calibration in label prediction is also unclear.
- How was cross-validation used?
Missing key references in the field:
- The paper overlooks several relevant and recent works that have tackled similar problems using AI/ML:
  - MolOptimizer: A Molecular Optimization Toolkit for Fragment-Based Drug Design
  - Inferring primase-DNA specific recognition using a data driven approach
  - Discovery of small-molecule inhibitors targeting the ribosomal peptidyl transferase center (PTC) of M. tuberculosis
  - Application of artificial intelligence and machine learning in early detection of adverse drug reactions (ADRs) and drug-induced toxicity
  - Predicting adverse side effects of drugs
  - Artificial intelligence for assessing side effects
No comparison to existing baselines:
- The paper proposes a new model without benchmarking it against standard or published methods, which weakens its empirical contributions.
- Many existing models address similar tasks with public datasets and could serve as strong baselines.
Missed opportunity to validate on benchmark datasets:
- Numerous prior works provide datasets with labeled side effects and ground truth.
- These could be used to evaluate generalizability or at least for model comparison.
- If the authors intentionally avoided these datasets (e.g., due to feature mismatch), a justification should be included.
Evaluation on AI-generated molecules is not validated:
- While useful as a demonstration, these predictions cannot be evaluated due to lack of ground truth.
- The authors should clarify this limitation and frame it as exploratory rather than confirmatory.
Author Response
Reviewer 2
Comment 1:
The problem and evaluation setup are under-specified:
The multi-label classification task should be more clearly defined.
Important implementation details—such as how labels are structured, how F1 and AUC are averaged (macro, micro, per-sample), and how success is measured—are not explicitly stated.
The role of thresholding or calibration in label prediction is also unclear.
How was cross-validation used?
Response 1: The authors would like to note that the precision, recall, and F1 are weighted averages across the labels. This approach allowed us to account for class imbalance by weighting each label's score by the number of true instances of that label. The AUC-ROC score is reported for each label, but is also summarised across labels and thus also applies a variant of macro averaging.
Regarding the comment on the role of calibration/thresholding: since the architecture produces continuous probabilities, a threshold of 0.652 was applied, chosen in preliminary analyses as a suitable balance between precision and recall across the labels. The data is initially split using a 70-30 train-test split. The training set is then used for model selection and hyperparameter tuning via stratified cross-validation within the training data; although 5 folds were initially considered, 3 folds provided a better balance and helped prevent overfitting given the limited size of the dataset. Calibrating a boosted model can indeed yield better probability estimates than leaving it uncalibrated; this step was not performed in this experiment because we are interested in the ranking between labels rather than exact probabilities. That said, future iterations would benefit from this optimisation.
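As a concrete illustration of this evaluation scheme (weighted averaging, per-label AUC-ROC, and a fixed decision threshold), the sketch below uses scikit-learn; the threshold value is taken from the text, but the function and variable names are illustrative rather than the exact pipeline used in the study:

```python
# Illustrative multi-label evaluation sketch (not the authors' exact pipeline).
# y_true, y_prob: arrays of shape (n_samples, n_labels) with binary ground-truth
# labels and predicted probabilities for the multi-label task.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

THRESHOLD = 0.652  # fixed operating point reported in the response

def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    y_pred = (y_prob >= THRESHOLD).astype(int)
    return {
        # "weighted" averages each label's score by its number of true instances
        "precision_weighted": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall_weighted":    recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "f1_weighted":        f1_score(y_true, y_pred, average="weighted", zero_division=0),
        # AUC-ROC per label, then summarised across labels (macro average)
        "auc_per_label":      roc_auc_score(y_true, y_prob, average=None).tolist(),
        "auc_macro":          roc_auc_score(y_true, y_prob, average="macro"),
    }
```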
Comment 2: Missing key references in the field:
The paper overlooks several relevant and recent works that have tackled similar problems using AI/ML:
- MolOptimizer: A Molecular Optimization Toolkit for Fragment-Based Drug Design
- Inferring primase-DNA specific recognition using a data driven approach
- Discovery of small-molecule inhibitors targeting the ribosomal peptidyl transferase center (PTC) of M. tuberculosis
- Application of artificial intelligence and machine learning in early detection of adverse drug reactions (ADRs) and drug-induced toxicity
- Predicting adverse side effects of drugs
- Artificial intelligence for assessing side effects
Response 2: We have taken the recommended works into consideration in revising the manuscript. Some of the works are cited as references [16], [30], [31] in the revised manuscript.
Comment 3: No comparison to existing baselines:
The paper proposes a new model without benchmarking it against standard or published methods, which weakens its empirical contributions.
Many existing models address similar tasks with public datasets and could serve as strong baselines.
Response 3: The authors note the reviewer's concern regarding the lack of a baseline. A direct baseline is, however, non-trivial for the use case the paper addresses: there is currently no dataset that simultaneously focuses on antihypertensive leads, exposes this set of identified functional groups, and, most importantly, contains pre-approval side effect labels. As such, it was necessary to synthesise this dataset and to repeat the experiment with three different model architectures.
Comment 4: Missed opportunity to validate on benchmark datasets:
Numerous prior works provide datasets with labeled side effects and ground truth.
These could be used to evaluate generalizability or at least for model comparison.
If the authors intentionally avoided these datasets (e.g., due to feature mismatch), a justification should be included.
Response 4: The authors note the reviewer's concern regarding the lack of a benchmark. The same reasoning given above for the baseline issue applies to this comment.
Comment 5: Evaluation on AI-generated molecules is not validated:
While useful as a demonstration, these predictions cannot be evaluated due to lack of ground truth. The authors should clarify this limitation and frame it as exploratory rather than confirmatory.
Response 5: In response, the authors agree with this feedback, as predictions on de novo molecules are inherently exploratory. We now state this up front, relabel §3.4 as "Exploratory Case Study", and add a caution that no clinical ground truth exists yet.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have partially revised the manuscript, but in my opinion, not enough for publication. The main reasons are:
- In the new version of the manuscript, the authors provided information on the training and test sets, but the question then arises as to which set the results in Table 3 refer. The table should present the indicators for both sets (training and test).
- The authors plan to provide calculations of the Matthews correlation coefficients in a future publication. It is not clear what prevents them from doing this in the current publication. In addition, I insist on covering the results of using the out-of-bag procedure. The set size (726 molecules) is not an obstacle to this.
- The authors did not explain why they divided 50 selected molecules into 5 groups (Section 3.4)
- To interpret the models, the authors use information only on functional groups; the influence of the other calculated molecular parameters (logP, HBD, HBA, PSA) is not discussed at all.
The manuscript requires additional revision.
Author Response
Comment 1: In the new version of the manuscript, the authors provided information on the training and test sets, but then the question arises for which sample the information in Table 3 is provided. It should present the indicators for both sets (training and test).
Response 1: Thank you. In response, we have expanded Table 4 (previously Table 3) to include the model evaluation on both the training and the test set across the three models.
Comment 2: The authors plan to provide calculations of the Matthews correlation coefficients in a future publication. It is not clear what prevents them from doing this in the current publication. In addition, I insist on covering the results of using the out-of-bag procedure. The set size (726 molecules) is not an obstacle to this.
Response 2: In response, a new subsection, 2.4.6 Cross-Validation and Out-of-Bag Assessment, has been added to Section 2. In this section, we explain why stratified cross-validation is used for the two boosting models and why the out-of-bag estimates of the Random Forest provide a comparable assessment. We also report these validations in Table 3. MCC is now shown alongside Accuracy, weighted F1, and AUC-ROC for both the resampled training split and the independent 30% test split. Further, text in Sections 2.4.6 and 3.1 discusses how MCC changes from the cross-validation and out-of-bag estimates to the unseen test set.
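A minimal sketch of how an out-of-bag estimate and MCC can be obtained for a Random Forest is shown below; it assumes a single binary side-effect label and pre-split feature matrices, and is illustrative only, not the study's exact code:

```python
# Illustrative sketch: out-of-bag estimate and MCC for a Random Forest.
# X_train/X_test are feature matrices; y_train/y_test are binary labels
# for a single side-effect label (placeholders, not the study's variables).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

print("OOB accuracy (estimated on training data):", rf.oob_score_)
print("MCC on the held-out test set:", matthews_corrcoef(y_test, rf.predict(X_test)))
```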
Comment 3: The authors did not explain why they divided 50 selected molecules into 5 groups (Section 3.4)
Response 3: Thank you. In response, we have clarified in Section 3.4, now retitled "Exploratory Case Study on Fifty AI-generated Leads", that the partitioning into five groups was done to aid visualisation. We sought only to illustrate model interpretation rather than to support quantitative claims.
Comment 4: To interpret the models, the authors use information only on functional groups, the influence of other calculated molecular parameters - logP, HBD, HBA, PSA is not discussed at all.
Response 4: Thank you. We have addressed this in Section 3.3, titled Analysis of Functional Groups & ADMET Properties, where we describe how logP, HBD, HBA, and PSA contribute to the Gradient Boosting classifier, supported by a SHAP analysis. In general, we find the following order of impact: PSA > logP > HBA > HBD. For five representative molecules, we report their raw logP, PSA, HBA, and HBD values alongside the model's side-effect probabilities. The narrative shows, for example, that Molecule 23 (PSA ≈ 108 Å², logP ≈ 2.0, HBA = 7, HBD = 2) is flagged for liver, GI, and dermatological risk, consistent with its high polarity and multiple acceptors. We repeat this for all analysed molecules.
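As a pointer to how such descriptor-level SHAP contributions can be produced, the sketch below assumes a fitted tree-based classifier named `gbm` and a test feature matrix with logP, PSA, HBA, and HBD columns; these names are placeholders and the snippet is not the exact analysis code:

```python
# Illustrative SHAP sketch for a tree-based classifier (placeholder names).
# `gbm` is an already-fitted tree ensemble; X_test is a pandas DataFrame whose
# columns include "logP", "PSA", "HBA", and "HBD".
import shap

explainer = shap.TreeExplainer(gbm)
shap_values = explainer.shap_values(X_test)

# The summary plot ranks descriptors by mean |SHAP| contribution,
# e.g. an ordering such as PSA > logP > HBA > HBD.
shap.summary_plot(shap_values, X_test, feature_names=X_test.columns)
```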
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed the comments. Some comments, like baselines or datasets, were not addressed. I understand the difficulty. I'm ok with accepting.
Mol optimizer - wrong authors (please double check ChatGPT if you use it). Real authors are here: https://www.mdpi.com/1420-3049/29/1/276
Author Response
Comment 1: There is an incorrect reference for Mol Optimizer.
Response 1: Thanks to the reviewer for the observation. This has been addressed.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
Finally, after another revision of the manuscript, it became obvious that the models proposed in the work do not have predictive ability. This is evident from Table 4 — for the test set, the MCC values are in the range of 0.10–0.15.
Therefore, the authors should clearly state in the conclusion that the constructed models are descriptive in nature only and are currently not suitable for predicting the activity of new compounds. This significantly diminishes the quality of the work, and the Editorial Board should determine whether its publication is justified.
Author Response
Reviewer 1
Comment 1: Finally, after another revision of the manuscript, it became obvious that the models proposed in the work do not have predictive ability. This is evident from Table 4 — for the test set, the MCC values are in the range of 0.10 – 0.15. Therefore, the authors should clearly state in the conclusion that the constructed models are descriptive in nature only and are currently not suitable for predicting the activity of new compounds. This significantly diminishes the quality of the work, and the Editorial Board should determine whether its publication is justified.
Response 1: Thank you for your comment and correspondence throughout this review. Following your comment, we have explicitly stated in the Conclusion that the model should be considered descriptive rather than definitive.