Application of Machine Learning Models for Predicting pIC50 Values of Plasticizers Against Cytochrome P450 Aromatase
Abstract
1. Introduction
1.1. Non-Steroidal Aromatase Inhibitors
1.2. Steroidal Aromatase Inhibitors
- statement of significance:
2. Methodology
2.1. Dataset
2.2. Data Curation Workflow
2.3. Molecular Descriptors
2.4. Train/Test Split
2.5. Model Training and Hyperparameters
2.6. Model Validation
- (i) Train/test evaluation: All models were evaluated on both the training set and the held-out test set using R², RMSE, and MAE. A large discrepancy between training and test metrics is a direct indicator of overfitting: the model has memorised patterns in the training data rather than learning transferable structure–activity relationships that generalise to new compounds.
- (ii) Five-fold cross-validation: The training set was partitioned into five folds (KFold, shuffle = True, random_state = 42) to obtain mean ± SD estimates of R², RMSE, and MAE across folds. The mean CV R² reflects average generalisation performance across different subsets of the training data. The standard deviation of CV R² serves as a stability indicator: a large SD signals that model performance depends heavily on which specific compounds fall in each fold, exposing sensitivity to training set composition rather than robust learning of underlying chemical patterns. This is particularly informative for small datasets, where individual compounds can have an outsized influence on model behaviour.
- (iii) Y-randomisation: pIC50 values in the training set were randomly permuted 100 times, and for each permutation the CatBoost model was independently retrained and evaluated on the test set. If the real model's performance substantially exceeds that of all permuted models, this indicates that its predictive ability arises from genuine structure–activity relationships encoded in the MACCS descriptors rather than from statistical artefacts, dataset size, or chance correlations in the data partitioning.
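
The three validation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the descriptor matrix and pIC50 vector are synthetic, the 80/20 split ratio is assumed (it is not given in this section), and a ridge regressor stands in for CatBoost so that the example runs with scikit-learn alone. The helper names `make_model` and `y_randomisation` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate, train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for a 187 x 166 binary fingerprint matrix and pIC50 vector.
X = rng.integers(0, 2, size=(187, 166)).astype(float)
weights = rng.normal(size=10)
y = 6.0 + X[:, :10] @ weights + rng.normal(scale=0.1, size=187)

# Assumption: 80/20 split; the ratio is not stated in this section.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def make_model():
    # Stand-in estimator (assumption); swap in the CatBoost regressor of the study.
    return Ridge(alpha=1.0)


# (i) Train/test evaluation: compare R2 on the training and held-out sets.
real_model = make_model().fit(X_train, y_train)
train_r2 = r2_score(y_train, real_model.predict(X_train))
real_r2 = r2_score(y_test, real_model.predict(X_test))

# (ii) Five-fold cross-validation on the training set only.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_out = cross_validate(
    make_model(), X_train, y_train, cv=cv,
    scoring=("r2", "neg_root_mean_squared_error", "neg_mean_absolute_error"),
)
cv_r2_mean = cv_out["test_r2"].mean()
cv_r2_sd = cv_out["test_r2"].std(ddof=1)  # fold-to-fold stability indicator


# (iii) Y-randomisation: permute training labels, retrain, score on the true test labels.
def y_randomisation(X_tr, y_tr, X_te, y_te, n_permutations=100, seed=42):
    perm_rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_permutations):
        y_perm = perm_rng.permutation(y_tr)
        model = make_model().fit(X_tr, y_perm)
        scores.append(r2_score(y_te, model.predict(X_te)))
    return np.array(scores)


perm_r2 = y_randomisation(X_train, y_train, X_test, y_test)

print(f"train R2 = {train_r2:.3f}, test R2 = {real_r2:.3f}")
print(f"CV R2 = {cv_r2_mean:.3f} ± {cv_r2_sd:.3f}")
print(f"permuted test R2: mean = {perm_r2.mean():.3f}, max = {perm_r2.max():.3f}")
```

Only the training labels are permuted; the test labels stay intact, so each permuted score measures what a model can achieve when the structure–activity link has been destroyed.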
2.7. Applicability Domain Assessment
2.8. Feature Importance Analysis
3. Results and Discussion
3.1. Model Performance: Training, Cross-Validation, and Test Set
3.2. Y-Randomisation Test
3.3. Applicability Domain Coverage and Structural Outliers
3.4. Permutation Importance and SHAP Interpretability
3.5. Predicted pIC50 Values for Representative Plasticizers
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Zhou, W.; Fang, F.; Zhu, W.; Chen, Z.-J.; Du, Y.; Zhang, J. Bisphenol A and ovarian reserve among infertile women with polycystic ovarian syndrome. Int. J. Environ. Res. Public Health 2017, 14, 18.
2. Huo, W.; Xia, W.; Wan, Y.; Zhang, B.; Zhou, A.; Zhang, Y.; Huang, K.; Zhu, Y.; Wu, C.; Peng, Y. Maternal urinary bisphenol A levels and infant low birth weight: A nested case–control study of the Health Baby Cohort in China. Environ. Int. 2015, 85, 96–103.
3. Komarowska, M.D.; Grubczak, K.; Czerniecki, J.; Hermanowicz, A.; Hermanowicz, J.M.; Debek, W.; Matuszczak, E. Identification of the Bisphenol A (BPA) and the Two Analogues BPS and BPF in Cryptorchidism. Front. Endocrinol. 2021, 12, 694669.
4. Mahlangu, W.B.; Maseko, B.R.; Mongadi, I.L.; Makhubela, N.; Ncube, S. Quantitative analysis and health risk assessment of bisphenols in selected canned foods using the modified QuEChERS method coupled with gas chromatography-mass spectrometry. Food Packag. Shelf Life 2023, 37, 101078.
5. Lehmler, H.-J.; Liu, B.; Gadogbe, M.; Bao, W. Exposure to bisphenol A, bisphenol F, and bisphenol S in US adults and children: The national health and nutrition examination survey 2013–2014. ACS Omega 2018, 3, 6523–6532.
6. Qadeer, A.; Kirsten, K.L.; Ajmal, Z.; Jiang, X.; Zhao, X. Alternative plasticizers as emerging global environmental and health threat: Another regrettable substitution? Environ. Sci. Technol. 2022, 56, 1482–1488.
7. Rochester, J.R.; Bolden, A.L. Bisphenol S and F: A systematic review and comparison of the hormonal activity of bisphenol A substitutes. Environ. Health Perspect. 2015, 123, 643–650.
8. Struzina, L.; Castro, M.A.P.; Kubwabo, C.; Siddique, S.; Zhang, G.; Fan, X.; Tian, L.; Bayen, S.; Aneck-Hahn, N.; Bornman, R. Occurrence of legacy and replacement plasticizers, bisphenols, and flame retardants in potable water in Montreal and South Africa. Sci. Total Environ. 2022, 840, 156581.
9. Di Nardo, G.; Zhang, C.; Marcelli, A.G.; Gilardi, G. Molecular and structural evolution of cytochrome P450 aromatase. Int. J. Mol. Sci. 2021, 22, 631.
10. Yoshimoto, F.K.; Guengerich, F.P. Mechanism of the third oxidative step in the conversion of androgens to estrogens by cytochrome P450 19A1 steroid aromatase. J. Am. Chem. Soc. 2014, 136, 15016–15025.
11. Turner, K.; Macpherson, S.; Millar, M.; Mcneilly, A.; Williams, K.; Cranfield, M.; Groome, N.; Sharpe, R.; Fraser, H.; Saunders, P. Development and validation of a new monoclonal antibody to mammalian aromatase. J. Endocrinol. 2002, 172, 21–30.
12. Hackett, J.C.; Brueggemeier, R.W.; Hadad, C.M. The final catalytic step of cytochrome P450 aromatase: A density functional theory study. J. Am. Chem. Soc. 2005, 127, 5224–5237.
13. Caldwell, G.W.; Yan, Z.; Lang, W.; Masucci, J.A. The IC50 concept revisited. Curr. Top. Med. Chem. 2012, 12, 1282–1290.
14. Geisler, J. Differences between the non-steroidal aromatase inhibitors anastrozole and letrozole–of clinical importance? Br. J. Cancer 2011, 104, 1059–1066.
15. Kijima, I.; Itoh, T.; Chen, S. Growth inhibition of oestrogen receptor-positive and aromatase-positive human breast cancer cells in monolayer and spheroid cultures by letrozole, anastrozole, and tamoxifen. J. Steroid Biochem. Mol. Biol. 2005, 97, 360–368.
16. Soares, T.A.; Nunes-Alves, A.; Mazzolari, A.; Ruggiu, F.; Wei, G.-W.; Merz, K. The (Re)-Evolution of Quantitative Structure–Activity Relationship (QSAR) studies propelled by the surge of machine learning methods. J. Chem. Inf. Model. 2022, 62, 5317–5320.
17. Shoombuatong, W.; Schaduangrat, N.; Nantasenamat, C. Towards understanding aromatase inhibitory activity via QSAR modeling. EXCLI J. 2018, 17, 688–708.
18. Zdrazil, B.; Felix, E.; Hunter, F.; Manners, E.J.; Blackshaw, J.; Corbett, S.; De Veij, M.; Ioannidis, H.; Lopez, D.M.; Mosquera, J.F.; et al. The ChEMBL database in 2023: A drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 2024, 52, D1180–D1192.
19. Landrum, G.A. RDKit: Open-Source Cheminformatics Software, version 2023.09.1; Zenodo: Geneva, Switzerland, 2023.
20. Gramatica, P. Principles of QSAR models validation: Internal and external. QSAR Comb. Sci. 2007, 26, 694–701.
21. Dearden, J.C.; Cronin, M.T.D.; Kaiser, K.L.E. How not to develop a quantitative structure–activity or structure–property relationship (QSAR/QSPR). SAR QSAR Environ. Res. 2009, 20, 241–266.
22. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31, 6638–6648.







| Step | Description | Criterion/Action | Compounds Remaining | Compounds Removed | Ref. |
|---|---|---|---|---|---|
| 1 | Raw ChEMBL records (v33, CHEMBL1978) | Retrieve: IC50, nM, relation ‘=’ | 330 | — | [18] |
| 2 | Remove missing values | Drop null IC50 or null canonical SMILES | 322 | 8 | |
| 3 | Remove duplicate structures | One record per unique canonical SMILES | 249 | 73 | |
| 4 | Exclude intermediate-activity compounds | Remove 1000 < IC50 ≤ 10,000 nM | 187 | 62 | |
| 5 | pIC50 transformation | pIC50 = −log₁₀(IC50 × 10⁻⁹), IC50 in nM | 187 | 0 | |
| 6 | Generate MACCS fingerprints | RDKit MACCSkeys (166 bits) | 187 | 0 | [19] |
| | Final dataset for modelling | | 187 | — | |

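
Steps 2 to 5 of the curation table can be sketched in plain Python as below. This is an illustrative sketch, not the authors' pipeline: the record keys `smiles` and `ic50_nm` are hypothetical field names, the toy records are invented for the example, and keeping the first record seen for each duplicate structure is an assumption, since the table does not say which duplicate was retained.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L), with IC50 supplied in nM (1 nM = 1e-9 M)."""
    return -math.log10(ic50_nm * 1e-9)


def ic50_nm_from_pic50(pic50: float) -> float:
    """Inverse transform: IC50 in nM recovered from a pIC50 value."""
    return 10 ** (9.0 - pic50)


def curate(records):
    """Apply curation steps 2-5 to a list of dicts with 'smiles' and 'ic50_nm'."""
    # Step 2: drop records with a missing IC50 or a missing SMILES string.
    complete = [r for r in records
                if r.get("ic50_nm") is not None and r.get("smiles")]

    # Step 3: one record per unique SMILES (assumption: keep the first seen).
    unique = {}
    for r in complete:
        unique.setdefault(r["smiles"], r)

    # Step 4: remove the intermediate band 1000 < IC50 <= 10,000 nM.
    kept = [r for r in unique.values()
            if not (1000 < r["ic50_nm"] <= 10_000)]

    # Step 5: attach the pIC50 value to each surviving record.
    return [{**r, "pic50": pic50_from_ic50_nm(r["ic50_nm"])} for r in kept]


# Toy records (invented) exercising each removal rule once.
toy = [
    {"smiles": "CCO", "ic50_nm": 10.0},
    {"smiles": "CCO", "ic50_nm": 12.0},         # duplicate structure
    {"smiles": "c1ccccc1", "ic50_nm": 5000.0},  # intermediate activity
    {"smiles": "CCN", "ic50_nm": None},         # missing IC50
    {"smiles": "CCCC", "ic50_nm": 50_000.0},
]
curated = curate(toy)
for row in curated:
    print(row["smiles"], round(row["pic50"], 2))
```

Because the IC50 values are in nM, the transform is equivalent to pIC50 = 9 − log₁₀(IC50 in nM), so an IC50 of 1000 nM maps to a pIC50 of 6.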
| Model | Key Hyperparameters | Default/Tuned |
|---|---|---|
| Random Forest | n_estimators = 100, random_state = 42, n_jobs = −1 | Default |
| CatBoost | iterations = 300, learning_rate = 0.05, depth = 6, random_seed = 42 | Default |
| KNN | n_neighbors = 5, metric = ‘minkowski’ | Default |
| XGBoost | n_estimators = 200, learning_rate = 0.05, max_depth = 4, random_state = 42 | Default |
| LightGBM | n_estimators = 200, learning_rate = 0.05, num_leaves = 31, random_state = 42 | Default |
| Gradient Boosting | n_estimators = 200, learning_rate = 0.05, max_depth = 4, random_state = 42 | Default |
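
The hyperparameters listed in the table above map onto regressor constructors roughly as follows. This is a sketch under assumptions: the use of the scikit-learn, CatBoost, XGBoost, and LightGBM regressor classes is inferred from the parameter names rather than stated in the table, `verbose=0` is added only to silence CatBoost logging, and the three third-party boosting libraries are imported optionally so the block still runs when they are not installed.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Models available from scikit-learn alone, with the values from the table.
models = {
    "Random Forest": RandomForestRegressor(
        n_estimators=100, random_state=42, n_jobs=-1
    ),
    "KNN": KNeighborsRegressor(n_neighbors=5, metric="minkowski"),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
    ),
}

# Optional third-party libraries: each is added only if it can be imported.
try:
    from catboost import CatBoostRegressor
    models["CatBoost"] = CatBoostRegressor(
        iterations=300, learning_rate=0.05, depth=6, random_seed=42,
        verbose=0,  # assumption: logging silenced, not part of the table
    )
except ImportError:
    pass

try:
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(
        n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
    )
except ImportError:
    pass

try:
    from lightgbm import LGBMRegressor
    models["LightGBM"] = LGBMRegressor(
        n_estimators=200, learning_rate=0.05, num_leaves=31, random_state=42
    )
except ImportError:
    pass

print(sorted(models))
```

Keeping the models in one dictionary lets the same fit-and-score loop be applied to every entry, which keeps the comparison between algorithms consistent.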

| Model | Train R² | Train RMSE | Train MAE | CV R² (Mean ± SD) | CV RMSE (Mean ± SD) | Test R² | Test RMSE | Test MAE |
|---|---|---|---|---|---|---|---|---|
| Random Forest | 0.650 | 1.127 | 0.435 | −0.226 ± 0.775 | 1.859 ± 0.529 | 0.551 | 0.959 | 0.746 |
| CatBoost ★ | 0.732 | 0.986 | 0.367 | 0.062 ± 0.304 | 1.744 ± 0.587 | 0.693 | 0.794 | 0.659 |
| KNN | 0.375 | 1.505 | 0.995 | 0.008 ± 0.265 | 1.854 ± 0.547 | 0.456 | 1.057 | 0.835 |
| XGBoost | 0.711 | 1.024 | 0.327 | −0.640 ± 1.687 | 2.165 ± 0.720 | 0.432 | 1.079 | 0.777 |
| LightGBM | 0.297 | 1.597 | 0.682 | −0.081 ± 0.202 | 1.855 ± 0.520 | 0.324 | 1.178 | 1.008 |
| Gradient Boosting | 0.732 | 0.985 | 0.233 | −0.660 ± 1.676 | 2.175 ± 0.714 | 0.326 | 1.176 | 0.807 |

| Abbrev. | Full Name | Predicted pIC50 | Predicted IC50 (nM) | Leverage (h) | Applicability Domain |
|---|---|---|---|---|---|
| DEHP | Bis(2-ethylhexyl) phthalate | 8.22 | 6.0 | 0.169 | Inside AD |
| DBP | Dibutyl phthalate | 6.74 | 180.5 | 0.167 | Inside AD |
| BBP | Benzyl butyl phthalate | 4.13 | 74,329 | 0.175 | Inside AD |
| DINP | Diisononyl phthalate | 7.77 | 17.1 | 0.125 | Inside AD |
| BPA | Bisphenol A | 5.91 | 1226 | 0.052 | Inside AD |
| BPS | Bisphenol S | 5.70 | 2003 | 0.434 | Outside AD |
| BPF | Bisphenol F | 5.91 | 1226 | 0.052 | Inside AD |
| DPHP | Dipentylhexyl phthalate | 8.22 | 6.0 | 0.169 | Inside AD |
| TPBT | Tributyl phosphate | 6.85 | 141.7 | 0.793 | Outside AD |
| ATBC | Acetyltributyl citrate | 8.22 | 6.0 | 0.169 | Inside AD |
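
The leverage values and inside/outside labels in the table above follow the general pattern of a leverage-based applicability domain check. The sketch below uses the conventional definition h = xᵀ(XᵀX)⁻¹x and the common warning threshold h* = 3(p + 1)/n. Both are assumptions: this section does not state how leverage or the cut-off was computed in the study, and the five-column synthetic descriptor matrix is purely illustrative.

```python
import numpy as np


def leverages(X_train: np.ndarray, X_query: np.ndarray) -> np.ndarray:
    """Leverage h_i = x_i^T (X^T X)^+ x_i for every row x_i of X_query."""
    # The pseudo-inverse keeps the computation defined if X^T X is singular.
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def warning_leverage(n_samples: int, n_descriptors: int) -> float:
    """Conventional threshold h* = 3(p + 1)/n (assumed, not from the source)."""
    return 3.0 * (n_descriptors + 1) / n_samples


rng = np.random.default_rng(42)
X_train = rng.normal(size=(150, 5))    # synthetic training descriptors
X_query = np.vstack([
    np.zeros(5),                       # at the centre of the training data
    np.full(5, 10.0),                  # far outside the training data
])

h_star = warning_leverage(*X_train.shape)
h_query = leverages(X_train, X_query)
inside_ad = h_query <= h_star

for h, ok in zip(h_query, inside_ad):
    print(f"h = {h:.3f} -> {'Inside AD' if ok else 'Outside AD'}")
```

A query compound whose leverage exceeds the threshold lies in a sparsely sampled region of descriptor space, so its prediction is an extrapolation and should be read with caution.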

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Mongadi, I.L.; Rapulenyane, N.; Mahlangu, W.B.; Oyourou, J.-N. Application of Machine Learning Models for Predicting pIC50 Values of Plasticizers Against Cytochrome P450 Aromatase. Chemistry 2026, 8, 68. https://doi.org/10.3390/chemistry8050068

