Predicting the Aquatic Toxicity of Pharmaceutical and Personal Care Products: A Multitasking Modeling Approach
Abstract
1. Introduction
2. Materials and Methods
- Data Selection and Preprocessing: Users provide the raw dataset file downloaded from ECOTOX (.csv format) and a separate file (“First_input.csv”) specifying the columns to retain for further processing. In this work, the input file listed the following columns and values: (a) SN, a serial number created manually to track the data points in the original dataset; (b) CAS number (required); (c) Chemical name; (d) Species Scientific Name; (e) Species Group; (f) Conc 1 Type (Standardized); (g) Conc 1 Mean Op (Standardized); (h) Conc 1 Mean (Standardized); (i) Conc 1 Units (Standardized): AI mg/L; (j) Endpoint: LC50, EC50; and (k) Observed Duration (Days).
- Additionally, the tool requires a cut-off value to define the ‘toxic’ and ‘non-toxic’ classes. As mentioned earlier, a cut-off value of 5 mg/L was used in this work. The tool automatically saves a new file (“FirstFile_forChecking.csv”) containing the specified columns along with the activity column (i.e., ‘Active’). At this stage, the tool retains only the columns needed for modeling, as specified by the user, and discards the rest. Users can then manually check and modify “FirstFile_forChecking.csv” if necessary. For example, we created a new ‘Time’ category (Long, Medium, Short) from the ‘Observed Duration (Days)’ column and removed records with uninformative values such as “NR” (Not Reported).
- Duplicate Removal and Conflict Resolution: Once satisfied with the preliminary data, users upload the modified “FirstFile_forChecking.csv” together with a separate file (“CAStoSMILES.csv”) linking CAS numbers to their corresponding SMILES notations, which ECOTOX does not provide. Tools such as the Python-based CIRpy (available at https://github.com/mcs07/CIRpy, accessed on 16 June 2024) can assist in SMILES generation, but the CAS numbers may first need manual correction because hyphens are missing in the ECOTOX data. A third input file (“Second_input.csv”) is also needed at this stage. It specifies the column names to be treated as the final experimental elements for further processing. In this work, the following column names were included: (a) Species Scientific Name (sn); (b) Species Group (sp); (c) Conc 1 Type (Standardized); (d) Endpoint; and (e) Time. Duplicate data points are then identified as records that share the same SMILES notation and the same experimental elements. Finally, the tool generates two output files:
- “NoDuplicates.csv”: Contains data points without duplicates or duplicates with consistent activity classifications (all toxic or all non-toxic). Data points from this file are used for model development and validation.
- “DupConflict.csv”: Contains data points with conflicting duplications (e.g., data points classified as both ‘toxic’ and ‘non-toxic’). Data points from this file were not considered for modeling since conflicting activity classes were found.
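The labeling and duplicate-handling steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the tool's actual code: the dictionary keys, the helper names `label_activity` and `split_duplicates`, the rule that concentrations at or below the cut-off count as toxic, and the choice to keep one representative record per consistent duplicate group are all hypothetical, and the toy concentrations are invented.

```python
from collections import defaultdict

CUTOFF_MG_L = 5.0  # cut-off value used in this work

# Hypothetical short keys standing in for the experimental-element
# columns listed in "Second_input.csv".
CONDITION_KEYS = ("species", "species_group", "conc_type", "endpoint", "time")


def label_activity(conc_mg_l, cutoff=CUTOFF_MG_L):
    """Return 1 (toxic) or 0 (non-toxic).

    Assumption: a concentration at or below the cut-off is labeled toxic.
    """
    return 1 if conc_mg_l <= cutoff else 0


def split_duplicates(records):
    """Group records by SMILES plus experimental elements, then split them
    into consistent records and conflicting duplicates."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["smiles"],) + tuple(rec[k] for k in CONDITION_KEYS)
        groups[key].append(rec)

    no_duplicates, dup_conflict = [], []
    for group in groups.values():
        if len({rec["active"] for rec in group}) == 1:
            # Assumption: keep one representative of a consistent group.
            no_duplicates.append(group[0])
        else:
            # Both classes occur for the same key: set the group aside.
            dup_conflict.extend(group)
    return no_duplicates, dup_conflict


# Toy data: invented concentrations (mg/L) under one shared set of conditions.
conditions = {
    "species": "Species A",
    "species_group": "Group A",
    "conc_type": "Active ingredient",
    "endpoint": "LC50",
    "time": "Short",
}
toy = [("CCO", 2.0), ("CCO", 3.5), ("c1ccccc1", 1.0), ("c1ccccc1", 40.0), ("CCN", 120.0)]
records = [
    dict(conditions, smiles=smiles, conc_mg_l=conc, active=label_activity(conc))
    for smiles, conc in toy
]

no_duplicates, dup_conflict = split_duplicates(records)
print(len(no_duplicates), "consistent records;", len(dup_conflict), "conflicting records")
```

In this toy run, the two consistent records would correspond to “NoDuplicates.csv” and the two records with opposite labels for the same key to “DupConflict.csv”.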
- Linear model descriptors, where only descriptors from the most predictive linear model are selected.
- All descriptors, where all pre-treated descriptors are subjected to modeling.
- Random forest importance, where the random forest (RF) classifier of the Scikit-learn library [24] is employed to select the most significant descriptors.
- Recursive feature elimination (RFE), where the decision tree classifier is used as an estimator for descriptor selection. In RFE, a user-defined classifier is trained on the initial set of features, the feature importances are recorded, and the least important features are eliminated; this procedure is repeated recursively on the pruned set until a user-defined number of features is left.
- Sequential forward selection (SFS), where features are added sequentially according to a defined scoring function (e.g., accuracy) and a user-defined ML method is applied for model evaluation. In this case, the Mlxtend library is used for SFS feature selection, and the decision tree classifier (DTC) of Scikit-learn is employed for model evaluation.
- Genetic algorithm and k-nearest neighbors (GA-kNN), where the Python-based sklearn-genetic module (https://github.com/manuel-calzolari/sklearn-genetic, accessed on 12 July 2024) is used to run the GA through its ‘deap’-based implementation, with the following parameters: population (100), mutation probability (0.2), cross-over probability (0.5), number of generations (100), number of generations with no change (10), crossover independent probability (0.1), mutation independent probability (0.05), and fixed kNN parameters (12 neighbors, ‘uniform’ weights, and ‘auto’ algorithm). Models are evaluated by 5-fold CV with accuracy as the score. Note that in this tool, the stochastic GA method is used for descriptor selection, but each model is evaluated with kNN [25].
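Two of the selection strategies listed above, random forest importance and RFE with a decision tree estimator, can be sketched with Scikit-learn as follows. This is a minimal illustration, not the tool's own implementation: the synthetic data from `make_classification`, the number of retained descriptors (`N_KEEP`), and the hyperparameters are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a descriptor matrix (assumption: 200 compounds,
# 20 descriptors, binary toxic/non-toxic labels).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

N_KEEP = 10  # hypothetical user-defined number of descriptors to retain

# Random forest importance: fit once, rank descriptors by importance,
# and keep the top N_KEEP column indices.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_selected = np.argsort(rf.feature_importances_)[::-1][:N_KEEP]

# Recursive feature elimination: refit the decision tree on the pruned set
# and drop the least important descriptor (step=1) until N_KEEP remain.
rfe = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=N_KEEP,
    step=1,
).fit(X, y)
rfe_selected = np.flatnonzero(rfe.support_)

print("RF importance keeps columns:", sorted(rf_selected.tolist()))
print("RFE keeps columns:", rfe_selected.tolist())
```

The two methods need not agree on the retained columns, which is one reason for comparing several selection strategies with the same downstream classifiers.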
3. Results and Discussions
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Bu, Q.; Wang, B.; Huang, J.; Deng, S.; Yu, G. Pharmaceuticals and personal care products in the aquatic environment in China: A review. J. Hazard. Mater. 2013, 262, 189–211.
2. Wang, H.; Xi, H.; Xu, L.; Jin, M.; Zhao, W.; Liu, H. Ecotoxicological effects, environmental fate and risks of pharmaceutical and personal care products in the water environment: A review. Sci. Total Environ. 2021, 788, 147819.
3. Chakraborty, A.; Adhikary, S.; Bhattacharya, S.; Dutta, S.; Chatterjee, S.; Banerjee, D.; Ganguly, A.; Rajak, P. Pharmaceuticals and personal care products as emerging environmental contaminants: Prevalence, toxicity, and remedial approaches. ACS Chem. Health Saf. 2023, 30, 362–388.
4. Srain, H.S.; Beazley, K.F.; Walker, T.R. Pharmaceuticals and personal care products and their sublethal and lethal effects in aquatic organisms. Environ. Rev. 2021, 29, 142–181.
5. Chinen, K.; Malloy, T. Multi-strategy assessment of different uses of QSAR under REACH analysis of alternatives to advance information transparency. Int. J. Environ. Res. Public Health 2022, 19, 4338.
6. Belfield, S.J.; Firman, J.W.; Enoch, S.J.; Madden, J.C.; Tollefsen, K.E.; Cronin, M.T.D. A review of quantitative structure-activity relationship modelling approaches to predict the toxicity of mixtures. Comput. Toxicol. 2023, 25, 100251.
7. Halder, A.K.; Moura, A.S.; Cordeiro, M.N.D.S. Predicting the ecotoxicity of endocrine disruptive chemicals: Multitasking in silico approaches towards global models. Sci. Total Environ. 2023, 889, 164337.
8. Heo, S.; Safder, U.; Yoo, C. Deep learning driven QSAR model for environmental toxicology: Effects of endocrine disrupting chemicals on human health. Environ. Pollut. 2019, 253, 29–38.
9. Sheffield, T.Y.; Judson, R.S. Ensemble QSAR modeling to predict multispecies fish toxicity lethal concentrations and points of departure. Environ. Sci. Technol. 2019, 53, 12793–12802.
10. Na, M.; Nam, S.H.; Moon, K.; Kim, J. Development of a nano-QSAR model for predicting the toxicity of nano-metal oxide mixtures to Aliivibrio fischeri. Environ. Sci. Nano 2023, 10, 325–337.
11. Halder, A.K.; Cordeiro, M.N.D.S. Probing the environmental toxicity of deep eutectic solvents and their components: An in silico modeling approach. ACS Sustain. Chem. Eng. 2019, 7, 10649–10660.
12. Rybinska, A.; Sosnowska, A.; Grzonkowska, M.; Barycki, M.; Puzyn, T. Filling environmental data gaps with QSPR for ionic liquids: Modeling n-octanol/water coefficient. J. Hazard. Mater. 2016, 303, 137–144.
13. Khan, P.M.; Sanderson, H.; Roy, K. QSTR and interspecies-QSTR modelling for aquatic toxicity data gap filling of cationic polymers. Comput. Toxicol. 2021, 20, 100181.
14. Du, R.; Zhang, Q.; Wang, B.; Huang, J.; Deng, S.; Yu, G. Quantitative structure-activity relationship models for the reaction rate coefficients between dissolved organic matter and PPCPs. J. Hazard. Mater. 2023, 458, 131845.
15. Olker, J.H.; Elonen, C.M.; Pilli, A.; Anderson, A.; Kinziger, B.; Erickson, S.; LaLone, C.A.; Russom, C.L.; Hoff, D. The ECOTOXicology knowledgebase: A curated database of ecologically relevant toxicity tests to support environmental research and risk assessment. Environ. Toxicol. Chem. 2022, 41, 1520–1539.
16. Halder, A.K.; Moura, A.S.; Cordeiro, M.N.D.S. Moving average-based multitasking in silico classification modeling: Where do we stand and what is next? Int. J. Mol. Sci. 2022, 23, 4937.
17. Sushko, I.; Novotarskyi, S.; Körner, R.; Pandey, A.K.; Rupp, M.; Teetz, W.; Brandmaier, S.; Abdelaziz, A.; Prokopenko, V.V.; Tanchuk, V.Y.; et al. Online chemical modeling environment (OCHEM): Web platform for data storage, model development and publishing of chemical information. J. Comput.-Aided Mol. Des. 2011, 25, 533–554.
18. Menezes, F.; Popowicz, G.M. ULYSSES: An efficient and easy to use semiempirical library for C++. J. Chem. Inf. Model. 2022, 62, 3685–3694.
19. Halder, A.K.; Cordeiro, M.N.D.S. QSAR-Co-X: An open source toolkit for multitarget QSAR modelling. J. Cheminform. 2021, 13, 29.
20. Ambure, P.; Halder, A.K.; Gonzalez-Diaz, H.; Cordeiro, M.N.D.S. QSAR-Co: An open source software for developing robust multitasking or multitarget classification-based QSAR models. J. Chem. Inf. Model. 2019, 59, 2538–2544.
21. Gower, J.C. A comparison of some methods of cluster analysis. Biometrics 1967, 23, 623–637.
22. Beauchaine, T.P.; Beauchaine, R.J., 3rd. A comparison of maximum covariance and K-means cluster analysis in classifying cases into known taxon groups. Psychol. Methods 2002, 7, 245–261.
23. Zhao, L.; Fang, J.; Ji, Y.; Zhang, Y.; Zhou, X.; Yin, J.; Zhang, M.; Bao, W. K-means cluster analysis of characteristic patterns of allergen in different ages: Real life study. Clin. Transl. Allergy 2023, 13, e12281.
24. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
25. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
26. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
27. Guang-Bin, H.; Babri, H.A. Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Trans. Neural Netw. 1998, 9, 224–229.
28. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
29. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
30. Zou, Q.; Boughorbel, S.; Jarray, F.; El-Anbari, M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 2017, 12, e0177678.
31. Roy, K.; Kar, S.; Ambure, P. On a simple approach for determining applicability domain of QSAR models. Chemom. Intell. Lab. Syst. 2015, 145, 22–29.
32. Ambure, P.; Bhat, J.; Puzyn, T.; Roy, K. Identifying natural compounds as multi-target-directed ligands against Alzheimer’s disease: An in silico approach. J. Biomol. Struct. Dyn. 2018, 37, 1282–1306.
33. Kier, L.B.; Hall, L.H. An Electrotopological-state Index for Atoms in Molecules. Pharm. Res. 1990, 7, 801–807.
34. Wang, J.; Wang, W.; Huo, S.; Lee, M.; Kollman, P.A. Solvation Model based on weighted solvent accessible surface area. J. Phys. Chem. B 2001, 105, 5055–5067.
35. Todeschini, R.; Gramatica, P. The Whim Theory: New 3D molecular descriptors for QSAR in environmental modelling. SAR QSAR Environ. Res. 1997, 7, 89–115.
36. Gramatica, P. WHIM descriptors of shape. QSAR Comb. Sci. 2006, 25, 327–332.
37. Velázquez-Libera, J.L.; Caballero, J.; Toropova, A.P.; Toropov, A.A. Estimation of 2D autocorrelation descriptors and 2D Monte Carlo descriptors as a tool to build up predictive models for acetylcholinesterase (AChE) inhibitory activity. Chemom. Intell. Lab. Syst. 2019, 184, 14–21.
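Parameter | Equation |
---|---|
Sensitivity (Se) | $\mathrm{Se} = \dfrac{TP}{TP + FN} \times 100$ |
Specificity (Sp) | $\mathrm{Sp} = \dfrac{TN}{TN + FP} \times 100$ |
Accuracy (Acc) | $\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN} \times 100$ |
Matthews correlation coefficient (MCC) | $\mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |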
Model a | FS Technique | Scoring | CV | MCC (Sub-Training) | MCC (Test) | MCC (Average) |
---|---|---|---|---|---|---|
L01 | FS | - | - | 0.742 | 0.712 | 0.727 |
L02 | SFS | Accuracy | 0 | 0.815 | 0.713 | 0.764 |
L03 | SFS | AUROC | 0 | 0.748 | 0.773 | 0.761 |
L04 | SFS | Accuracy | 5 | 0.777 | 0.722 | 0.750 |
L05 | SFS | AUROC | 5 | 0.745 | 0.772 | 0.759 |
L06 | SFS | Accuracy | 10 | 0.745 | 0.722 | 0.734 |
L07 | SFS | AUROC | 10 | 0.754 | 0.772 | 0.763 |
L08 | GA | - | - | 0.783 | 0.811 | 0.797 |
Model Equation: = +1.917 − 3.148 ∆(Ms)sn − 1.426 ∆(MATS7v)sn + 1.597 ∆(MATS8v)sn − 3.342 ∆(-C(=O)-[N, aromatic attach])sn + 5.001 ∆(E1s)sn + 1.464 ∆(GATS3m)sg + 7.486 ∆(Dm)te + 0.065 ∆(XLOGP2)sn − 19.523 ∆(De)co + 6.573 ∆(MATS1e)te

Statistic | Sub-Training | Test | Validation |
---|---|---|---|
TP | 316 | 64 | 120 |
TN | 256 | 82 | 151 |
FP | 37 | 8 | 13 |
FN | 32 | 7 | 64 |
Se | 90.80 | 90.14 | 65.22 |
Sp | 87.37 | 91.11 | 92.07 |
Acc | 89.23 | 90.68 | 77.87 |
MCC | 0.783 | 0.811 | 0.589 |
AUROC | 0.891 | 0.906 | 0.784 |

Model | FS Technique | Machine Learning | MCC (Sub-Training) | MCC (Test) | MCC (Average a) | MCC (Validation b) |
---|---|---|---|---|---|---|
NL01 | L08 descriptors | RF | 0.813 | 0.824 | 0.819 | 0.469 |
NL02 | L08 descriptors | kNN | 0.739 | 0.713 | 0.726 | ND |
NL03 | L08 descriptors | SVC | 0.745 | 0.811 | 0.778 | ND |
NL04 | L08 descriptors | MLP | 0.780 | 0.736 | 0.758 | ND |
NL05 | L08 descriptors | GB | 0.802 | 0.773 | 0.788 | ND |
NL06 | All descriptors | RF | 0.824 | 0.864 | 0.844 | 0.527 |
NL07 | All descriptors | kNN | 0.683 | 0.646 | 0.665 | ND |
NL08 | All descriptors | SVC | 0.414 | 0.460 | 0.437 | ND |
NL09 | All descriptors | MLP | 0.673 | 0.710 | 0.692 | ND |
NL10 | All descriptors | GB | 0.827 | 0.849 | 0.838 | 0.517 |
NL11 | RFE | RF | 0.809 | 0.773 | 0.791 | ND |
NL12 | RFE | kNN | 0.745 | 0.773 | 0.759 | ND |
NL13 | RFE | SVC | 0.764 | 0.773 | 0.769 | ND |
NL14 | RFE | MLP | 0.733 | 0.723 | 0.728 | ND |
NL15 | RFE | GB | 0.807 | 0.798 | 0.803 | 0.469 |
NL16 | RF-importance | RF | 0.811 | 0.823 | 0.817 | 0.515 |
NL17 | RF-importance | kNN | 0.701 | 0.772 | 0.737 | ND |
NL18 | RF-importance | SVC | 0.755 | 0.760 | 0.758 | ND |
NL19 | RF-importance | MLP | 0.767 | 0.747 | 0.757 | ND |
NL20 | RF-importance | GB | 0.796 | 0.735 | 0.766 | ND |
NL21 | SFS | RF | 0.746 | 0.799 | 0.773 | ND |
NL22 | SFS | kNN | 0.714 | 0.700 | 0.707 | ND |
NL23 | SFS | SVC | 0.736 | 0.749 | 0.743 | ND |
NL24 | SFS | MLP | 0.647 | 0.686 | 0.667 | ND |
NL25 | SFS | GB | 0.733 | 0.710 | 0.722 | ND |
NL26 | GA-kNN | RF | 0.794 | 0.811 | 0.803 | 0.498 |
NL27 | GA-kNN | kNN | 0.685 | 0.750 | 0.718 | ND |
NL28 | GA-kNN | SVC | 0.732 | 0.811 | 0.772 | ND |
NL29 | GA-kNN | MLP | 0.749 | 0.774 | 0.762 | ND |
NL30 | GA-kNN | GB | 0.796 | 0.849 | 0.823 | 0.532 |

Deviation Descriptor | Description of Core Descriptor |
---|---|
∆(Ms)sn | Ms: Mean electrotopological state |
∆(Dm)te | Dm: D total accessibility index, weighted by mass |
∆(E1s)sn | E1s: 1st component accessibility directional WHIM index, weighted by I-state |
∆(MATS1e)te | MATS1e: Moran autocorrelation of lag 1, weighted by Sanderson electronegativity |
∆(De)co | De: D total accessibility index, weighted by Sanderson electronegativity |
∆(XLOGP2)sn | XLOGP2: Squared Wang octanol water partition coefficient |
∆(GATS3m)sg | GATS3m: Geary autocorrelation of lag 3, weighted by atomic masses |
∆(MATS7v)sn | MATS7v: Moran autocorrelation of lag 7, weighted by atomic van der Waals volumes |
∆(MATS8v)sn | MATS8v: Moran autocorrelation of lag 8, weighted by atomic van der Waals volumes |
∆(CONA)co | CONA: C(=O)-[nitrogen, aromatic attach] |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Halder, A.K.; Pradhan, T.; Cordeiro, M.N.D.S. Predicting the Aquatic Toxicity of Pharmaceutical and Personal Care Products: A Multitasking Modeling Approach. Appl. Sci. 2025, 15, 1246. https://doi.org/10.3390/app15031246