# Exploring Dimensionality Reduction Techniques for Deep Learning Driven QSAR Models of Mutagenicity


## Abstract


## 1. Introduction

^{4} [8]. Principal component analysis (PCA) was used to reduce dimensionality to within the 10^{2} order of magnitude, which enabled overall accuracy scores of ~70% (for individual models using single types of feature vector) and ~78% (for combined models, using both types of feature vector) [8]. Although demonstrably effective, PCA is a linear dimensionality reduction technique that may fail to sufficiently preserve information existing across higher-dimensional manifolds; the authors indeed acknowledged this limitation and noted that a potential expansion of their study could entail exploring alternative dimensionality reduction techniques, particularly non-linear ones [8]. Other studies involving deep learning driven QSAR models have made use of dimensionality reduction techniques such as PCA, genetic algorithms, locally linear embedding (LLE) and autoencoders; however, the choice of specific algorithm has frequently been incidental, and relatively few comparisons have been drawn. A research gap has therefore been identified for specifically exploring the optimisation and performance of dimensionality reduction algorithms in this space [7,8,9,10,11].
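To illustrate the linear baseline under discussion, the sketch below reduces a high-dimensional feature matrix to 100 components with SVD-based PCA. This is a minimal, illustrative version (synthetic data stands in for the real 11,268-dimensional feature vectors; the study itself used scikit-learn [24]):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (SVD-based PCA)."""
    Xc = X - X.mean(axis=0)                # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T        # scores in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))           # stand-in for an 11,268-D feature matrix
Z = pca_reduce(X, 100)
print(Z.shape)                             # (500, 100)
```

Being linear, each reduced coordinate is a fixed linear combination of the original features — precisely the property that may discard information spread across curved manifolds.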

## 2. Materials and Methods

#### 2.1. Hypothesis and Research Method Overview

#### 2.2. Data Collection and Pre-Processing

^{2}) TCs, with each row as a feature vector (each with a dimensionality of 11,268, to be later reduced) for each molecule as a sample; this arrangement is visualised in Figure 1:
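A minimal sketch of constructing such an N × N feature space of Tanimoto coefficients (TCs), assuming binary fingerprints and a hypothetical `tanimoto_matrix` helper — the study derived its TCs from molecular fingerprints, whereas this toy version uses tiny hand-made bit vectors:

```python
import numpy as np

def tanimoto_matrix(fps):
    """N x N matrix of Tanimoto coefficients between binary fingerprints:
    |A & B| / |A | B| for each pair, with empty-vs-empty defined as 1."""
    F = np.asarray(fps, dtype=bool)
    inter = (F[:, None, :] & F[None, :, :]).sum(axis=2)
    union = (F[:, None, :] | F[None, :, :]).sum(axis=2)
    return np.where(union == 0, 1.0, inter / np.maximum(union, 1))

fps = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]  # 3 toy fingerprints
T = tanimoto_matrix(fps)
# Each row of T is one molecule's feature vector; diagonal entries are 1.
```

In the study itself N = 11,268, so each row is an 11,268-dimensional vector — the input to the dimensionality reduction techniques compared below.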

#### 2.3. Overview of Dimensionality Reduction Techniques

#### 2.4. Grid Search for Hyperparameter Optimisation

- PCA, as a comparatively simpler technique with already demonstrated effectiveness via default values in our previous research (which used the same dataset and closely comparable feature engineering), was excluded from grid search efforts; default values for its more trivial hyperparameters were simply used [8,20,24].
- kPCA was similarly not considered for a grid search, although two different kernels were used for comparison purposes: an RBF kernel and a sigmoid kernel. Aside from this, default values for more trivial hyperparameters were used [24].
- The FastICA algorithm was used for ICA, with whitening strategy varied between use of arbitrary variance (default) and use of unit variance (note that it was assumed that the data was not already whitened), as well as maximum number of allowed iterations of the algorithm linearly varied between 200 (default) and 1000, with steps of size 100. Aside from this, default values were used for other hyperparameters [24].
- Autoencoders were configured with the sigmoid activation function in encoder neurons and the ReLU activation function in decoder neurons. The Adam optimisation algorithm was used for stochastic optimisation, along with mean squared error as the loss function. Shuffling of data between epochs was enabled, whereas the number of epochs was varied geometrically between 10 and 1280 (doubling with each step). The numbers of layers in the encoder and decoder were kept equal and varied via the “number of steps to bottleneck”, i.e., the number of layers before and after the bottleneck layer, for the encoder and decoder respectively; this number was varied linearly between 1 (the simplest type of autoencoder, with 3 layers in total) and 5 (a more complex autoencoder, with 11 layers in total). For each given number of steps to bottleneck, the sizes of the first and last autoencoder layers were always 11,268 (the original dimensionality), whereas the size of the bottleneck layer was that of the chosen latent space; the sizes of the layers in between were configured as a geometric series, such that (for example) the decoder layers would start from the latent space size and reach 11,268 within the chosen number of steps, maintaining the same common ratio throughout. This common ratio naturally changed for different numbers of steps to bottleneck and was hence recalculated for each such value. Each common ratio was equivalently used for configuring the sizes of the encoder layers, except via division rather than multiplication (as the geometric series operates in reverse). Any decimal values for layer sizes obtained via these geometric series were simply rounded to the nearest whole number.
- LLE was varied in terms of the strength of the regularisation constant and the number of neighbours considered for each point. The regularisation constant was varied geometrically between 10^{−6} and 10^{−2} (note that the default value was 10^{−3}), with a common ratio of 10. The number of neighbours considered was varied linearly between 5 (default) and 115, with step sizes of 10. Aside from this, default values were used for other hyperparameters [24].
- Isomap was varied in terms of the number of neighbours considered for each point, as well as the eigenvalue decomposition algorithm. The number of neighbours was varied in an identical manner to that of LLE, whereas the eigenvalue decomposition algorithm was varied between Arnoldi decomposition and the LAPACK (linear algebra package) solver [24]. It should be noted that for Arnoldi decomposition, it was further necessary to specify a maximum number of iterations, which was hence varied in an identical manner to that used for the ICA hyperparameter grid search [24]. Aside from this, default values were used for other hyperparameters [24].
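The geometric layer-sizing scheme described for the autoencoders above can be sketched as follows. This is a hypothetical helper, not the study's code; it reproduces the common-ratio calculation and nearest-integer rounding described in the list:

```python
def bottleneck_layer_sizes(input_dim, latent_dim, steps_to_bottleneck):
    """Encoder layer sizes from input_dim down to latent_dim, shrinking by a
    constant common ratio at each step (the decoder mirrors this in reverse)."""
    ratio = (input_dim / latent_dim) ** (1.0 / steps_to_bottleneck)
    sizes = [round(input_dim / ratio ** k) for k in range(steps_to_bottleneck + 1)]
    sizes[-1] = latent_dim  # guard against rounding drift at the bottleneck
    return sizes

# e.g., 3 steps from the original 11,268 dimensions to a 300-D latent space
print(bottleneck_layer_sizes(11268, 300, 3))
```

For a latent space of 300 and 3 steps to bottleneck, the common ratio is (11,268/300)^{1/3} ≈ 3.35, giving a smoothly tapering encoder with the decoder as its mirror image.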

#### 2.5. Final Comparison of Dimensionality Reduction Techniques

#### 2.6. Defining the Applicability Domain

## 3. Results and Discussion

#### 3.1. Grid Search

^{−6}, whereas the number of neighbours would be controlled at 115 (as this was the lowest number of neighbours, with the lowest uncertainty value, which resulted in the most optimal performance when using a regularisation constant of 10^{−6}).
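A hedged sketch of applying these controlled values with scikit-learn's `LocallyLinearEmbedding` [24]. Synthetic data stands in for the molecular feature vectors, and the dense eigensolver is chosen here purely for numerical robustness on random data — neither the data nor the solver choice reflects the study's actual runs:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic stand-in data: 300 samples, 50 features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))

# The controlled values discussed above: regularisation 1e-6, 115 neighbours
lle = LocallyLinearEmbedding(n_neighbors=115, n_components=2,
                             reg=1e-6, eigen_solver="dense")
Z = lle.fit_transform(X)
print(Z.shape)  # (300, 2)
```

Note that `n_neighbors` must be smaller than the number of samples, so 115 neighbours presupposes the large dataset used in the study.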

#### 3.2. Comparative Performances of Dimensionality Reduction Techniques

#### 3.3. Analysis of Applicability Domain

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

**Figure A1.** Bar chart of frequencies of the top 50 most abundant ToxPrint chemotypes, for the curated 11,268 molecules used in the study, using the ToxPrint ChemoTyper software tool [28].

**Figure A2.** Histograms characterising the distribution of the 11,268 curated molecules in terms of: (**a**) Number of atoms (excluding H atoms); (**b**) Molecular weight (g/mol), according to PubChem [13]. Note that in both cases, 50 bins were used.

## Appendix B

**Figure A3.** Comparative performance metrics graphs for all optimised dimensionality reduction techniques, over ascending dimensionalities, for molecules outside of the defined AD, in terms of: (**a**) Sensitivity; (**b**) Specificity; (**c**) PPV (positive predictive value); (**d**) NPV (negative predictive value).

## Appendix C

**Figure A4.** Alternatively coloured version of Figure 9, for the benefit of colourblind readers. Visualisations of the distributions in chemical space of: (**a**) Molecules in the dataset that were mutagenic versus non-mutagenic; (**b**) Molecules in the dataset that were mainly predicted as mutagenic versus mainly predicted as non-mutagenic (according to the autoencoder powered QSAR model at 300 dimensions, across all iterations). Note that in all cases, the median position of the entire dataset is marked, for reference.

## References

1. Larsen, J.C. Risk assessment of chemicals in European traditional foods. Trends Food Sci. Technol. **2006**, 17, 471–481.
2. Escher, S.E.; Kamp, H.; Bennekou, S.H.; Bitsch, A.; Fisher, C.; Graepel, R.; Hengstler, J.G.; Herzler, M.; Knight, D.; Leist, M.; et al. Towards grouping concepts based on new approach methodologies in chemical hazard assessment: The read-across approach of the EU-ToxRisk project. Arch. Toxicol. **2019**, 93, 3643–3667.
3. Gramatica, P. On the development and validation of QSAR models. In Computational Toxicology; Springer: Berlin/Heidelberg, Germany, 2013; pp. 499–526.
4. Honma, M.; Kitazawa, A.; Cayley, A.; Williams, R.V.; Barber, C.; Hanser, T.; Saiakhov, R.; Chakravarti, S.; Myatt, G.J.; Cross, K.P.; et al. Improvement of quantitative structure–activity relationship (QSAR) tools for predicting Ames mutagenicity: Outcomes of the Ames/QSAR International Challenge Project. Mutagenesis **2019**, 34, 3–16.
5. Kumar, R.; Khan, F.U.; Sharma, A.; Siddiqui, M.H.; Aziz, I.B.; Kamal, M.A.; Ashraf, G.; Alghamdi, B.S.; Uddin, S. A deep neural network–based approach for prediction of mutagenicity of compounds. Environ. Sci. Pollut. Res. **2021**, 28, 47641–47650.
6. Hung, C.; Gini, G. QSAR modeling without descriptors using graph convolutional neural networks: The case of mutagenicity prediction. Mol. Divers. **2021**, 25, 1283–1299.
7. Idakwo, G.; Iv, J.L.; Chen, M.; Hong, H.; Gong, P.; Zhang, C. A Review of Feature Reduction Methods for QSAR-Based Toxicity Prediction; Springer International Publishing: Midtown Manhattan, NY, USA, 2019; pp. 119–139.
8. Kalian, A.D.; Benfenati, E.; Osborne, O.J.; Dorne, J.-L.C.M.; Gott, D.; Potter, C.P.; Guo, M.; Hogstrand, C. Improving accuracy scores of neural network driven QSAR models of mutagenicity. In Proceedings of the 33rd European Symposium on Computer Aided Process Engineering: ESCAPE-33, Athens, Greece, 18–21 June 2023; Elsevier: Amsterdam, The Netherlands, 2023; p. 846, in press.
9. Kausar, S.; Falcao, A.O. Analysis and Comparison of Vector Space and Metric Space Representations in QSAR Modeling. Molecules **2019**, 24, 1698.
10. Alsenan, S.; Al-Turaiki, I.; Hafez, A. Autoencoder-based Dimensionality Reduction for QSAR Modeling. In Proceedings of the 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 19–21 March 2020; pp. 1–4.
11. L’heureux, P.-J.; Carreau, J.; Bengio, Y.; Delalleau, O.; Yue, S.Y. Locally Linear Embedding for dimensionality reduction in QSAR. J. Comput. Aided Mol. Des. **2004**, 18, 475–482.
12. Cover, T.M. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Trans. Electron. Comput. **1965**, EC-14, 326–334.
13. Kim, S.; Thiessen, P.A.; Bolton, E.E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B.A.; et al. PubChem substance and compound databases. Nucleic Acids Res. **2016**, 44, D1202–D1213.
14. Swain, M. MolVS: Molecule Validation and Standardization. 2016. Available online: https://molvs.readthedocs.io/en/latest/ (accessed on 11 May 2023).
15. Landrum, G. RDKit: Open-Source Cheminformatics Software. 2016. Available online: https://www.rdkit.org/ (accessed on 11 May 2023).
16. De, P.; Kar, S.; Ambure, P.; Roy, K. Prediction reliability of QSAR models: An overview of various validation tools. Arch. Toxicol. **2022**, 96, 1279–1295.
17. Cereto-Massagué, A.; Ojeda, M.J.; Valls, C.; Mulero, M.; Garcia-Vallvé, S.; Pujadas, G. Molecular fingerprint similarity search in virtual screening. Methods **2015**, 71, 58–63.
18. Roweis, S.T.; Saul, L.K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science **2000**, 290, 2323–2326.
19. Schölkopf, B.; Smola, A.; Müller, K.R. Kernel principal component analysis. In Proceedings of the Artificial Neural Networks—ICANN’97: 7th International Conference, Lausanne, Switzerland, 8–10 October 1997; Springer: Berlin/Heidelberg, Germany, 1997; pp. 583–588.
20. Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. **2010**, 2, 433–459.
21. Comon, P. Independent component analysis, a new concept? Signal Process. **1994**, 36, 287–314.
22. Wang, Y.; Yao, H.; Zhao, S. Auto-encoder based dimensionality reduction. Neurocomputing **2016**, 184, 232–242.
23. Tenenbaum, J.B.; de Silva, V.; Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science **2000**, 290, 2319–2323.
24. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Duchesnay, E. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
25. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org (accessed on 11 May 2023).
26. Feurer, M.; Hutter, F. Hyperparameter Optimization. In Automated Machine Learning: Methods, Systems, Challenges; Springer Nature: Singapore, 2019; pp. 3–33.
27. Sahigara, F.; Mansouri, K.; Ballabio, D.; Mauri, A.; Consonni, V.; Todeschini, R. Comparison of Different Approaches to Define the Applicability Domain of QSAR Models. Molecules **2012**, 17, 4791–4810.
28. Yang, C.; Tarkhov, A.; Marusczyk, J.; Bienfait, B.; Gasteiger, J.; Kleinoeder, T.; Magdziarz, T.; Sacher, O.; Schwab, C.H.; Schwoebel, J.; et al. New Publicly Available Chemical Query Language, CSRML, To Support Chemotype Representations for Application to Data Mining and Modeling. J. Chem. Inf. Model. **2015**, 55, 510–528.

**Figure 1.** Schematic diagram of how an N × N matrix of TCs was constructed as a feature space for N molecules, along with corresponding feature vectors for each molecule. In this study, N = 11,268.

**Figure 2.** For ICA: (**a**) Heat map of mean overall accuracy scores obtained from grid search, over all 5 iterations; (**b**) Complementary heat map of percentage uncertainty values (standard deviation) on mean overall accuracy scores.

**Figure 3.** For autoencoders: (**a**) Heat map of mean overall accuracy scores obtained from grid search, over all 5 iterations; (**b**) Complementary heat map of percentage uncertainty values (standard deviation) on mean overall accuracy scores.

**Figure 4.** For LLE: (**a**) Heat map of mean overall accuracy scores obtained from grid search, over all 5 iterations; (**b**) Complementary heat map of percentage uncertainty values (standard deviation) on mean overall accuracy scores.

**Figure 5.** For Isomap: (**a**) Heat map of mean overall accuracy scores obtained from grid search, over all 5 iterations; (**b**) Complementary heat map of percentage uncertainty values (standard deviation) on mean overall accuracy scores.

**Figure 6.** Comparative performance metrics graphs for all optimised dimensionality reduction techniques, over ascending dimensionalities, in terms of: (**a**) Overall accuracy; (**b**) Sensitivity; (**c**) Specificity; (**d**) PPV (positive predictive value); (**e**) NPV (negative predictive value). Note that, to aid clarity in comparing the more pertinent high-performance results at higher dimensionalities, the y-axis cutoffs exclude lower performances at lower dimensionalities.

**Figure 7.** Visualisations of the AD and the positions of testing data molecules relative to it, for: (**a**) Iteration 0 (i.e., using fold no. 1 as testing data); (**b**) Iteration 1 (i.e., using fold no. 2 as testing data); (**c**) Iteration 2 (i.e., using fold no. 3 as testing data); (**d**) Iteration 3 (i.e., using fold no. 4 as testing data); (**e**) Iteration 4 (i.e., using fold no. 5 as testing data). Note that in all cases, the total testing data was composed of the given test fold combined with a separate pool of permanently assigned testing data.

**Figure 8.** Line graph displaying the correct classification rate (equivalent to overall accuracy) of molecules outside the AD for each given iteration, by QSAR models, over different reduced dimensionalities (and powered by different dimensionality reduction techniques).

**Figure 9.** Visualisations of the distributions in chemical space of: (**a**) Molecules in the dataset that were mutagenic versus non-mutagenic; (**b**) Molecules in the dataset that were mainly predicted as mutagenic versus mainly predicted as non-mutagenic (according to the autoencoder powered QSAR model at 300 dimensions, across all iterations). Note that in all cases, the median position of the entire dataset is marked, for reference. For an alternative colour scheme, see Appendix C.

**Figure 10.** Histograms of Euclidean distances from the median point in XLogP/MW chemical space for: (**a**) Molecules in the dataset that were mutagenic versus non-mutagenic; (**b**) Molecules in the dataset that were mainly predicted as mutagenic versus mainly predicted as non-mutagenic (according to the autoencoder powered QSAR model at 300 dimensions, across all iterations). Note that 1000 bins were used for producing both histograms.

**Figure 11.** Graphical plots showcasing: (**a**) Average correct classification rate for molecules, over binned Euclidean distances from the median point in XLogP/MW chemical space (note that 30 bins were used and that some spaces between plotted points are uneven, due to exclusion of empty bins); (**b**) Average correct classification rate for molecules within maximum thresholds for Euclidean distances from the median point in XLogP/MW chemical space (note that 1000 thresholds were used). Also note that both graphs concern QSAR models that used 300-dimensional data, over all iterations.

**Figure 12.** Heat maps of chemical space region-specific overall QSAR model performance at 300 dimensions, over all iterations, for: (**a**) PCA; (**b**) kPCA (sigmoid function); (**c**) kPCA (RBF); (**d**) ICA; (**e**) Autoencoders; (**f**) LLE; (**g**) Isomap. A further plot (**h**) displays whether linear (red with no lines) or non-linear (blue with diagonal lines) techniques, or both equally (purple with dotted pattern), performed most optimally for the given regions. Note that in all cases, chemical space was discretised into 30 equal bins of MW, along with 30 equal bins of XLogP, with empty bins of no data coloured grey.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kalian, A.D.; Benfenati, E.; Osborne, O.J.; Gott, D.; Potter, C.; Dorne, J.-L.C.M.; Guo, M.; Hogstrand, C.
Exploring Dimensionality Reduction Techniques for Deep Learning Driven QSAR Models of Mutagenicity. *Toxics* **2023**, *11*, 572.
https://doi.org/10.3390/toxics11070572

**AMA Style**

Kalian AD, Benfenati E, Osborne OJ, Gott D, Potter C, Dorne J-LCM, Guo M, Hogstrand C.
Exploring Dimensionality Reduction Techniques for Deep Learning Driven QSAR Models of Mutagenicity. *Toxics*. 2023; 11(7):572.
https://doi.org/10.3390/toxics11070572

**Chicago/Turabian Style**

Kalian, Alexander D., Emilio Benfenati, Olivia J. Osborne, David Gott, Claire Potter, Jean-Lou C. M. Dorne, Miao Guo, and Christer Hogstrand.
2023. "Exploring Dimensionality Reduction Techniques for Deep Learning Driven QSAR Models of Mutagenicity" *Toxics* 11, no. 7: 572.
https://doi.org/10.3390/toxics11070572