# Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints

^{1}

^{2}

^{3}

^{4}

^{5}

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Results and Discussion

#### 2.1. Setting the Baseline

#### 2.2. Embedding Chemical Spaces

#### 2.3. Impact of Embedding Size and Information Content

#### 2.4. Internal versus External Knowledge

#### 2.5. Should We Embed? Does Embedding Win over Baseline?

#### 2.6. Insights into Latent Representations

## 3. Materials and Methods

#### 3.1. Data

#### 3.2. Machine Learning Methods

#### 3.3. Transfer Learning with Embeddings

#### 3.3.1. Principal Component Analysis (PCA)

**X**~(n,p)) are represented by the product of two matrices, namely the scores (

**T**~(n,k)) and the loadings (

**P**~(p,k)), Equation (3):

**E**~(n,p) is the residual matrix and n, p, and k are the number of samples, variables, and components, respectively. The parameters are estimated to capture as much of the variance in the original data in a least squares sense, and further to be orthogonal matrices, i.e.,

**T**and

**P**are referred to as principal components, and used in various ways in, e.g., exploratory data analysis to map the multivariate sample distribution as well as interrogating feature2feature correlation structure, as well as—like in this work—to represent the data in a few meaningful features used for further analysis. A rewrite of Equation (4) above shows that the score space (

**T**) is a linear mapping by the orthogonal basis represented by

**P**:

**T**=

**XP**, and hence a rotation of the coordinate system as depicted in Figure 7.

#### 3.3.2. Uniform Manifold Approximation and Projection (UMAP)

#### 3.3.3. Variational Autoencoders (VAE)

#### 3.3.4. Embedder Training

#### 3.3.5. Modeling

**(1)**data for CS1 and CS2 are loaded, where CS2 involves the Tox21 modeling data (fingerprints—FPs, labels/endpoints) and CS2 fingerprints;

**(2)**FPs columns for both CS1 and CS2 below 5% variance are removed;

**(3)**removal of structures from CS2 which appear in CS1;

**(4)**train and apply embedders; and

**(5)**optimize classification models and apply them on embedded data, as shown in Table 6.

## 4. Limitations and Future Outlook

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- David, L.; Thakkar, A.; Mercado, R.; Engkvist, O. Molecular representations in AI-driven drug discovery: A review and practical guide. J. Cheminform.
**2020**, 12, 56. [Google Scholar] [CrossRef] [PubMed] - Ghasemi, F.; Mehridehnavi, A.; Pérez-Garrido, A.; Pérez-Sánchez, H. Neural network and deep-learning algorithms used in QSAR studies: Merits and drawbacks. Drug Discov. Today
**2018**, 23, 1784–1790. [Google Scholar] [CrossRef] [PubMed] - Mayr, A.; Klambauer, G.; Unterthiner, T.; Hochreiter, S. DeepTox: Toxicity prediction using deep learning. Front. Environ. Sci.
**2016**, 3. [Google Scholar] [CrossRef] [Green Version] - Prykhodko, O.; Johansson, S.V.; Kotsias, P.-C.; Arús-Pous, J.; Bjerrum, E.J.; Engkvist, O.; Chen, H. A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminform.
**2019**, 11, 74. [Google Scholar] [CrossRef] [Green Version] - Lusci, A.; Pollastri, G.; Baldi, P. Deep architectures and deep learning in chemoinformatics: The prediction of aqueous solubility for drug-like molecules. J. Chem. Inf. Model.
**2013**, 53, 1563–1575. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Capuccini, M.; Ahmed, L.; Schaal, W.; Laure, E.; Spjuth, O. Large-scale virtual screening on public cloud resources with Apache Spark. J. Cheminformatics
**2017**, 9, 15. [Google Scholar] [CrossRef] [Green Version] - Lovrić, M.; Molero, J.M.; Kern, R. PySpark and RDKit: Moving towards big data in cheminformatics. Mol. Inform.
**2019**, 38, e1800082. [Google Scholar] [CrossRef] - Tetko, I.V.; Engkvist, O.; Chen, H. Does “Big Data” exist in medicinal chemistry, and if so, how can it be harnessed? Future Med. Chem.
**2016**, 8, 1801–1806. [Google Scholar] [CrossRef] [Green Version] - Chen, H.; Kogej, T.; Engkvist, O. Cheminformatics in drug discovery, an industrial perspective. Mol. Inform.
**2018**, 37. [Google Scholar] [CrossRef] - Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model.
**2010**, 50, 742–754. [Google Scholar] [CrossRef] - Jaeger, S.; Fulle, S.; Turk, S. Mol2vec: Unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model.
**2018**, 58, 27–35. [Google Scholar] [CrossRef] [PubMed] - Jiang, D.; Wu, Z.; Hsieh, C.-Y.; Chen, G.; Liao, B.; Wang, Z.; Shen, C.; Cao, D.; Wu, J.; Hou, T. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J. Cheminform.
**2021**, 13, 12. [Google Scholar] [CrossRef] [PubMed] - Lovrić, M.; Malev, O.; Klobučar, G.; Kern, R.; Liu, J.; Lučić, B. Predictive capability of QSAR models based on the CompTox zebrafish embryo assays: An imbalanced classification problem. Molecules
**2021**, 26, 1617. [Google Scholar] [CrossRef] [PubMed] - Abdelaziz, A.; Spahn-Langguth, H.; Schramm, K.-W.; Tetko, I.V. Consensus modeling for HTS assays using in silico descriptors calculates the best balanced accuracy in Tox21 challenge. Front. Environ. Sci.
**2016**, 4, 2. [Google Scholar] [CrossRef] [Green Version] - Idakwo, G.; Thangapandian, S.; Luttrell, J.; Li, Y.; Wang, N.; Zhou, Z.; Hong, H.; Yang, B.; Zhang, C.; Gong, P. Structure–Activity relationship-based chemical classification of highly imbalanced Tox21 datasets. J. Cheminform.
**2020**, 12, 66. [Google Scholar] [CrossRef] [PubMed] - Lovrić, M.; Pavlović, K.; Žuvela, P.; Spataru, A.; Lučić, B.; Kern, R.; Wong, M.W. Machine learning in prediction of intrinsic aqueous solubility of drug-like compounds: Generalization, complexity, or predictive ability? J. Chemom.
**2021**, e3349. [Google Scholar] [CrossRef] - Bellman, R.E. Dynamic programming. Science
**1966**, 153, 34–37. [Google Scholar] [CrossRef] [PubMed] - Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the surprising behavior of distance metrics in high dimensional space. In Database Theory—ICDT 2001. Lecture Notes in Computer Science; van den Bussche, J., Vianu, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; Volume 1973. [Google Scholar] [CrossRef] [Green Version]
- Geng, X.; Zhan, D.-C.; Zhou, Z.-H. Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Trans. Syst. Man Cybern. Part B
**2005**, 35, 1098–1107. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Sakurada, M.; Yairi, T. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis—MLSDA’14, Gold Coast, QLD, Australia, 2 December 2014; p. 4. [Google Scholar]
- Duricic, T.; Hussain, H.; Lacic, E.; Kowald, D.; Helic, D.; Lex, E. Empirical comparison of graph embeddings for trust-based collaborative filtering. In Proceedings of the 25th International Symposium on Methodologies for Intelligent Systems, Graz, Austria, 23–25 September 2020. [Google Scholar]
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res
**2003**, 3, 993–1022. [Google Scholar] - Choi, S. Algorithms for orthogonal nonnegative matrix factorization. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–6 June 2008; pp. 1828–1832. [Google Scholar]
- Sampson, G.; Rumelhart, D.E.; McClelland, J.L. The PDP research group parallel distributed processing: Explorations in the microstructures of cognition. Language
**1987**, 63, 871. [Google Scholar] [CrossRef] - Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol.
**1933**, 24, 417–441. [Google Scholar] [CrossRef] - Van der Maaten, L. Hinton G visualizing data using t-SNE. J. Mach. Learn. Res.
**2008**, 9, 2579–2605. [Google Scholar] - Belkin, M.; Niyogi, P. Laplacian Eigenmaps for dimensionality reduction and data representation. Neural Comput.
**2003**, 15, 1373–1396. [Google Scholar] [CrossRef] [Green Version] - McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. J. Open Source Softw.
**2018**, 3, 861. [Google Scholar] [CrossRef] - Shrivastava, A.; Kell, D. FragNet, a contrastive learning-based transformer model for clustering, interpreting, visualizing, and navigating chemical space. Molecules
**2021**, 26, 2065. [Google Scholar] [CrossRef] - Probst, D.; Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminformatics
**2020**, 12, 12. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.-A.; Kwok, I.W.H.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol.
**2019**, 37, 38–44. [Google Scholar] [CrossRef] [PubMed] - Obermeier, M.M.; Wicaksono, W.A.; Taffner, J.; Bergna, A.; Poehlein, A.; Cernava, T.; Lindstaedt, S.; Lovric, M.; Bogotá, C.A.M.; Be1rg, G. Plant resistome profiling in evolutionary old bog vegetation provides new clues to understand emergence of multi-resistance. ISME J.
**2021**, 15, 921–937. [Google Scholar] [CrossRef] [PubMed] - Bengio, Y.; Lamblin, P.; Popovici, D.; Larochelle, H. Greedy layer-wise training of deep networks. Adv. Neural. Inf. Process. Syst.
**2007**, 153–160. [Google Scholar] [CrossRef] [Green Version] - Kingma, D.P.; Welling, M. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Kwon, Y.; Yoo, J.; Choi, Y.-S.; Son, W.-J.; Lee, D.; Kang, S. Efficient learning of non-autoregressive graph variational autoencoders for molecular graph generation. J. Cheminformatics
**2019**, 11, 70. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Bjerrum, E.J.; Sattarov, B. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules
**2018**, 8, 131. [Google Scholar] [CrossRef] [Green Version] - Zhang, J.; Mucs, D.; Norinder, U.; Svensson, F. LightGBM: An effective and scalable algorithm for prediction of chemical toxicity–application to the Tox21 and mutagenicity data sets. J. Chem. Inf. Model.
**2019**, 59, 4150–4158. [Google Scholar] [CrossRef] [PubMed] - Ding, J.; Li, X.; Gudivada, V.N. Augmentation and evaluation of training data for deep learning. In Proceedings of the 2017 IEEE International Conference on Big Data (IEEE Big Data 2017), Boston, MA, USA, 11–14 December 2017; pp. 2603–2611. [Google Scholar]
- Ehuang, R.; Exia, M.; Nguyen, D.-T.; Ezhao, T.; Esakamuru, S.; Ezhao, J.; Shahane, S.A.; Erossoshek, A.; Esimeonov, A. Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Front. Environ. Sci.
**2016**, 3, 85. [Google Scholar] [CrossRef] [Green Version] - Fernandez, M.; Ban, F.; Woo, G.; Hsing, M.; Yamazaki, T.; Leblanc, E.; Rennie, P.S.; Welch, W.J.; Cherkasov, A. Toxic colors: The use of deep learning for predicting toxicity of compounds merely from their graphic images. J. Chem. Inf. Model.
**2018**, 58, 1533–1543. [Google Scholar] [CrossRef] [PubMed] - Hemmerich, J.; Asilar, E.; Ecker, G. Conformational oversampling as data augmentation for molecules. In Transactions on Petri Nets and Other Models of Concurrency XV; Springer Science and Business Media LLC: Berlin/Heidelberg, Germany, 2019; pp. 788–792. [Google Scholar]
- Klimenko, K.; Rosenberg, S.A.; Dybdahl, M.; Wedebye, E.B.; Nikolov, N.G. QSAR modelling of a large imbalanced aryl hydrocarbon activation dataset by rational and random sampling and screening of 80,086 REACH pre-registered and/or registered substances. PLoS ONE
**2019**, 14, e0213848. [Google Scholar] [CrossRef] [PubMed] - Fourches, D.; Muratov, E.; Tropsha, A. Trust, but verify: On the importance of chemical structure curation in cheminformatics and QSAR modeling research. J. Chem. Inf. Model.
**2010**, 50, 1189–1204. [Google Scholar] [CrossRef] - Greg Landrum, RDKit. Available online: http://rdkit.org (accessed on 21 May 2020).
- Gütlein, M.; Kramer, S. Filtered circular fingerprints improve either prediction or runtime performance while retaining interpretability. J. Cheminform.
**2016**, 8, 60. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Landrum G RDKit: Colliding Bits III. Available online: http://rdkit.blogspot.com/2016/02/colliding-bits-iii.html (accessed on 23 December 2019).
- Alygizakis, N.; Slobodnik, J. S32 | REACH2017 | >68,600 REACH Chemicals. 2018. Available online: https://zenodo.org/record/4248826 (accessed on 23 December 2020).
- Breiman, L. Random forests. Mach. Learn.
**2001**, 45, 5–32. [Google Scholar] [CrossRef] [Green Version] - Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009. [Google Scholar]
- Cover, T.; Hart, P. Nearest neighbor pattern classfication. IEEE Trans. Inf. Theory
**1967**, 13, 21–27. [Google Scholar] [CrossRef] - Lučić, B.; Batista, J.; Bojović, V.; Lovrić, M.; Kržić, A.S.; Bešlo, D.; Nadramija, D.; Vikić-Topić, D. Estimation of random accuracy and its use in validation of predictive quality of classification models within predictive challenges. Croat. Chem. Acta
**2019**, 92, 379–391. [Google Scholar] [CrossRef] [Green Version] - Boughorbel, S.; Jarray, F.; El Anbari, M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE
**2017**, 12, e0177678. [Google Scholar] [CrossRef] - Žuvela, P.; Lovrić, M.; Yousefian-Jazi, A.; Liu, J.J. Ensemble learning approaches to data imbalance and competing objectives in design of an industrial machine vision system. Ind. Eng. Chem. Res.
**2020**, 59, 4636–4645. [Google Scholar] [CrossRef] - Lerman, P.M. Fitting segmented regression models by Grid Search. J. R. Stat. Soc. Ser. C Appl. Stat.
**1980**, 29, 77. [Google Scholar] [CrossRef] - Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn Res.
**2011**, 12, 2825–2830. [Google Scholar] [CrossRef] - Deisenroth, M.P.; Faisal, A.A.; Ong, C.S. Mathematics for Machine Learning; Cambridge University Press: Cambridge, UK, 2020; p. 391. [Google Scholar]
- Sainburg, T.; McInnes, L.; Gentner, T.Q. Parametric UMAP embeddings for representation and semi-supervised learning. arXiv
**2020**, arXiv:2009.12981. [Google Scholar] - Kramer, M.A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J.
**1991**, 37, 233–243. [Google Scholar] [CrossRef] - Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An introduction to variational methods for graphical models. Mach. Learn.
**1999**, 37, 183–233. [Google Scholar] [CrossRef]

**Figure 1.**Exemplary visualization of CS1 (red) and CS2 (blue) show in 2D embedded space generated from molecular fingerprints by means of (

**a**) principal component analysis (PCA), (

**b**) uniform manifold approximation and projection (UMAP), and (

**c**) variational autoencoders (VAEs).

**Figure 2.**Dependence of classification results or transferred embeddings of CS2 by means of Matthews correlation coefficient (MCC) on the log-size of CS1 and dimensions of the PCA embeddings. The three figures represent three classifying algorithms, namely, (

**a**) RFC, (

**b**) KNN, and (

**c**) LR.

**Figure 3.**Dependence of classification results or transferred embeddings of CS2 by means of MCC on the log-size of CS1 and dimensions of the UMAP embeddings. The three figures represent three classifying algorithms, namely, (

**a**) RFC, (

**b**) KNN, and (

**c**) LR.

**Figure 4.**Dependence of classification results or transferred embeddings of CS2 by means of MCC on the log-size of CS1 and dimensions of the VAE embeddings. The three figures represent three classifying algorithms, namely, (

**a**) RFC, (

**b**) KNN, and (

**c**) LR.

**Figure 5.**Comparison of machine learning classifications mean + error bar across all feature sets: FPR-BL = fingerprint baseline, PCA = principal component analysis (EX = external, IN = internal), UMP = uniform manifold approximation and projection, VAE = variational autoencoders. Each bar present nine runs (three classifiers × three random states).

**Figure 6.**Top: Silhouette plots for best (top left) and worst (top right) performing clustering, which are VAE with NR-AR and UMAP with SR-HSE, respectively. The gray and green masses are points in two classes (0 and 1) with their silhouette coefficients, respectively. The dashed red line is the average value of both. The scatter plot visualizes their coordinates in the 2D embedded space. The number is their center points for each class (0 and 1). We notice that, for the best clustering, the silhouette coefficient tends to be higher, and points that belong to NR-AR label visually agglomerate together. Conversely, the silhouette coefficient seems to be negative for the worst clustering, and the points belonging to SR-HSE label are visually indistinguishable from other points. Silhouette coefficients for externally embedded CS2 data (bottom) by means of the three embedders (PCA, UMAP, VAE) calculated per label, which are presented by the color map. We see that VAE gains a higher silhouette coefficient on most tasks than UMAP or PCA, indicating a better separation.

**Figure 7.**An example of dimensionality reduction by means of PCA. Instances/points in a 3D space (original space) are transformed into a 2D space of two latent variables called principal components (PC1 and PC2).

**Figure 8.**Visual explanation of how UMAP works. It first computes a graph representation of the input data, which is then used to learn embeddings that preserve the structure of the graph representation. Figures is redrawn based on ref. [57].

**Figure 9.**Architecture illustration of the variational autoencoder. Encoder compresses the input X into a latent representation Z. VAE is different to a standard autoencoder as it assumes that the input data have an underlying probability distribution (e.g., Gaussian) for which they try to optimize parameters. The decoder then attempts to reconstruct the original input from the representation by minimizing the reconstruction loss.

**Figure 10.**In external transfer learning, an embedder (PCA, UMAP, VAE) is fit on fingerprints on an external set of fingerprints (CS1). The same model (pre-trained embedder) is then utilized to encode fingerprints from CS2. In internal transfer learning, the embedder is fit on the pre-split train set of CS2 and used to encode the test set of CS2. The embeddings of CS2 were utilized for training predictive classification tasks.

**Figure 11.**Schematics of chemical space transformation from fingerprint through a pre-trained embedder model. The transformation can either be conducted from an external data set to the data set of interest or within the data set of interest, but split into the train and test set.

**Table 1.**A comparison of our baseline results trained on fingerprints to a similar study from Zhang et al. [37]. The results from Zhang are denoted with a “Z“, while the respective classifiers are as follows: L—lightGBM, R—random forests, S—support vector machines, X—XGBM, D—deep neural networks. The classifiers from this work are k-nearest neighbor classifier (KNN), logistic regression (LR), and random forests classifier (RFC), which are represented by their mean values per classifier, respectively. Additionally, the mean and max of all classifiers in this work are compared. The best baseline models in our work are marked with an superscript “a“, while the best models from Zhang are marked with an superscript “b“.

Label (endpoint) | Mean (all) | Max (all) | KNN | LR | RFC | Z-L | Z-R | Z-S | Z-X | Z-D |
---|---|---|---|---|---|---|---|---|---|---|

NR-AR | 0.52 | 0.62 | ^{a}0.59 | 0.4 | 0.56 | 0.50 | 0.62 | 0.43 | 0.60 | ^{b}0.68 |

NR-AR-LBD | 0.57 | 0.63 | 0.61 | 0.48 | ^{a}0.62 | 0.60 | 0.71 | 0.60 | ^{b}0.73 | 0.72 |

NR-AhR | 0.44 | 0.47 | ^{a}0.45 | 0.44 | 0.43 | 0.52 | ^{b}0.61 | 0.47 | 0.54 | 0.59 |

NR-Aromatase | 0.29 | 0.35 | ^{a}0.32 | 0.25 | 0.29 | 0.28 | ^{b}0.52 | 0.32 | 0.50 | 0.48 |

NR-ER | 0.29 | 0.34 | ^{a}0.33 | 0.24 | 0.29 | 0.37 | 0.42 | 0.32 | 0.40 | ^{b}0.44 |

NR-ER-LBD | 0.35 | 0.47 | 0.37 | 0.26 | ^{a}0.42 | 0.45 | 0.56 | 0.36 | ^{b}0.59 | 0.58 |

NR-PPAR-gamma | 0.18 | 0.26 | 0.14 | 0.18 | ^{a}0.22 | 0.32 | 0.50 | 0.30 | ^{b}0.52 | 0.47 |

SR-ARE | 0.28 | 0.36 | 0.25 | ^{a}0.31 | 0.29 | 0.46 | 0.49 | 0.36 | 0.46 | 0.48 |

SR-ATAD5 | 0.24 | 0.26 | ^{a}0.25 | 0.22 | 0.24 | 0.37 | ^{b}0.59 | 0.36 | 0.53 | 0.55 |

SR-HSE | 0.18 | 0.25 | 0.15 | 0.18 | ^{a}0.20 | 0.31 | ^{b}0.37 | 0.21 | 0.40 | ^{b}0.37 |

SR-MMP | 0.44 | 0.47 | 0.44 | ^{a}0.47 | 0.43 | 0.63 | ^{b}0.65 | 0.54 | 0.64 | 0.63 |

SR-p53 | 0.22 | 0.26 | 0.21 | ^{a}0.24 | 0.23 | 0.42 | ^{b}0.57 | 0.37 | 0.52 | 0.55 |

**Table 2.**Comparison of external and internal embeddings for PCA, UMAP, and VAE. Each cell represents the mean MCC score across nine different machine learning models (three random states × three classifiers). Values marked with an asterisk (*) highlight cases where, on average, models trained using external knowledge outperformed models trained on internal knowledge. Additionally, results marked with a quotation mark (‘) highlight cases where using external or internal knowledge yielded equal results (when rounded off to two decimal places).

Label | PCA | UMAP | VAE | |||
---|---|---|---|---|---|---|

IN | EX | IN | EX | IN | EX | |

NR-AR | 0.45 | 0.43 | ‘ 0.47 | ‘ 0.47 | 0.45 | 0.44 |

NR-AR-LBD | 0.45 | 0.43 | * 0.45 | * 0.53 | ‘ 0.43 | ‘ 0.43 |

NR-AhR | ‘ 0.33 | ‘ 0.33 | * 0.34 | * 0.35 | ‘ 0.34 | ‘ 0.34 |

NR-Aromatase | 0.22 | 0.18 | 0.18 | 0.15 | ‘ 0.21 | ‘ 0.21 |

NR-ER | 0.22 | 0.21 | ‘ 0.23 | ‘ 0.23 | ‘ 0.27 | 0.24 |

NR-ER-LBD | 0.31 | 0.26 | ‘ 0.26 | ‘ 0.26 | ‘ 0.28 | ‘ 0.28 |

NR-PPAR-gamma | ‘ 0.14 | ‘ 0.14 | * 0.09 | * 0.11 | 0.11 | 0.09 |

SR-ARE | * 0.19 | * 0.21 | * 0.19 | * 0.2 | * 0.19 | * 0.2 |

SR-ATAD5 | 0.16 | 0.13 | * 0.12 | * 0.16 | * 0.14 | * 0.15 |

SR-HSE | * 0.09 | * 0.11 | ‘ 0.08 | ‘ 0.08 | * 0.07 | * 0.1 |

SR-MMP | ‘ 0.36 | ‘ 0.36 | ‘ 0.32 | ‘ 0.32 | * 0.35 | * 0.36 |

SR-p53 | 0.19 | 0.18 | 0.16 | 0.14 | * 0.15 | * 0.17 |

**Table 3.**Classification results across all data sets and labels expressed by maximum values of the MCC. The fingerprint-based model maxima (FPR-BL) were set as 100%, while the embedding models referred to these 100%. Results assigned with an asterisk (*) outperformed baseline.

Label (endpoint) | FPR-BL | PCA-EX | PCA-IN | UMAP-EX | UMAP-IN | VAE-EX | VAE-IN |
---|---|---|---|---|---|---|---|

NR-AR | 100 | 95 | 99 | 96 | 96 | 97 | * 100 |

NR-AR-LBD | 100 | 92 | 98 | * 100 | 97 | 98 | * 102 |

NR-AhR | 100 | 84 | 85 | 90 | 86 | 83 | 85 |

NR-Aromatase | 100 | 65 | 82 | 75 | 74 | 84 | 75 |

NR-ER | 100 | 83 | 85 | 86 | 95 | * 103 | * 101 |

NR-ER-LBD | 100 | 70 | 90 | 79 | 88 | 90 | 83 |

NR-PPAR-gamma | 100 | 81 | 82 | 68 | 81 | 78 | 75 |

SR-ARE | 100 | 74 | 69 | 70 | 69 | 63 | 63 |

SR-ATAD5 | 100 | 63 | 99 | 92 | 84 | 75 | 88 |

SR-HSE | 100 | 82 | 77 | 56 | 90 | 65 | 55 |

SR-MMP | 100 | 92 | 87 | 83 | 83 | 93 | 89 |

SR-p53 | 100 | 99 | 95 | 79 | 87 | 85 | 93 |

**Table 4.**Correlation of average classification results of the embedded classifiers (PCA-EX, UMAP-EX. VAE-EX) with the imbalance ratio (Pos class %) and baseline fingerprints classifiers (FPR-BL) with their respective silhouette coefficients—s(PCA), s(UMAP), and s(VAE).

PCA-EX | UMAP-EX | VAE-EX | |
---|---|---|---|

s(PCA) | 0.74 | ||

s(UMAP) | 0.86 | ||

s(VAE) | 0.85 | ||

Pos class % | 0.11 | 0.02 | 0.13 |

FPR-BL | 0.98 | 0.98 | 0.99 |

**Table 5.**Elements of the confusion matrix that show the possible outcomes when predicting labels in Tox21.

Experimental/Model | Positive (Model) (1) | Negative (Model) (0) |
---|---|---|

Positive (Experimental) (1) | TP (experimentally active and predicted active) | FN (experimentally active, but predicted as inactive) |

Negative (Experimental) (0) | FP (experimentally inactive, but predicted as active) | TN (inactive experimentally and predicted) |

Predictive Variables | Classifier | Seed | Embedder | Emb. Dim. | CS1 Data Size | N Models |
---|---|---|---|---|---|---|

Fingerprints (raw data) | RFC, KNN, LR | 1–3 | N/A | N/A | N/A | 144 |

Internal emb. | RFC, KNN, LR | 1–3 | PCA, UMAP, VAE | 2–15 | N/A | 9072 |

External emb. | RFC, KNN, LR | 1–3 | PCA, UMAP, VAE | 2–15 | 200–30,000 | 9072 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lovrić, M.; Đuričić, T.; Tran, H.T.N.; Hussain, H.; Lacić, E.; Rasmussen, M.A.; Kern, R.
Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints. *Pharmaceuticals* **2021**, *14*, 758.
https://doi.org/10.3390/ph14080758

**AMA Style**

Lovrić M, Đuričić T, Tran HTN, Hussain H, Lacić E, Rasmussen MA, Kern R.
Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints. *Pharmaceuticals*. 2021; 14(8):758.
https://doi.org/10.3390/ph14080758

**Chicago/Turabian Style**

Lovrić, Mario, Tomislav Đuričić, Han T. N. Tran, Hussain Hussain, Emanuel Lacić, Morten A. Rasmussen, and Roman Kern.
2021. "Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints" *Pharmaceuticals* 14, no. 8: 758.
https://doi.org/10.3390/ph14080758