Open Access Article

Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders

1
Wildcard Pharmaceutical Consulting, Zeaborg Science Center, Frødings Allé 41, 2860 Søborg, Denmark
2
Science Data Software LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
*
Author to whom correspondence should be addressed.
Biomolecules 2018, 8(4), 131; https://doi.org/10.3390/biom8040131
Received: 23 September 2018 / Revised: 22 October 2018 / Accepted: 23 October 2018 / Published: 30 October 2018
(This article belongs to the Special Issue Machine Learning for Molecular Modelling in Drug Design)
Chemical autoencoders are attractive models because they combine chemical space navigation with the possibility of de novo molecule generation in areas of interest. This enables them to produce focused chemical libraries around a single lead compound for use early in a drug discovery project. Here, it is shown that the choice of chemical representation, such as strings from the simplified molecular-input line-entry system (SMILES), has a large influence on the properties of the latent space. It is further explored to what extent translating between different chemical representations influences the similarity of the latent space to the SMILES strings or to circular fingerprints. By employing SMILES enumeration for either the encoder or the decoder, it is found that the decoder has the largest influence on the properties of the latent space. Training a sequence-to-sequence heteroencoder based on recurrent neural networks (RNNs) with long short-term memory (LSTM) cells to predict different enumerated SMILES strings from the same canonical SMILES string gives the closest agreement between latent space distance and molecular similarity measured as circular fingerprint similarity. Using the output from the code layer in quantitative structure-activity relationship (QSAR) modelling of five molecular datasets shows that heteroencoder-derived vectors markedly outperform autoencoder-derived vectors as well as models built using ECFP4 fingerprints, underlining the increased chemical relevance of the latent space. However, the use of enumeration during training of the decoder leads to a marked increase in the rate of decoding to molecules different from the one encoded, a tendency that can be counteracted with more complex network architectures.
Keywords: deep learning; RNN; LSTM; de novo molecule design; molecular autoencoders; molecular heteroencoders; molecular data augmentation
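
The abstract compares distances in the latent space with molecular similarity measured on circular fingerprints. The following is a minimal sketch of how such a comparison could be set up, not the authors' procedure: the abstract does not state which distance or correlation measure was used, so Euclidean distance, Tanimoto similarity on sets of on-bits, and the Pearson coefficient are assumptions made here. The latent vectors and fingerprint bit sets are hypothetical toy values, and the function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def latent_vs_fingerprint_correlation(latent, fingerprints):
    """Correlate pairwise latent distances with pairwise fingerprint similarities.

    A strongly negative value means molecules that are close in the latent
    space also tend to have similar fingerprints.
    """
    pairs = list(combinations(range(len(latent)), 2))
    dists = [euclidean(latent[i], latent[j]) for i, j in pairs]
    sims = [tanimoto(fingerprints[i], fingerprints[j]) for i, j in pairs]
    return pearson(dists, sims)


# Hypothetical toy data: four molecules, 2-D latent vectors (assumed values)
# and fingerprints given as sets of on-bit indices (assumed values).
latent = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 0.9)]
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9, 10}, {7, 8, 9, 11}]

r = latent_vs_fingerprint_correlation(latent, fingerprints)
print(f"correlation between latent distance and fingerprint similarity: {r:.3f}")
```

In practice the latent vectors would come from the encoder's code layer and the bit sets from a circular fingerprint such as ECFP4; the toy values above only illustrate the bookkeeping.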
MDPI and ACS Style

Bjerrum, E.J.; Sattarov, B. Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders. Biomolecules 2018, 8, 131.
