Open Access Article
Biomolecules 2018, 8(4), 131; https://doi.org/10.3390/biom8040131

Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders

1 Wildcard Pharmaceutical Consulting, Zeaborg Science Center, Frødings Allé 41, 2860 Søborg, Denmark
2 Science Data Software LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
* Author to whom correspondence should be addressed.
Received: 23 September 2018 / Revised: 22 October 2018 / Accepted: 23 October 2018 / Published: 30 October 2018
(This article belongs to the Special Issue Machine Learning for Molecular Modelling in Drug Design)

Abstract

Chemical autoencoders are attractive models as they combine chemical space navigation with possibilities for de novo molecule generation in areas of interest. This enables them to produce focused chemical libraries around a single lead compound for use early in a drug discovery project. Here, it is shown that the choice of chemical representation, such as strings from the simplified molecular-input line-entry system (SMILES), has a large influence on the properties of the latent space. It is further explored to what extent translating between different chemical representations influences the similarity of the latent space to the SMILES strings or to circular fingerprints. By employing SMILES enumeration for either the encoder or the decoder, it is found that the decoder has the largest influence on the properties of the latent space. Training a sequence-to-sequence heteroencoder based on recurrent neural networks (RNNs) with long short-term memory (LSTM) cells to predict different enumerated SMILES strings from the same canonical SMILES string gives the largest similarity between latent space distance and molecular similarity measured as circular fingerprint similarity. Using the output from the code layer in quantitative structure-activity relationship (QSAR) models of five molecular datasets shows that heteroencoder-derived vectors markedly outperform autoencoder-derived vectors as well as models built using ECFP4 fingerprints, underlining the increased chemical relevance of the latent space. However, the use of enumeration during training of the decoder leads to a marked increase in the rate of decoding to molecules different from the one encoded, a tendency that can be counteracted with more complex network architectures.
Keywords: deep learning; RNN; LSTM; de novo molecule design; molecular autoencoders; molecular heteroencoders; molecular data augmentation
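
The abstract compares latent space distance with molecular similarity measured on circular fingerprints. The following is a minimal sketch of how such a comparison could be set up, not the authors' implementation. It assumes Euclidean distance in the latent space, Tanimoto similarity on sets of on-bit indices, and a Pearson correlation over all molecule pairs; none of these choices is stated in the abstract. The function names and the toy data are hypothetical and purely illustrative.

```python
from itertools import combinations
from math import dist, sqrt


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 1.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def distance_similarity_correlation(latent_vectors, fingerprints):
    """Correlate pairwise latent distances with pairwise fingerprint similarities."""
    pairs = list(combinations(range(len(latent_vectors)), 2))
    distances = [dist(latent_vectors[i], latent_vectors[j]) for i, j in pairs]
    similarities = [tanimoto(fingerprints[i], fingerprints[j]) for i, j in pairs]
    return pearson(distances, similarities)


# Hypothetical toy data (assumption, not from the paper): three molecules,
# each with a 2-D latent vector and a small set of fingerprint on-bits.
toy_latent = [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0)]
toy_fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]

r = distance_similarity_correlation(toy_latent, toy_fingerprints)
print(f"Pearson r between latent distance and Tanimoto similarity: {r:.3f}")
```

Under these assumptions, a strongly negative correlation would mean that molecules close together in the latent space also tend to have similar fingerprints, which is the kind of relationship the abstract reports as strongest for the heteroencoder trained on enumerated SMILES.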

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).


MDPI and ACS Style

Bjerrum, E.J.; Sattarov, B. Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders. Biomolecules 2018, 8, 131.

Biomolecules, EISSN 2218-273X, published by MDPI AG, Basel, Switzerland.