# Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders


## Abstract


## 1. Introduction

## 2. Results

#### 2.1. GDB-8 Dataset Based Models

#### 2.1.1. Molecular and Sequence Similarity

#### 2.1.2. Error Analysis

#### 2.1.3. Enumeration Challenge

#### 2.1.4. Sampling Using Probability Distribution

#### 2.2. QSAR Modelling Using ChEMBL Trained Heteroencoders

## 3. Discussion

…R${}^{2}$ of 0.92 and a standard deviation of prediction of the test set of 0.6 [16]. Likewise, a carefully crafted QSAR model of BCF obtained an R${}^{2}$ of 0.73 and an RMSE of 0.69 [17], which is on par with our model using the can2enum derived latent vectors. However, a later benchmark showed better performance for the CORAL software for prediction of BCF (R${}^{2}$: 0.76, RMSE: 0.64) [18], suggesting that further improvements are possible.

## 4. Materials and Methods

#### 4.1. Datasets

#### 4.1.1. GDB-8

#### 4.1.2. ChEMBL23

#### 4.1.3. QSAR Datasets

#### 4.2. 1D and 2D Vectorization

#### 4.3. Neural Network Modeling for GDB-8 Dataset

#### 4.4. Similarity Metrics

#### 4.5. Enumeration Challenge

#### 4.6. Error Analysis of Output

#### 4.7. Multinomial Sampling of Decoder

#### 4.8. Neural Network Modelling for the ChEMBL Dataset

#### 4.9. QSAR Modelling

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| Conv2D | 2D convolutional layer |
| ECFP4 | Extended connectivity fingerprint with 4 bonds |
| GRU | Gated recurrent unit |
| LSTM | Long short-term memory |
| QSAR | Quantitative structure-activity relationship |
| ReLU | Rectified linear unit |
| RMSE | Root mean square error |
| RNN | Recurrent neural network |
| SMILES | Simplified molecular-input line-entry system |

## References

1. Gómez-Bombarelli, R.; Duvenaud, D.; Hernández-Lobato, J.M.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Adams, R.P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. arXiv 2016, arXiv:1610.02415.
2. Blaschke, T.; Olivecrona, M.; Engkvist, O.; Bajorath, J.; Chen, H. Application of Generative Autoencoder in De Novo Molecular Design. Mol. Inform. 2017, 37, 1700123.
3. Xu, Z.; Wang, S.; Zhu, F.; Huang, J. Seq2Seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB ’17), Boston, MA, USA, 20–23 August 2017; ACM: New York, NY, USA, 2017; pp. 285–294.
4. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
5. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
6. Chen, H.; Kogej, T.; Engkvist, O. Cheminformatics in Drug Discovery, an Industrial Perspective. Mol. Inform. 2018, 37, e1800041.
7. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
8. Bjerrum, E.J. SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. arXiv 2017, arXiv:1703.07076.
9. Li, Y.; Zhang, L.; Liu, Z. Multi-Objective De Novo Drug Design with Conditional Graph Generative Model. arXiv 2018, arXiv:1801.07299.
10. Goh, G.B.; Siegel, C.; Vishnu, A.; Hodas, N.O.; Baker, N. Chemception: A deep neural network with minimal chemistry knowledge matches the performance of expert-developed QSAR/QSPR models. arXiv 2017, arXiv:1706.06689.
11. Landrum, G.A. RDKit: Open-Source Cheminformatics Software. Available online: http://www.rdkit.org/ (accessed on 1 July 2018).
12. Bjerrum, E.J.; Threlfall, R. Molecular Generation with Recurrent Neural Networks (RNNs). arXiv 2017, arXiv:1705.04612.
13. Duvenaud, D.K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional networks on graphs for learning molecular fingerprints. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2224–2232.
14. Kearnes, S.; McCloskey, K.; Berndl, M.; Pande, V.; Riley, P. Molecular graph convolutions: Moving beyond fingerprints. J. Comput.-Aided Mol. Des. 2016, 30, 595–608.
15. Winter, R.; Montanari, F.; Noé, F.; Clevert, D.A. Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations. ChemRxiv 2018.
16. Huuskonen, J. Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. J. Chem. Inf. Comput. Sci. 2000, 40, 773–777.
17. Gissi, A.; Gadaleta, D.; Floris, M.; Olla, S.; Carotti, A.; Novellino, E.; Benfenati, E.; Nicolotti, O. An alternative QSAR-based approach for predicting the bioconcentration factor for regulatory purposes. ALTEX-Altern. Anim. Exp. 2014, 31, 23–36.
18. Gissi, A.; Lombardo, A.; Roncaglioni, A.; Gadaleta, D.; Mangiatordi, G.F.; Nicolotti, O.; Benfenati, E. Evaluation and comparison of benchmark QSAR models to predict a relevant REACH endpoint: The bioconcentration factor (BCF). Environ. Res. 2015, 137, 398–409.
19. Open Science Data Repository. Features Computation Beta. Available online: http://ssp.dataledger.io/features (accessed on 1 September 2018).
20. Polishchuk, P.G.; Madzhidov, T.I.; Varnek, A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput.-Aided Mol. Des. 2013, 27, 675–679.
21. Ruddigkeit, L.; Van Deursen, R.; Blum, L.C.; Reymond, J.L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875.
22. Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A.P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L.J.; Cibrián-Uhalte, E.; et al. The ChEMBL database in 2017. Nucleic Acids Res. 2017, 45, D945–D954.
23. EPI Suite Data. Available online: http://esc.syrres.com/interkow/EpiSuiteData.htm (accessed on 1 July 2018).
24. EPA U. Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11; United States Environmental Protection Agency: Washington, DC, USA, 2018.
25. Open Science Data Repository. Features Computation Beta. Available online: https://ssp.dataledger.io/file/00120000-ac12-0242-bcea-08d5f0abb793 (accessed on 1 September 2018).
26. Arnot, J.A.; Gobas, F.A. A review of bioconcentration factor (BCF) and bioaccumulation factor (BAF) assessments for organic chemicals in aquatic organisms. Environ. Rev. 2006, 14, 257–297.
27. Schultz, T.W. Tetratox: Tetrahymena pyriformis population growth impairment endpoint: A surrogate for fish lethality. Toxicol. Methods 1997, 7, 289–309.
28. ChemIDplus Database. Available online: http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp (accessed on 1 July 2018).
29. Mentel, L. Mendeleev: A Python Resource for Properties of Chemical Elements, Ions and Isotopes. 2014. Available online: https://bitbucket.org/lukaszmentel/mendeleev (accessed on 1 July 2018).
30. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
31. Van der Walt, S.; Colbert, S.C.; Varoquaux, G. The NumPy array: A structure for efficient numerical computation. Comput. Sci. Eng. 2011, 13, 22–30.
32. Chollet, F. Keras. Available online: https://github.com/fchollet/keras (accessed on 17 September 2018).
33. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv 2016, arXiv:1603.04467.
34. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
35. Williams, R.J.; Zipser, D. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Comput. 1989, 1, 270–280.
36. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
37. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
38. Cock, P.J.A.; Antao, T.; Chang, J.T.; Chapman, B.A.; Cox, C.J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25, 1422–1423.
39. Bemis, G.W.; Murcko, M.A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887–2893.
40. Van Rossum, G.; Drake, F.L., Jr. Python Reference Manual; Centrum voor Wiskunde en Informatica: Amsterdam, The Netherlands, 1995.
41. Open Science Data Repository. Available online: http://osdr.dataledger.io/ (accessed on 10 September 2018).
42. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011.
43. Hyperopt: Distributed Asynchronous Hyper-Parameter Optimization. Available online: https://github.com/hyperopt/hyperopt (accessed on 1 July 2018).

**Figure 1.** Enumeration challenge of a sequence-to-sequence model trained on canonical SMILES (simplified molecular-input line-entry system). The non-canonical SMILES of the same molecule are projected to different parts of the latent space, reduced to two dimensions with principal components analysis (PCA). The small blue dots are the test set used for fitting the PCA. Some clustering of the enumerated SMILES can be observed.

**Figure 2.** Chemical heteroencoders are similar to autoencoders but translate from one representation of the molecule to another. The molecule toluene can be represented as a canonical SMILES string, as one of several enumerated SMILES, or via a 2D embedding. The autoencoder converts the canonical SMILES string to the latent space and back again (blue arrow), whereas many more possibilities exist for heteroencoders (green arrows).
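As a concrete illustration of the SMILES-based input such encoders consume, the sketch below one-hot encodes a SMILES string character by character. The character set, the start/end/padding tokens and the maximum length are illustrative assumptions, not the paper's exact vocabulary, which is derived from the training dataset:

```python
import numpy as np

# Hypothetical character set with start "^", end "$" and padding " " tokens.
CHARSET = ["^", "$", " ", "C", "c", "1", "(", ")", "O", "N", "="]
CHAR_TO_IDX = {c: i for i, c in enumerate(CHARSET)}

def vectorize(smiles, max_len=20):
    """One-hot encode a SMILES string into a (max_len, charset size) matrix."""
    padded = ("^" + smiles + "$").ljust(max_len)
    onehot = np.zeros((max_len, len(CHARSET)), dtype=np.float32)
    for i, char in enumerate(padded):
        onehot[i, CHAR_TO_IDX[char]] = 1.0
    return onehot

x = vectorize("Cc1ccccc1")  # toluene, the example molecule of Figure 2
```

Each row of the resulting matrix is the one-hot vector for one character, which is the sequence format the recurrent encoder reads step by step.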

**Figure 3.** Examples of optimal SMILES alignments of a molecule with two other molecules. The score is +1 for a character match, $-1$ for a mismatch, $-0.5$ for a gap opening and $-0.05$ for a gap extension. Gaps are shown with dashes, “-”, and are not SMILES single bonds.
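The scoring scheme in the caption corresponds to a global alignment with affine gap penalties. Below is a minimal sketch of the score computation (Gotoh's algorithm), assuming the Biopython-style convention that the opening penalty is charged on the first gapped position and the extension penalty on each subsequent one:

```python
NEG = float("-inf")

def align_score(a, b, match=1.0, mismatch=-1.0, gap_open=-0.5, gap_extend=-0.05):
    """Global alignment score with affine gaps (Gotoh's algorithm).

    gap_open is charged for the first position of a gap and gap_extend
    for every further position, matching the scoring in Figure 3.
    """
    n, m = len(a), len(b)
    # Three dynamic-programming states:
    # M: a[i] aligned to b[j]; X: gap in b; Y: gap in a
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

Biopython's `pairwise2.align.globalms(a, b, 1, -1, -0.5, -0.05)` uses the same scoring parameters and additionally returns the alignments themselves.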

**Figure 4.** Scatter plot of the latent space similarities and the alignment scores of the SMILES strings.

**Figure 6.** Molecules similar in latent space using the can2can model. The reference molecule is in the upper left corner and similarity drops row-wise in normal reading direction.

**Figure 7.** Molecules similar in latent space using the can2enum model. The reference molecule is in the upper left corner and similarity drops row-wise in normal reading direction.

**Figure 8.** Venn diagram of the errors encountered during molecule reconstruction of 1000 molecules for the GDB-8 can2enum model.

**Figure 9.** SMILES enumeration challenge of the GDB-8 dataset based Enum2Can and Can2Enum encoders. The same three molecules were encoded from 10 enumerated SMILES and projected to the latent space reduced to two dimensions with principal components analysis (PCA). Using enumerated SMILES for training of the encoder leads to the tightest clustering, but training with enumerated SMILES in the decoder also improves the clustering (c.f. Figure 1). Small blue dots are the test set used for the PCA reduction.

**Figure 10.** Multinomial sampling of the decoder for two different models, illustrated with heat maps of the character probability for each step during decoding of the latent space. (**A**) The can2can model is very certain at each step and samples the same canonical SMILES each time; (**B**) the can2enum model has more possibilities at each step in the beginning. The probability heatmap and sampled SMILES will be different for each sampling run, depending on which character is chosen from the probability distribution at each step.
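Multinomial sampling draws each character from the softmax distribution over the decoder's output instead of always taking the most probable character. A minimal numpy sketch (the function name and the use of raw logits are illustrative assumptions; temperature = 1.0 reproduces the sampling regime of Table 2):

```python
import numpy as np

def sample_char(logits, temperature=1.0, rng=None):
    """Draw one character index from the decoder's output distribution.

    temperature < 1 sharpens the distribution towards the most likely
    character; temperature = 1 leaves the softmax probabilities unchanged.
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Decoding repeats this draw step by step, feeding each sampled character back into the decoder, which is why a can2enum model can emit a different SMILES string on every run.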

**Figure 11.** Examples of different sampled molecules using multinomial sampling with the decoder from the two-layer LSTM model (enum2enum 2-layer). The molecule in the upper-left corner is the reference molecule used to encode the latent space coordinates.

**Figure 12.** The steps used in the modelling of the QSAR dataset. In step 1, the auto-/heteroencoder is trained on a large unlabelled dataset of molecules from ChEMBL. After training, the encoder part is extracted in step 2 and used to encode the molecules from the QSAR datasets into their latent vectors in step 3. In step 4, a separate standard feed-forward neural network is used to build QSAR models from the training sets, which are subsequently tested with the held-out test set.
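Steps 3 and 4 of this pipeline can be sketched with synthetic data. Here, plain ridge regression stands in for the paper's hyperparameter-tuned feed-forward network, and the latent vectors and endpoint values are randomly generated rather than encoder-derived, so only the shape of the workflow is faithful:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 3 (stand-in): latent vectors as a trained encoder would produce them.
# Synthetic here: 200 "molecules" in a 16-dimensional latent space.
Z = rng.normal(size=(200, 16))
w_true = rng.normal(size=16)
y = Z @ w_true + 0.01 * rng.normal(size=200)  # hypothetical endpoint values

# Step 4 (stand-in): ridge regression instead of the tuned feed-forward net,
# trained on the first 150 molecules and tested on the held-out 50.
Z_tr, Z_te, y_tr, y_te = Z[:150], Z[150:], y[:150], y[150:]
w = np.linalg.solve(Z_tr.T @ Z_tr + 0.1 * np.eye(16), Z_tr.T @ y_tr)
pred = Z_te @ w

rmse = float(np.sqrt(np.mean((pred - y_te) ** 2)))
r2 = float(1 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2))
```

Note that this sketch computes the coefficient of determination; the paper reports the squared correlation coefficient, which coincides with it only when the predictions are unbiased.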

**Table 1.** Properties of the models trained on different input and output representations of the GDB-8 dataset. All values were calculated using the test dataset. Strings in the simplified molecular-input line-entry system (SMILES) notation were considered malformed if they could not be parsed to molecules by RDKit [11].

| Model | Loss | % Malformed SMILES | % Wrong Molecule | R${}^{2}$ Fingerprint Metric | R${}^{2}$ Sequence Metric |
|---|---|---|---|---|---|
| Can2Can | 0.0005 | 0.1 | 0.0 | 0.24 | 0.58 |
| Img2Can | 0.02 | 0.0 | 8.0 | 0.05 | 0.18 |
| Enum2Can | 0.03 | 1.0 | 17.1 | 0.37 | 0.53 |
| Can2Enum | 0.18 | 1.7 | 50.3 | 0.58 | 0.55 |
| Enum2Enum | 0.21 | 2.2 | 66.8 | 0.49 | 0.40 |
| Enum2Enum 2-layer | 0.13 | 0.3 | 14.7 | 0.45 | 0.55 |

**Table 2.** Statistics on molecule generation with multinomial sampling at t = 1.0 and n = 1000 for the GDB-8 dataset based models.

| | Can2Can | Can2Enum | Enum2Enum 2-Layer |
|---|---|---|---|
| Unique SMILES | 1 | 315 | 111 |
| % Correct Mol | 100 | 20 | 57 |
| Unique SMILES for correct Mol | 1 | 34 | 42 |
| Unique Molecules | 1 | 88 | 17 |
| Average Fingerprint Similarity | 1.0 | 0.27 | 0.32 |

**Table 3.** Reconstruction performance on the ChEMBL datasets of the different encoder/decoder configurations.

| ChEMBL Model | Invalid SMILES (%) | SMILES Different from Input (%) | Wrong Molecules (%) |
|---|---|---|---|
| Can2Can | 0.2 | 0.3 | 0.1 |
| Enum2Can | 9.3 | 42.5 | 36.6 |
| Can2Enum | 9.3 | 99.9 | 65.6 |
| Enum2Enum | 6.7 | 100 | 69.9 |

**Table 4.** Performance of the QSAR models on the held-out test set for different input data. The best performance for each metric and dataset is highlighted in bold. R${}^{2}$ is the squared correlation coefficient (closer to one is better), RMSE is the root mean square error of prediction on the test set (lower is better).

| Input Type | IGC50 R${}^{2}$ | IGC50 RMSE | LD50 R${}^{2}$ | LD50 RMSE | BCF R${}^{2}$ | BCF RMSE | Solubility R${}^{2}$ | Solubility RMSE | MP R${}^{2}$ | MP RMSE | Average R${}^{2}$ | Average RMSE * |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Enum2Enum | **0.81** | **0.43** | **0.68** | **0.54** | 0.73 | 0.71 | **0.90** | **0.65** | 0.86 | **37** | **0.80** | **0.75** |
| Can2Enum | 0.78 | 0.46 | **0.68** | **0.54** | **0.74** | **0.69** | 0.89 | 0.69 | 0.86 | **37** | 0.79 | 0.77 |
| Enum2Can | 0.78 | 0.46 | 0.65 | 0.57 | 0.73 | 0.71 | **0.90** | 0.66 | **0.87** | 38 | 0.78 | 0.78 |
| Can2Can | 0.71 | 0.53 | 0.59 | 0.62 | 0.66 | 0.79 | 0.82 | 0.87 | 0.82 | 43 | 0.72 | 0.89 |
| ECFP4 | 0.60 | 0.62 | 0.62 | 0.59 | 0.53 | 0.94 | 0.65 | 1.21 | 0.82 | 43 | 0.64 | 1.00 |

| Label | Endpoint | Endpoint Values Span | Number of Molecules |
|---|---|---|---|
| BCF | Bioconcentration factor, the logarithm of the ratio of the concentration in biota to its concentration in the surrounding medium (water) [26] | −1.7 to 5.7 | 541 |
| IGC50 | Tetrahymena pyriformis 50% growth inhibition concentration (g/L) [27] | 0.3 to 6.4 | 1434 |
| LD50 | Lethal dose 50% in rats (mg/kg body weight) [28] | 0.5 to 7.1 | 5931 |
| MP | Melting point of solids at normal atmospheric pressure [23] | −196 to 493 | 7509 |
| Solubility | Log water solubility (mol/L) [16] | −11.6 to 1.6 | 1297 |

| Hyper Parameter | Search Space |
|---|---|
| Input dropout | 0.0–0.95 |
| Units per layer | 2–1024 |
| Kernel regularizer (L2) | 0.000001–0.1 |
| Kernel constraint (maxnorm) | 0.5–6 |
| Kernel initializer | lecun_uniform, glorot_uniform, he_uniform, lecun_normal, glorot_normal, he_normal |
| Batch normalization | Yes (after each activation), No |
| Activation function | ReLU, SeLU |
| Dropout | 0.0–0.95 |
| Number of hidden layers | 1–6 |
| Learning rate | 0.00001–0.1 |
| Optimizer | Adam, Nadam, RMSprop, SGD |
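The references list hyperopt for searching such a space with the TPE algorithm; as a simple stand-in, the sketch below draws one random configuration from the table above. The key names, and the choice of log-uniform sampling for the regularizer and learning rate, are illustrative assumptions:

```python
import random

def sample_config(rng=random):
    """Draw one random hyperparameter configuration from the search space."""
    return {
        "input_dropout": rng.uniform(0.0, 0.95),
        "units": rng.randint(2, 1024),
        "l2": 10 ** rng.uniform(-6, -1),          # log-uniform over 1e-6..0.1
        "maxnorm": rng.uniform(0.5, 6),
        "initializer": rng.choice(["lecun_uniform", "glorot_uniform", "he_uniform",
                                   "lecun_normal", "glorot_normal", "he_normal"]),
        "batch_norm": rng.choice([True, False]),
        "activation": rng.choice(["relu", "selu"]),
        "dropout": rng.uniform(0.0, 0.95),
        "hidden_layers": rng.randint(1, 6),
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform over 1e-5..0.1
        "optimizer": rng.choice(["adam", "nadam", "rmsprop", "sgd"]),
    }

cfg = sample_config()
```

Random search like this is only the simplest baseline; hyperopt's TPE sampler instead biases new draws towards regions of the space that scored well previously.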

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bjerrum, E.J.; Sattarov, B. Improving Chemical Autoencoder Latent Space and Molecular *De Novo* Generation Diversity with Heteroencoders. *Biomolecules* **2018**, *8*, 131.
https://doi.org/10.3390/biom8040131
