Accurate Physical Property Predictions via Deep Learning

Neural networks and deep learning have been successfully applied to tackle problems in drug discovery with increasing accuracy over time. There are still many challenges and opportunities to improve molecular property predictions with satisfactory accuracy even further. Here, we proposed a deep-learning architecture model, namely Bidirectional long short-term memory with Channel and Spatial Attention network (BCSA), of which the training process is fully data-driven and end to end. It is based on data augmentation and SMILES tokenization technology without relying on auxiliary knowledge, such as complex spatial structure. In addition, our model takes the advantages of the long- and short-term memory network (LSTM) in sequence processing. The embedded channel and spatial attention modules in turn specifically identify the prime factors in the SMILES sequence for predicting properties. The model was further improved by Bayesian optimization. In this work, we demonstrate that the trained BSCA model is capable of predicting aqueous solubility. Furthermore, our proposed method shows noticeable superiorities and competitiveness in predicting oil–water partition coefficient, when compared with state-of-the-art graphs models, including graph convoluted network (GCN), message-passing neural network (MPNN), and AttentiveFP.


Introduction
Accurate prediction of molecular properties would offer reliable guidance in profiling lead compounds in the drug-discovery process. The traditional drug design workflows, often biased by the experiences of chemists, rely on time-consuming and expensive simulations and experiments to acquire the relevant molecular properties [1]. Launching a novel drug into the market would take more than ten years on average, with a substantial investment of billions of USD [2]. Meanwhile, deep learning shows great success in other fields, such as natural-language processing [3][4][5][6][7][8][9] and pattern recognition [10][11][12][13], as well as the improvement of computing power and dataset availability. Its potential in promoting efficiency and success rate of drug development, in particular the prediction of molecular properties, has been widely investigated for years [14][15][16][17][18]. The current mainstream algorithms for molecular characterization can be simply divided into two categories-a graph model based on molecular graphs, or a sequence model based on SMILES (Simplified Molecular-Input Line-Entry System) [19] sequence input. A molecular graph is a two-dimensional representation of a chemical molecule, and accounts for its topo-structural features and atom connectivity (e.g., adjacency). Graph-based learning methods have been widely developed in the field of drug development [17,[20][21][22][23][24][25]. Various graph neural network (GNN) [26] variants have also demonstrated their effectiveness in capturing inter-node relationships through message-passing between graph nodes. For example, Gilmer  neural network (MPNN) in quantum chemistry calculations [20]. Schütt et al. [25] proposed a Continuous-Filter Convolutional Neural Network modeling quantum interactions in molecules, and AttentiveFP [21] proposed a new type of GNN with graph attention mechanism suitable for molecular characterization. The latter has the best prediction expression on a drug-discovery-related dataset. GNNs are therefore considered to be an attractive modeling method for molecular property prediction.
The SMILES characterization is very popular among chemists and machine-learning researchers. It has been widely used in learning-based analyses in drug discovery [14,[27][28][29]. Segler et al. [27] demonstrated that recurrent neural network (RNN) trained on molecular SMILES strings would be able to learn the grammar of language as well as the distribution of chemical space. Analogous to a text stream, each symbol in SMILES can be regarded as a word. Inspired by the word2vec [3] technology in natural-language processing, the smiles2vec [14] technology similarly processes the SMILES sequences, in which specific combinations of SMILES elements are transformed into alternative representations via a pretraining method. The new representations are then mapped against a pre-defined dictionary for downstream tasks. A neural network combined with smiles2vec shows superior performance in predicting distinct chemical properties, including toxicity, activity, solubility and solvation energy. Meanwhile, transformer [6] becomes another popular class of neural network in natural-language processing. Notable examples include seq2seq [9] and the Bidirectional Encoder Representations from Transformers (BERT) [7]. Due to the similarity between chemical language and natural language, researchers in the field of molecular characterization have also introduced transformer-based large models plus pre-trained and fine-tuned models, and obtained good results in chemical reaction prediction [30] and other fields.
In this study, we exploit the three advantages of SMILES strings over molecular graphs. First, linear strings are generally more compact than graph formats, and comprise only the crucial information for defining the chemistry of a molecule. Second, a single molecule may have multiple possible SMILES strings. For example, CCO, OCC and C(O)C all specify ethanol. The enumeration of characters in SMILES string can be achieved more easily, when compared to the generation of isomorphous graphs. This characteristic is particularly useful for data augmentation, when a dataset generally contains few samples. Finally, SMILES strings carry characters that explicitly indicate substructures and other topological information, such as branching, looping structures, and chirality [31]. Here, we selected SMILES strings as molecular inputs for deep learning, and constructed a Bidirectional long short-term memory with Channel and Spatial Attention network (BCSA). BCSA is based on Bidirectional Long Short-Term Memory (BILSTM) [32], followed by channel attention and spatial attention modules. Data augmentation of molecular structures was achieved by SMILES enumeration [33], to acquire more tokens as inputs for our models. Generalization error was reduced substantially by taking the average prediction of all enumerated SMILES for a given molecule. Due to its importance among the physiochemical properties of a drug molecule, aqueous solubility prediction has been a subject of intensive studies for many years [15,[34][35][36][37][38][39][40][41][42]. Hence, we applied our model and workflow first in predicting water solubility and later extended into the prediction of oil-water partition coefficient. The prediction performance was compared against three other state-of-the-art graphs models, to evaluate its accuracy, generalizability and reliability.

Training of BCSA Model
We aimed to develop a new deep-learning architecture using SMILES sequence autoencoder and explored the role of predicting molecular solubility and other properties. We trained different datasets including training set (7955), validation set (996) and testing set (995). Then, we performed training on the best hyperparameter (Table 1 using both BILSTM model and our BCSA model. Figure 1A indicated the trend of R 2 when we trained on validation sets with 400 epochs. Figure 1B showed the trend of R 2 when we trained on testing sets with 400 epochs. Moreover, the smoothing parameters of the curve is set to approximately 0.8, to reduce the noise. Figure 1 clearly shows that our model has stronger fitting effect and generalizability than the BILSTM model on both the validation and testing set. BILSTM model and our BCSA model. Figure 1A indicated the trend of when we trained on validation sets with 400 epochs. Figure 1B showed the trend of when we trained on testing sets with 400 epochs. Moreover, the smoothing parameters of the curve is set to approximately 0.8, to reduce the noise. Figure 1 clearly shows that our model has stronger fitting effect and generalizability than the BILSTM model on both the validation and testing set.  A general way to improve a deep-learning model is to increase the size of a dataset. Dataset augmentation offers more training data and becomes a particularly effective technique in the field of images. The diversity of SMILES string would generate more new data points. All SMILES strings used in this work were augmented using the SMILES enumeration (SE) technique. Two models were trained with the dataset size up to 20× and 40× of the original segmented dataset (each molecule has 20 and 40 different SMILES representations respectively). Since simple molecules may have less than 20 distinct SMILES string variants, identical variants may be generated in the augmentation step and these duplicates were removed to avoid potential duplicate bias. The numbers of data points between training, validation and testing for the 20× and 40× datasets were 134,454:19,881: 16,834 and 239,260:30,042:39,800, respectively. The model with the best performance of in the training process was taken forward to the validation sets. Table 2 shows the performance results of test datasets by taking the average predicted value for each molecule obtained from the enumerated SMILES. We found that both the stability and generalization ability of the enumerated model were significantly improved. The best result was achieved in the SMILES × 40 datasets, indicating that the enumerated model better paid attention to the different sequence information of the molecules. In contrast, we achieve significant performance improvement for the Cui datasets which had = 0.72-0.79 with = 0.988-1.151 [15], whereas the test performance of in this work is = 0.83-0.88 with = 0.79-0.95. This model will be further trained to with more dataset to improve the final accuracy. A general way to improve a deep-learning model is to increase the size of a dataset. Dataset augmentation offers more training data and becomes a particularly effective technique in the field of images. The diversity of SMILES string would generate more new data points. All SMILES strings used in this work were augmented using the SMILES enumeration (SE) technique. Two models were trained with the dataset size up to 20× and 40× of the original segmented dataset (each molecule has 20 and 40 different SMILES representations respectively). Since simple molecules may have less than 20 distinct SMILES string variants, identical variants may be generated in the augmentation step and these duplicates were removed to avoid potential duplicate bias. The numbers of data points between training, validation and testing for the 20× and 40× datasets were 134,454:19,881:16,834 and 239,260:30,042:39,800, respectively. The model with the best performance of R 2 in the training process was taken forward to the validation sets. Table 2 shows the performance results of test datasets by taking the average predicted value for each molecule obtained from the enumerated SMILES. We found that both the stability and generalization ability of the enumerated model were significantly improved. The best result was achieved in the SMILES × 40 datasets, indicating that the enumerated model better paid attention to the different sequence information of the molecules. In contrast, we achieve significant performance improvement for the Cui datasets which had R 2 = 0.72-0.79 with RMSE = 0.988-1.151 [15], whereas the test performance of in this work is R 2 = 0.83-0.88 with RMSE = 0.79-0.95. This model will be further trained to with more dataset to improve the final accuracy.

Compare with State-of-the-Art Models
To better demonstrate the competitiveness of our sequence-based model, its performance in solubility prediction was compared with other state-of-the-art graph models, including GCN [43], MPNN [20], AttentiveFP [21]. The framework was implemented using python software package dgl-Lifesci [44] (the training hyperparameters detailed in Supporting Information: Tables S1-S3). Figure 2 shows the predicted value against the experimental value from the original test datasets. The predicted values of a better model would populate more closely to the diagonal line (y = x). All four tested models demonstrate excellent predictive abilities. Among them, our BCSA model with 40× data augmentation achieves the best performance on molecular solubility prediction, reflected by a better correlation and a smaller deviation from the experimental data.

Compare with State-of-the-Art Models
To better demonstrate the competitiveness of our sequence-based model, its performance in solubility prediction was compared with other state-of-the-art graph models, including GCN [43], MPNN [20], AttentiveFP [21]. The framework was implemented using python software package dgl-Lifesci [44] (the training hyperparameters detailed in Supporting Information: Tables S1-S3). Figure 2 shows the predicted value against the experimental value from the original test datasets. The predicted values of a better model would populate more closely to the diagonal line (y = x). All four tested models demonstrate excellent predictive abilities. Among them, our BCSA model with 40× data augmentation achieves the best performance on molecular solubility prediction, reflected by a better correlation and a smaller deviation from the experimental data.

Predicting Other Related Physicochemical Properties
We extended the predictions of other relevant molecular properties, namely the oilwater partition coefficients logP and logD (pH = 7.4), with our BCSA (SMILES × 40) model (the training hyperparameters detailed in Supporting Information: Tables S4 and S5). The logP dataset is still based on the Cui et al. [15] dataset. As shown in Figure 3A, our model achieves an exciting result in the test dataset, with R 2 of 0.99 and RMSE of 0.29. The scatter plot demonstrates that the predicted data achieves excellent fitting throughout the whole range of the experimental logP values. Meanwhile, the logD (pH = 7.4) training dataset was taken from Wang et al. [45]. The dataset is randomly divided into 8:1:1. The training data are obtained using SMILES enumeration [33]

Predicting Other Related Physicochemical Properties
We extended the predictions of other relevant molecular properties, namely the oilwater partition coefficients logP and logD (pH = 7.4), with our BCSA (SMILES × 40) model (the training hyperparameters detailed in Supporting Information: Tables S4 and S5). The logP dataset is still based on the Cui et al. [15] dataset. As shown in Figure 3A, our model achieves an exciting result in the test dataset, with of 0.99 and RMSE of 0.29. The scatter plot demonstrates that the predicted data achieves excellent fitting throughout the whole range of the experimental logP values. Meanwhile, the logD (pH = 7.4) training dataset was taken from Wang et al. [45]. The dataset is randomly divided into 8:1:1. The training data are obtained using SMILES enumeration [33]  In summary, introducing the two attention modules significantly improves the prediction accuracy by 5% in both the verification set and the test set, when compared with pure BILSTM. Moreover, the BILSTM with attention model resulted in larger variances between the predicted and the ground-truth values, a sign of possible overfitting often caused by a small dataset. Therefore, the dataset in this study was enriched by enumeration of SMILES for each molecule. Our result clearly shows that the accuracy, generalizability and overfitting problem are improved with an increasing number of enumerated SMILES strings. Furthermore, our model outperforms three classical graph neural network models (GCN, MPNN, AttentiveFP) in the prediction of aqueous solubility. When trained to predict other relevant properties, logP and logD, our model also appears reliable and reaches prediction accuracy of 0.99 and 0.93, respectively.

Discussion
For accurate prediction of aqueous solubility, we proposed an end-to-end deeplearning framework, in short BCSA, which combines a BILSTM neural network and the channel and spatial attention modules. By exploiting the advantages of molecular SMILES strings as training inputs, our BCSA model would be able to capture directly the complex spatial information of connected atoms, which has posed a great challenge in previous attempts at the prediction. The overfitting problem arising from small dataset size is also In summary, introducing the two attention modules significantly improves the prediction accuracy R 2 by 5% in both the verification set and the test set, when compared with pure BILSTM. Moreover, the BILSTM with attention model resulted in larger variances between the predicted and the ground-truth values, a sign of possible overfitting often caused by a small dataset. Therefore, the dataset in this study was enriched by enumeration of SMILES for each molecule. Our result clearly shows that the accuracy, generalizability and overfitting problem are improved with an increasing number of enumerated SMILES strings. Furthermore, our model outperforms three classical graph neural network models (GCN, MPNN, AttentiveFP) in the prediction of aqueous solubility. When trained to predict other relevant properties, logP and logD, our model also appears reliable and reaches prediction accuracy of 0.99 and 0.93, respectively.

Discussion
For accurate prediction of aqueous solubility, we proposed an end-to-end deeplearning framework, in short BCSA, which combines a BILSTM neural network and the channel and spatial attention modules. By exploiting the advantages of molecular SMILES strings as training inputs, our BCSA model would be able to capture directly the complex spatial information of connected atoms, which has posed a great challenge in previous attempts at the prediction. The overfitting problem arising from small dataset size is also circumvented by SMILES enumeration, which effectively enriches the sample size for training. Successful data augmentation in our workflow would possibly be the reason for its superior accuracy over three other commonly used graph-based neural networks. These networks may be further improved via a similar data enrichment process, during which sufficient isomorphous graphs of each molecule need to be generated for training. It should also be noted that variations on the training dataset may have a strong impact on prediction accuracy. Meanwhile, the channel and spatial attention modules facilitate the identification of influential attributes between adjacent atoms in the SMILES, without incurring greater overhead in computation. Encouragingly, our BCSA model does not require additional auxiliary data for predicting logP and logD with even higher accuracy. The prediction accuracy in terms of R 2 is logP > logD > logS. Since most SMILES from the chemical dataset do not offer explicit information of hydrogen (or specifically the ionization and/or tautomerization states), the training input may possess sufficient information for predicting logP, which only accounts for the neutral form of a molecule in both oil and aqueous media. However, the ionized and neutral forms of molecules are effectively different chemical species that cannot be represented with one single SMILES and may have fairly different solubilities. Hence, the prediction performance for logD and logS may further be improved when possible tautomers are considered during the SMILES enumeration step. Nevertheless, with this advanced new algorithm, other properties such as ADMET or DMPK might be potentially predicted accurately as well if the SMILES datasets of the molecule were given. More precisely, our BCSA model, possibly in combination with alternative attention modules, also needs to be evaluated on other datasets in the future for robustness.

Molecular Dataset and Processing
The dataset derived from the work of Cui et al., 2020 [15] contains the 9943 nonredundant compounds. To predict the molecular property value of a compound, its chemical information needs to be represented in a format compatible for machine learning. SMILES format is a common choice for incorporating topological information based on chemical bonding rules. For example, cyclohexane and dioxane may be written as C1CCCCC1 and O1CCOCC1, respectively. As a "chemical language" [19] that encodes the structural information of a molecule into a compact text string under fixed rules and conventions, the SMILES of a molecule comprises simple character(s) for atoms and bonds, among which adjacent characters have high correlation in a chemical sense. Inspired by ref. [30], we tokenized the SMILES strings of drug molecules using the following regular expression: By means of word2vec [46,47], an input SMILES sequence is split into n tokens, which are then embedded into a vector of l dimensions according to the token position. Hence, the tokenized characters are eventually embedded into a 2D feature matrix M ∈ R n×l before training. Word2Vec would encode tokens into dense vectors by learning the association of the context of the SMILES string. Moreover, the dataset was expanded to SMILES enumeration [33] and the SMILES strings which were padded with "padding" to a fixed length of 150 tokens. The excess characters beyond this length were discarded directly. Finally, the dataset was randomly split into a training (80%) set and validation (10%) set and test (10%) set, respectively.

Model Building
Here we provide an overview of the proposed BCSA framework (Figure 4) and introduce the core methods in our model. In addition, we specify the implementation details and the evaluation criteria.

Model Building
Here we provide an overview of the proposed BCSA framework (Figure 4) and introduce the core methods in our model. In addition, we specify the implementation details and the evaluation criteria. Our model architecture consists of three neural networks. The first one used in this work was based on the Long Short-Term Memory (LSTM) architecture [48] which managed the remote relationship in natural-language processing. Each molecular matrix M is composed of n token vectors which are independent of each other. To obtain the correlation between adjacent tokens, a BILSTM is introduced to process a molecule with two hidden LSTM layers (forwards and backwards), which not only can encode information from front to back, but also obtain information from back to front. The SMILES string can be represented in a sequence of token embeddings as = , , , … , . LSTM creates a hidden state ℎ by forgetting the hidden state ℎ and remembering new information from molecular embedding . Adding ℎ ⃗ and ℎ ⃖ becomes a more informative vector to acquire relationships between adjacent token embeddings in a SMILES string. The output of hidden states in every step t can be defined as: where or is the learned weights, all ℎ can be turned into a simple concatenation = ℎ , ℎ , … , ℎ . Generally, C is a simple representation of the last hidden state of the BILSTM encoder.
The next goal is to locate key tokens or features corresponding to certain parts of the molecule that contributes the most to property prediction. Concretely, this helps uncovering the connections between tokens and predicted value. Chemists can then apply such knowledge to design or improve drug compounds. Therefore, we introduced a convolutional block module attention mechanism in BILSTM. In the second network, we embedded the optimized Convolution Block Attention Module (CBAM) mechanism [49] into the current forward sequential neural network. It has two modules including a channel attention map ( ) and a spatial attention map ( ), which exploit the inter-channels and interspatial relationship of features. In other words, from a spatial viewpoint, the channel attention explores the globality of the molecular hidden state ,while the spatial attention focus on local. Both attentions are used in parallel, then add to the outputs which are normalized with sigmoid function to obtain an information-enriched attention map. The overall attention process can be expressed as: Our model architecture consists of three neural networks. The first one used in this work was based on the Long Short-Term Memory (LSTM) architecture [48] which managed the remote relationship in natural-language processing. Each molecular matrix M is composed of n token vectors which are independent of each other. To obtain the correlation between adjacent tokens, a BILSTM is introduced to process a molecule with two hidden LSTM layers (forwards and backwards), which not only can encode information from front to back, but also obtain information from back to front. The SMILES string can be represented in a sequence of token embeddings as M = [x 1 , x 2 , x 3 , . . . , x n ]. LSTM creates a hidden state h t by forgetting the hidden state h t−1 and remembering new information from molecular embedding x t . Adding → h t and ← h t becomes a more informative vector to acquire relationships between adjacent token embeddings in a SMILES string. The output of hidden states in every step t can be defined as: where W e or W v is the learned weights, all h t can be turned into a simple concatenation C = {h 1 , h 2 , . . . , h n }. Generally, C is a simple representation of the last hidden state of the BILSTM encoder. The next goal is to locate key tokens or features corresponding to certain parts of the molecule that contributes the most to property prediction. Concretely, this helps uncovering the connections between tokens and predicted value. Chemists can then apply such knowledge to design or improve drug compounds. Therefore, we introduced a convolutional block module attention mechanism in BILSTM. In the second network, we embedded the optimized Convolution Block Attention Module (CBAM) mechanism [49] into the current forward sequential neural network. It has two modules including a channel attention map (M c ) and a spatial attention map (M s ), which exploit the inter-channels and inter-spatial relationship of features. In other words, from a spatial viewpoint, the channel attention explores the globality of the molecular hidden state C,while the spatial attention focus on local. Both attentions are used in parallel, then add to the outputs which are normalized with sigmoid function to obtain an information-enriched attention map. The overall attention process can be expressed as: where ⊗ denotes element-wise multiplication, σ denotes the sigmoid function, and C is the output of the overall attention model, respectively. The following introduces the details of each attention module. Channel attention module: This module focuses on 'what' is meaningful given as SMILES tokens. To extract channel attention more efficiently, the dimension of the input hidden layers feature vector C requires further reduction. Woo et al. [49] proposed using both average-pooled and max-pooled features, which greatly improves performance of the model. The training vectors of BILSTM were first aggregated using average-pooling and max-pooling. Two different descriptions were then generated: C avg and C max , which represent the average-pooled and max-pooled the training features vectors, respectively. Next, these descriptions were taken forward to a two-layers shared MLP and merging the output of average-pooled vectors (M avg (C)) and max-pooled vectors (M max (C)), using element-wise summation.
where σ denotes the ReLU function [50] which reduces the overhead of the network, and W 0 , W 1 are the related MLP weights. Spatial attention module: This module focuses on the informative part of the SMILES word vector. Here it comprised a two-layers and 1-dimensional Convolutional Neural Networks. The Spatial attention component is computed as where σ denotes the ReLU function, and Conv1d 7,x is a 1-D convolution layer with x filters and seven kernels in total. Finally, the attention values were broadcasted accordingly during element-wise multiplication operation, using the following formula: where O denotes the aggregated output of the overall attention model by a 1-D averagepooling operation.
In the third network, the abovementioned output vectors O were fed into two fully connected dense layers to predict the molecular property values. ReLU was used as the activation function, which has been widely adopted in deep-learning research. In addition, the dropout layers were also added to avoid the overfitting. The model was trained by minimizing the mean square error (MSE) in a loss function defined as: whereŷ i is the predicted value, and y i is the ground-truth solubility values of N molecules.

Hyperparameter Search
Several components, known as hyperparameters, were used to control the behavior of the learning algorithm in our model. The performance of the model can fluctuate significantly with different parameters. Here we can use Bayesian optimization [51] to select efficiently the best parameters. During optimization, the target function for minimization was defined as whereŷ i is the predicted value, y i is the ground-truth value and y is the mean value of N molecules. We performed a TPE (Tree-structured Parzen Estimator Approach) search [52] on hyperparameter space as shown in Table 1. In short, 100 models were trained for 60 epochs and an early stop strategy (patience = 20) was set, in order to accelerate training speed. The best hyperparameter space was determined when the model yields R 2 value closest to 1 using in the training set ( Table 1). The hyperparameters were later used in the validation step. If needed, the accuracy of this model could further be improved by selecting the best hyperparameter set from 30 additionally trained epochs. The framework was performed using Pytorch and all trainings were finished in a Linux server (openSUSE 15.2): Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00 GHz and Nvidia GeForce RTX 2080 Ti graphics card with 11G. This machine is located internally in the Shenzhen Institute of Advanced Technology, CAS (Shenzhen, China).

Evaluation Metrics
Four performance indicators commonly used in the regression task were chosen to assess our model, including: R − Squared(R 2 ), spearman, RMSE, MAE. R 2 , spearman can monitor the good-fit capacity of our model to the dataset. Better fitting is expected when these indicators approach 1. RMSE, MAE keep track of the deviations between predicted values and experimental values. Good concordance is achieved where these indicators approach 0.
Supplementary Materials: The following supporting information can be downloaded online: Table S1: The training hyperparameters of the GCN model;  Funding: The work of Shuguang Yuan was supported by funding from Chinese Academy of Sciences, the Shenzhen Institute of Advanced Technology, CAS, Shenzhen government (grant no. JCYJ20200109114818703) as well as that from Guangdong province (grant no. 2019QN01Y306). The work was also supported by Interdisciplinary Centre for Mathematical and Computational Modeling in Poland (grant no. G07-13, GB70-3, GB84-13). We also would like to thank the support from the following grants: Shenzhen Key Laboratory (ZDSYS20201230165400001); Guangdong Retired Expert; The Chinese Academy of Science President's International Fellowship Initiative(PIFI)(2020FSB0003) as well as Shenzhen Pengcheng Scientist.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The SMILES enumeration datasets and predicted values are available at https://github.com/summer-cola/SMILES-enumeration-datasets (accessed on 14 January 2022). The logS, logP, logD source datasets are available from the literatures [15,45] and were used as provided.