Virtual Screening for Reactive Natural Products and Their Probable Artifacts of Solvolysis and Oxidation

Chemically unstable natural products are prone to react during extraction, purification, or identification and to turn into contaminants known as “artifacts”. However, identifying artifacts requires considerable investment in technical equipment, time, and human resources. To reveal these reactive natural products and their artifacts by computational means, we set up a virtual screening system that searches for cases in a biochemical database. The screening system is based on deep learning models that predict the two main classes of reactions converting natural products into artifacts, namely solvolysis and oxidation. A set of result data was reviewed to check the validity of the screening system, and we screened out a batch of reactive natural products and their probable artifacts. This work provides some insight into the formation of natural product artifacts, and the result data may serve as warnings against the improper handling of biological matrices in multicomponent extraction.


Introduction
The large diversity of natural products from a single biological source makes multicomponent extraction difficult. Exploring undiscovered natural products is largely a black box, and the choices of handling, storage, and analysis of a biological matrix are often empirical or semiempirical. Under such circumstances, some chemically unstable molecules are prone to react during extraction and turn into contaminants, so-called "artifacts" [1]. Artifacts are the products of non-enzymatic reactions during extraction or purification [2][3][4]. In most cases, these reactions occur between natural products and solvents or chromatography media [1][2][3][4]. Oxidation of natural products on exposure to air or light is also common [3,4]. Artifacts are usually revealed by chemical experiments on a case-by-case basis, whereas an extensive search for artifacts may rely on computational approaches such as virtual screening or data mining. Few extensive searches have been done because of technical limitations and the lack of specialized data resources in natural product chemistry [5,6].
New computational approaches to chemistry have boomed in the last decade, particularly applications of machine learning (ML) [7,8]. Researchers have applied ML algorithms such as deep learning (DL) to design novel molecules, predict chemical properties, or plan reaction paths [9][10][11][12][13][14][15]. In some studies, neural networks were applied to reaction prediction by ranking electron sources and sinks or by generating reaction fingerprints [10,14]. Further applications regarded chemical reactions as transformations [13]. These transformations can be considered translations from reactants to products, where the "language" being translated is the structural representation of molecules. Likewise, the formation of a natural product artifact can be regarded as a transformation from a natural product to an artifact, and such transformations can be predicted by computational approaches. With ML, we might have an efficient and convenient way to identify specific classes of reactions without building complex models. Therefore, we set out to seek more cases of chemically reactive natural products and their probable artifacts that have not been documented. We set up a deep-learning-based virtual screening system for discovering these extraordinary natural products in a specialized data set.

Materials and Methods
According to studies that reported artifacts, the transformations of natural products into artifacts are reactions in specific classes (e.g., solvolysis and oxidation) [1][2][3][4]. Identifying these reactions means seeking out reactive natural products (the reactants) and their artifacts (the products), and computational approaches may be preferable to chemical experiments in terms of investment in technical equipment, time, and human resources. We therefore use virtual screening, which is suited to searching for and discovering exceptional molecules in a database, to target reactive natural products. The core idea of the virtual screening used in this study is to determine the specific classes of reactions that cause artifacts. We realized this idea by using ML to predict the probable products of these reactions. If a natural molecule and its predicted product are derived from the same biological source, we have a theoretical clue that the molecule is a reactive natural product and the predicted product is its artifact. Therefore, we can seek potential cases by checking for the existence of these reactions in a specialized data set. The specialized data set used in this study is a biochemical database (http://www.organchem.csdb.cn/scdb/NPBS) [16]. In this data resource, the relational data (the relationship between a specific biological source and all the natural products derived from it, as reported by various studies) cover sufficient natural products from various biological sources. An example of a set of relational data, listed in Table 1, describes 10 natural products from Thalictrum delavayi. More detailed example data are included in the Supplementary Materials.
We assumed that a small fraction of the natural products were reactive during extraction and that the corresponding artifacts were extracted from the same biological source. In that case, a reactive natural product and its artifact would probably coexist in a set of relational data (reported by one or more studies). The reactive natural product and its artifact would form a reactant-product pair, and the specific reaction could be predicted by our trained models. According to the features of the data set, we designed a virtual screening strategy as follows (also shown in Figure 1):

1. Take a set of relational data (a specific biological source and all the natural products derived from it);
2. Take one of the natural product molecules in this relational data set;
3. Predict its solvolysis and oxidation products by neural network models;
4. If predictions of the models are successful (or partially successful), match the predicted products with the other natural product molecules from the same biological source;
5. If a predicted product matches one of the other natural product molecules, label the natural product and the predicted product as a potential case;
6. Go through steps 2-5 with all the other molecules in the same relational data set;
7. Go through steps 1-6 with all the other relational data sets and screen out all the potential cases in the data set.
Figure 1. Illustration of the virtual screening system for discovering reactive natural products and their probable artifacts [17,18].

In Step 4 of the procedure, the success of a prediction is judged by the validity of the SMILES string generated by the model, as checked by RDKit. In the vast majority of successful cases, only one of the models we built generated a valid SMILES string; such cases are described as "partially successful".
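The seven-step strategy can be illustrated with a minimal Python sketch. It is not the authors' code: the `Predictor` interface, the function names, and the toy data are hypothetical, and the sketch assumes that every molecule is stored as a canonical SMILES string so that string equality stands in for structural identity. A model that returns `None` represents a prediction that failed the validity check of Step 4.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A predictor maps a reactant SMILES to a predicted product SMILES, or to
# None when the model output is not a valid structure (hypothetical interface).
Predictor = Callable[[str], Optional[str]]
Hit = Tuple[str, str, str]  # (reactant, predicted product, model name)


def screen_source(molecules: Iterable[str],
                  models: Dict[str, Predictor]) -> List[Hit]:
    """Steps 2-6: test every molecule of one biological source."""
    pool = set(molecules)  # assumed to hold canonical SMILES strings
    hits = []
    for reactant in sorted(pool):
        for name, predict in models.items():
            product = predict(reactant)                   # step 3
            if product is None:                           # step 4: model failed
                continue
            if product != reactant and product in pool:   # step 5
                hits.append((reactant, product, name))
    return hits


def screen_database(relational_data: Dict[str, List[str]],
                    models: Dict[str, Predictor]) -> Dict[str, List[Hit]]:
    """Steps 1 and 7: repeat the screen for every biological source."""
    results = {}
    for source, molecules in relational_data.items():
        hits = screen_source(molecules, models)
        if hits:
            results[source] = hits
    return results


def toy_methanolysis(smiles: str) -> Optional[str]:
    """Stand-in for a trained model; it only knows one transformation."""
    return {"CC(=O)O": "CC(=O)OC"}.get(smiles)


# Invented toy data: acetic acid and methyl acetate coexist only in Source A.
toy_data = {
    "Source A": ["CC(=O)O", "CC(=O)OC", "CCO"],
    "Source B": ["CC(=O)O", "CCO"],
}
potential_cases = screen_database(toy_data, {"methanol": toy_methanolysis})
```

In this toy run only Source A yields a potential case, because the predicted product is absent from Source B. Looping over all models and skipping only the failed ones mirrors the "partially successful" situation, in which a single model returns a valid structure.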
Available information on transformations from natural products to artifacts is rare and implicit in the literature. A set of preliminary data was extracted from studies where such information was available [1][2][3][4]. The preliminary data set consists of paired molecules, each pair being a natural product and the artifact it transforms into. With the knowledge of these transformations from the preliminary data set, we extended analogous transformations to common chemical reactions of specific classes from a reaction database [19]. The reactions were classified by the two main causes of artifacts: solvolysis and oxidation [1][2][3][4]. Solvolysis reactions are those in which compounds react with or in solvents. Solvents or media such as methanol, ethanol, acetone, dichloromethane, chloroform, and water are commonly used in natural product extraction [1][2][3]. Oxidation reactions are those in which compounds are transformed into oxides under the effect of air, light, or heat [4]. The training data set was made up of reactants (excluding solvents, catalysts, and other participants) and products (excluding by-products) from the reaction data set. We used these data to train our deep-learning-based approach. To normalize the data, the structural representations of reactants and products were canonicalized SMILES strings with an implicit representation of hydrogen atoms [10,20]. The processed data set is included in the Supplementary Materials.
Convolutional neural networks (CNNs) are deep learning architectures well suited to the translation of variable-length sequences such as text sentences [21,22]; herein, we extend such techniques to SMILES strings of molecular structures. As stated above, the core idea of the virtual screening is to determine the specific classes of reactions that cause artifacts, and we realized this idea by using CNN models to predict the probable products of these reactions. Thus, we applied an attention-based CNN model for predicting the reactions of natural products to artifacts [23]. We treated the transformation of SMILES strings as language translation, taking the reactants as source sentences and the products as target sentences. The neural network model conceptually consists of four elements: an encoder of three one-dimensional CNN layers that encodes the input character sequence, a decoder of three one-dimensional CNN layers that turns the target sequences into the same sequence but offset by one timestep in the future, attention mechanism layers that take the outputs of the encoder and decoder, and a decoder of two one-dimensional CNN layers that decodes the output character sequence, as shown in Figure 2. The input SMILES strings of natural products are transformed into sets of embedding vectors. The number of vectors equals the number of unique characters in all input SMILES strings, and these vectors are provided as input to the encoder-decoder model with attention mechanism. The output SMILES strings are recovered from the predicted sequences by re-embedding.
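The character-level framing of SMILES translation can be made concrete with a short sketch. It is an illustration under stated assumptions rather than the authors' preprocessing: the one-hot encoding, the tab and newline start/end markers, and all function names are assumptions chosen for clarity. The sketch shows how a character vocabulary is built and how a decoder target is the decoder input offset by one timestep.

```python
from typing import Dict, List, Tuple

START, END = "\t", "\n"  # assumed start/end markers, not taken from the paper


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map every unique character (plus the two markers) to an index."""
    chars = sorted(set("".join(smiles_list)) | {START, END})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot(text: str, vocab: Dict[str, int], length: int) -> List[List[int]]:
    """Encode text as `length` rows; rows beyond the text stay all-zero."""
    rows = [[0] * len(vocab) for _ in range(length)]
    for t, ch in enumerate(text):
        rows[t][vocab[ch]] = 1
    return rows


def decoder_pair(product: str) -> Tuple[str, str]:
    """Return (decoder input, decoder target); the target is one step ahead."""
    framed = START + product + END
    return framed[:-1], framed[1:]


vocab = build_vocab(["CCO", "CCN"])        # indices for '\t', '\n', C, N, O
encoded = one_hot("CO", vocab, length=3)   # padded to three timesteps
dec_in, dec_target = decoder_pair("CCO")   # '\tCCO' and 'CCO\n'
```

Note that a character-level vocabulary splits two-letter element symbols such as Cl or Br into separate characters; whether the original models handled these specially is not stated in the text.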
The models were trained on seven classes of reaction from the training data set: solvolysis by methanol, ethanol, acetone, dichloromethane, chloroform, and water, and oxidation. The training data for the CNN models were from the reaction data set described above. We split the data set at random into 80% for the training set and 20% for the validation set. We took the reactants of the reaction data as source data and the products as target data. The parameters of the neural networks were chosen according to performance on the validation set (key hyperparameters of the best-performing CNN models are listed in Table 2), and other parameters were left at the default settings of the neural network architecture used [21][22][23][24][25][26]. We obtained the top percentages of correctly predicted products in the seven classes, as listed in Table 3. We used the best-performing models to predict the potential transformations of natural products to artifacts. The models were implemented in Python 3.7 using Keras 2.3 with the TensorFlow backend [24][25][26]. The Python code for generating the neural network models is included in the Supplementary Materials. We applied RDKit in Python for generating SMILES strings and processing molecular structures [27].
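A random 80/20 split of reactant-product pairs can be sketched in a few lines. The function name, the fixed seed, and the toy pairs are assumptions added for reproducibility of the illustration; the text does not say how the split was implemented.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (reactant SMILES, product SMILES)


def split_pairs(pairs: Sequence[Pair], train_fraction: float = 0.8,
                seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle a copy of the pairs and cut it into training and validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Ten invented pairs, used only to show the resulting set sizes.
toy_pairs = [("C" * n + "O", "C" * n + "OC") for n in range(1, 11)]
train_set, validation_set = split_pairs(toy_pairs)
```

One such split would be made for each of the seven reaction classes, since each class has its own model.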
Success: percentage of valid SMILES strings for molecular structure generated by the models; Concordance: average sequence match ratio of target and predicted SMILES strings (0 = totally different, 1 = exact match); Accuracy: percentage of predicted SMILES strings that are chemically identical to the target (same InChIKey).
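The three measures defined above can be computed along the following lines. This is a sketch under assumptions: the sequence match ratio is taken to be the ratio from Python's `difflib.SequenceMatcher`, concordance is averaged over all pairs, and the validity and identity checks are passed in as functions. In the study these checks were done with RDKit and InChIKey comparison; the toy stand-ins below are not chemically meaningful.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def evaluate(pairs: List[Tuple[str, str]],
             is_valid: Callable[[str], bool],
             same_structure: Callable[[str, str], bool]) -> Dict[str, float]:
    """Score (target, predicted) SMILES pairs; results are fractions in [0, 1]."""
    n = len(pairs)
    success = sum(1 for _, pred in pairs if is_valid(pred)) / n
    concordance = sum(
        SequenceMatcher(None, target, pred, autojunk=False).ratio()
        for target, pred in pairs
    ) / n
    accuracy = sum(
        1 for target, pred in pairs
        if is_valid(pred) and same_structure(target, pred)
    ) / n
    return {"success": success, "concordance": concordance, "accuracy": accuracy}


def toy_is_valid(smiles: str) -> bool:
    """Toy check only: non-empty string with balanced parentheses."""
    return bool(smiles) and smiles.count("(") == smiles.count(")")


def toy_same_structure(a: str, b: str) -> bool:
    """Toy check only: plain string equality instead of InChIKey equality."""
    return a == b


scores = evaluate([("CCO", "CCO"), ("CCO", "CCN")],
                  toy_is_valid, toy_same_structure)
```

Multiplying the success and accuracy fractions by 100 gives percentages as in the definitions above.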

Results and Discussion
We first obtained a set of natural products and successfully predicted products from the seven CNN models. This first result data set consists of the molecular information of the natural products and predicted products, along with the specific CNN model that generated the SMILES string of each predicted product. Combined with biological source information in our virtual screening system, these pairs form a group of reactive natural products and their probable artifacts, according to the theoretical base of this work. Results from the virtual screening system were reviewed to check the validity of our approach and to seek positive data. We eventually screened out 118 cases of reactive natural products and their probable artifacts from the biochemical database. The result data set consists of reactive natural products, probable artifacts, biological sources, probable causes, and references (data sources for biological sources and natural products). The complete result data sets and the trained model files of this work are included in the Supplementary Materials. Some of the cases are shown in the following figures as typical examples, and the original images of these figures are also included in the Supplementary Materials as ChemDraw files.

Conclusions
The architecture of the neural networks (CNNs) is well suited to the translation of variable-length sequences, such as text sentences and, as used in this work, the SMILES strings of molecular structures. However, there may be practical limitations for wider chemical spaces, since CNNs are more applicable to the translation of short sentences [52]. For large molecules or synthetic reactions, the length of the SMILES strings and the complexity of the data space restrict such techniques and prevent them from being used in wider applications.
The transformations (or reactions) from natural products to artifacts predicted by neural networks remain at a superficial level: the predictions lack information on chemical mechanism, and the virtual screening strategy relies on relational data and assumptions. The potential reactivity of molecules determined just by inspection of data may be without chemical proof, and some products of the transformations may not actually be natural product artifacts. For example, some oxidized variants of natural products are either secondary metabolites themselves or result from further metabolism in the producing organism, which detoxifies a compound or prepares it for excretion; therefore, it may be arbitrary to suggest that the oxides are all artifacts. Nevertheless, the results of this work provide some insight into the formation of natural product artifacts.
Although artifacts are unexpected contaminants, exploiting these transformations may inspire the synthesis of new chemical diversity. The result data with biological source information can serve as warnings against the improper handling of biological matrices in multicomponent extraction. This work is far from authenticating the artifacts experimentally, and some of the transformations seem impossible, but we hope the relationships and information obtained from the specialized data set provide some knowledge of reactive natural products and their artifacts in natural product chemistry.