Traditional Machine and Deep Learning for Predicting Toxicity Endpoints

Molecular structure property modeling is an increasingly important tool for identifying compounds with desired properties, given the expensive and resource-intensive nature of drug discovery and development and the problem of toxicity-related attrition in its late phases. Lately, interest in applying deep learning techniques has increased considerably. This investigation compares the traditional physico-chemical descriptor and machine learning-based approaches, via autoencoder-generated descriptors, to two different descriptor-free, Simplified Molecular Input Line Entry System (SMILES)-based deep learning architectures of Bidirectional Encoder Representations from Transformers (BERT) type, using the Mondrian aggregated conformal prediction method as the overarching framework. For the binary CATMoS non-toxic and very-toxic datasets, the results show that on the former, almost equally balanced, dataset all methods perform equally well, while on the latter dataset, with an 11-fold difference between the two classes, the MolBERT model based on a large pre-trained network performs somewhat better than the rest, with high efficiency for both classes (0.93–0.94) as well as high values for sensitivity, specificity and balanced accuracy (0.86–0.87). The descriptor-free, SMILES-based deep learning BERT architectures seem capable of producing well-balanced predictive models with defined applicability domains. This work also demonstrates that the class imbalance problem is gracefully handled through the use of Mondrian conformal prediction, without over- and/or under-sampling, weighting of classes or cost-sensitive methods.


Introduction
Drug discovery is an expensive and resource-intensive process that involves many challenges, not least within the area of toxicity [1]. A growing concern in drug development is the high attrition rate in late clinical trials due to issues related to toxicity [2]. Computational methods, e.g., computer-aided drug design, have thus become standard tools for improving the efficiency of the drug discovery process and mitigating undesirable toxicity effects [3][4][5][6].
Methods such as molecular structure property modeling have been used for decades in the pharmaceutical field in order to predict important properties, e.g., biological activity, solubility and toxicity, for prioritization of compounds with respect to potential toxicity issues and experimental testing [7]. For a recent review on computational tools, see reference [8]. The increasing focus on identifying undesirable toxic effects for chemical structures of interest, real or virtual, is manifested by recent publications such as [9,10] and recent reviews on machine learning (ML) techniques by Dara and co-workers [11] and by Matsuzaka and Yashiro [12].
During the last few years, deep learning (DL) techniques have successfully been applied in various domains, e.g., natural language processing [13] and image analysis [14], which has increased the interest in applying such methodologies to molecular property prediction as well [15].
The Mistra SafeChem research program, financed by Mistra (The Swedish Foundation for Strategic Environmental Research, Stockholm, Sweden), has the overarching aims of creating a sustainable chemical industry and reducing exposure to hazardous substances [28]. One activity, among many other objectives and activities, is to develop and use in silico predictive models for early prediction and verification of hazardous properties of new molecules or materials. This includes the development of molecular property models for predicting toxicity, e.g., acute toxicity.
The well-known CATMoS benchmark dataset [29] was used in this study for modeling acute toxicity, and the results from both traditional machine learning and deep learning are presented.

CATMoS Datasets
The dataset can be downloaded from reference [29]. The training and evaluation sets as originally defined in [29] were used. The training, validation and conformal prediction (CP) calibration sets were randomly selected from the original training set.
Two different binary classification sets were investigated: the very toxic (VT; catmos_vt) and the non-toxic (NT; catmos_nt) sets. Table 1 shows the number of compounds in each set after standardization. The structures, represented as SMILES, were standardized using the "remove_salt_stereo" and "organic_filter" functions of the "preprocessing.py" script found in the Continuous and Data-Driven Descriptors (CDDD) GitHub repository [30] and neutralized, followed by RDKit SMILES tautomer standardization [31]. Additional structures were excluded due to MolBERT sequence length errors.
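A standardization step of this kind can be sketched with RDKit's rdMolStandardize module. This is only an illustrative approximation of the pipeline described above (the actual work used the CDDD preprocessing functions for salt and stereochemistry removal); the function name standardize_smiles is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize_smiles(smiles):
    """Largest-fragment selection (salt removal), charge neutralization
    and tautomer canonicalization; returns None for unparsable input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.FragmentParent(mol)            # strip salts/counter-ions
    mol = rdMolStandardize.Uncharger().uncharge(mol)      # neutralize charges
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(mol)

# Sodium acetate is reduced to its neutral parent, acetic acid.
print(standardize_smiles("CC(=O)[O-].[Na+]"))  # CC(=O)O
```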

RDKit Descriptors
A total of 96 different physico-chemical descriptors were calculated using the MolecularDescriptorCalculator in RDKit [31] (the list of calculated descriptors is available in Supplementary Materials S2).

CDDD Descriptors
CDDD descriptor vectors of length 512 were calculated using the code and model from the CDDD GitHub repository [30]. The RNN architecture-based translation model, which translates a SMILES string into its canonical SMILES, was used.

Traditional Machine Learning
The binary classification models using RDKit and CDDD descriptors, respectively, were built using the scikit-learn RandomForestClassifier [32] with default options (as part of the conformal prediction model generation; see Section 2.5 for details).
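As an illustration, the underlying classifier can be set up as follows, with synthetic data standing in for the RDKit or CDDD descriptor matrices (the array shapes and seed are arbitrary). The class probabilities from predict_proba are what later enter the conformal prediction calibration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 96))     # e.g., 96 RDKit descriptors per compound
y_train = rng.integers(0, 2, size=200)   # binary toxicity labels

clf = RandomForestClassifier(random_state=0)  # default options, as in the study
clf.fit(X_train, y_train)

# Per-class probabilities for new compounds; each row sums to 1.
proba = clf.predict_proba(rng.normal(size=(5, 96)))
```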

Deep Learning
Two different approaches were used for generating BERT models.

MolBERT
The MolBERT binary classification models were fine-tuned using the MolBERT GitHub repository code [33] and the pre-trained model available at reference [34]. The fine-tuning network consisted of a single linear layer connected to the pooled transformer output. Default settings were used for fine-tuning. A validation set was used to avoid overfitting, together with check-pointing to save the "best" model according to the validation set. Ten models were built using different initialization seeds.
A second set of models was fine-tuned using a model pre-trained on 500 k randomly selected PubChem compounds, with a validation set of 50 k compounds.

Molecular-Graph-BERT
The Molecular-graph-BERT binary classification models were fine-tuned using the Molecular-graph-BERT GitHub repository code [35] and a Molecular-graph-BERT model pre-trained on 500 k PubChem compounds. The fine-tuning network consisted of a two-layer fully connected neural network attached to the transformer encoder layer output. Default settings were used for fine-tuning. A validation set was used to avoid overfitting, together with check-pointing to save the "best" model according to the validation set. Ten models were built using different initialization seeds.

Conformal Prediction
A conformal predictor (CP) is a member of a family called confidence predictors [36]. These predictors have several useful properties for prediction tasks in biomedical research [37]. A particularly useful property of conformal prediction is that the method will result in valid predictions at a user-defined significance level, i.e., a level of acceptable percentage of errors, given that the data is exchangeable. This property of validity is based on a mathematical proof by Vovk and co-workers [36]. In this investigation, we used Mondrian (inductive) conformal prediction, which guarantees validity for each of the two classes independently, and finally median-aggregated the conformal prediction outcomes from the 10 developed models for each compound and each class for the final class assignment [38].
The nonconformist package [39], in which scikit-learn algorithms such as the RandomForestClassifier serve as base classifiers, was employed to produce the conformal prediction results. The expression condition = lambda x: x[1] was used in the ICP function in order to enable Mondrian conformal prediction in the nonconformist package.
An in-house script was employed to convert the predictions (scores) from the BERT models into conformal prediction results. This conversion involves calibrating the output for each compound and each label (class) in the evaluation set against the output predictions for the compounds of the corresponding class in the calibration set.
A CP binary classification problem has four possible outcomes: a test (evaluation) compound can be assigned a label of either of the two classes, assigned both labels ("both" classification) or none of the labels ("empty" classification). For a detailed description of how this calibration is performed, see Norinder and co-workers [40].
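The per-class calibration described above can be sketched in plain Python. This is a simplified, hypothetical illustration rather than the in-house script: for each candidate label, the nonconformity score of a test compound (here taken as one minus the predicted probability of that label's class) is ranked against the calibration scores of the same class to give a p-value, and the label is included in the prediction set whenever its p-value exceeds the significance level. This construction yields exactly the four possible outcomes: either single label, "both", or "empty".

```python
def mondrian_p_value(test_score, calibration_scores):
    """Fraction of same-class calibration nonconformity scores that are
    at least as large as the test score, with the usual +1 correction."""
    n_ge = sum(1 for s in calibration_scores if s >= test_score)
    return (n_ge + 1) / (len(calibration_scores) + 1)

def predict_set(prob_class1, cal_scores_by_class, significance):
    """Return the Mondrian CP label set at the given significance level.
    Nonconformity = 1 - predicted probability of the candidate class."""
    probs = {0: 1.0 - prob_class1, 1: prob_class1}
    labels = set()
    for label, cal_scores in cal_scores_by_class.items():
        if mondrian_p_value(1.0 - probs[label], cal_scores) > significance:
            labels.add(label)
    return labels

# Toy per-class calibration scores (from a held-out calibration set).
cal = {0: [0.1, 0.2, 0.3, 0.6, 0.9], 1: [0.05, 0.15, 0.4, 0.5, 0.8]}
print(predict_set(0.95, cal, 0.2))  # confident prediction -> {1}
```

At a looser significance level an uncertain compound receives both labels, and at a very strict one it may receive an empty set, which is how the framework exposes its applicability domain.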
A flow chart overview of the employed machine learning approaches and calibration is depicted in Figure 1. Validity and efficiency are two key measures in conformal prediction. Validity, for each class, is the percentage of correct predictions in Mondrian conformal prediction at a given significance level, where a prediction is correct if it contains the correct class. Thus, in binary classification the "both" classification is always correct (it contains both available labels) while the "empty" classification is always erroneous (it contains no labels). Models were considered valid when the resulting error rate did not exceed the set error rate (significance level) by more than 2.5%.
Efficiency, for each class, is defined as the percentage of single-label predictions (only one label), regardless of whether the prediction is correct or not, at a given significance level. High efficiency is therefore desirable, since a larger number of predictions are single-label predictions, which are more informative as they contain only one class.
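Given the prediction sets, the two measures can be computed directly per class; a minimal sketch (the function name is illustrative):

```python
def validity_and_efficiency(prediction_sets, true_labels, target_class):
    """Validity: fraction of compounds of target_class whose prediction set
    contains the true class (so a 'both' set is always correct and an empty
    set always an error). Efficiency: fraction receiving a single label."""
    rows = [(pred, y) for pred, y in zip(prediction_sets, true_labels)
            if y == target_class]
    validity = sum(1 for pred, y in rows if y in pred) / len(rows)
    efficiency = sum(1 for pred, _ in rows if len(pred) == 1) / len(rows)
    return validity, efficiency

sets = [{1}, {0, 1}, set(), {0}, {1}]   # CP outcomes for five compounds
labels = [1, 1, 1, 0, 0]                # their true classes
print(validity_and_efficiency(sets, labels, target_class=1))
```

For class 1 above, two of three prediction sets contain the true label (validity 2/3) but only one is a single-label prediction (efficiency 1/3), illustrating why high efficiency, not only validity, is needed for informative models.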

Results and Discussion
The aim of this study is to investigate how different molecular representations and algorithmic approaches may affect the predictive performance of the derived models. The present study therefore compares results from the traditionally used Random Forest/physico-chemical descriptor approach, through an intermediate Random Forest/autoencoder representation, to deep learning BERT/molecular-graph-based approaches.
The results from the study are shown in Table 2 and Supplementary Materials Table S1 and depicted in Figures 2–5. From Figure 2, it can be noted that most methods, both ensemble and single models, produce valid models for significance levels 0.1–0.3, with the exception of one CDDD ensemble model and two single MolBERT models based on the PubChem pre-trained model. Figure 3 shows that both RDKit approaches and the BERT ensemble approaches, as well as many single models, result in valid models for both datasets at significance levels of primary interest, e.g., with error levels (significance levels) set at 10, 15 and 20%.
[Figure caption: Evaluation set efficiency for class "1" for the 2 datasets (NT model upper row, VT model lower row), at significance levels 0.1–0.2, for each method. Class "1": non-toxic class and very toxic class for the 2 datasets nt and vt, respectively. Methods: cddd = RF/CDDD 10 models, mg_bert = Molecular-graph-BERT/SMILES 10 models, molbert = MolBERT/SMILES 10 models, molbert_p = MolBERT/SMILES 10 models with PubChem pre-trained model, rdkit = RF/RDKit 10 models; xxx_1 is the corresponding approach based on only 1 model.]
Figures 4 and 5 show that the efficiencies, i.e., the percentage of single-label predictions, for both class 1 and class 0 at an acceptable error level of 10% are, on average, only 60–70%, which, in most cases, cannot be considered sufficient. At a 15% acceptable error rate, both the MolBERT and RDKit approaches show efficiencies for both classes close to or above 80%, which ensures a large portion of single-label predictions. The efficiency is further increased at a 20% acceptable error rate, where all approaches have efficiencies well above 80% for both classes. The performance is generally somewhat better for the catmos_vt dataset compared to the catmos_nt dataset.
The class distribution (class "0"/class "1") is ~1.6:1 for catmos_nt and ~11:1 for catmos_vt, which means that the former dataset is rather balanced while the latter is rather unbalanced. This imbalance may, in turn, make it difficult for ML algorithms to properly handle the minority class [41][42][43].
From the catmos_nt results presented in Table 2, it can be noted that all of the models developed in this study perform similarly with respect to the SE/SP balance, with an absolute average difference of 0.039.
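For reference, sensitivity (SE), specificity (SP) and balanced accuracy (BA) follow the standard definitions over the confusion counts; a small sketch with hypothetical counts (not values from Table 2):

```python
def se_sp_ba(tp, fn, tn, fp):
    """Sensitivity, specificity and balanced accuracy from confusion counts."""
    se = tp / (tp + fn)        # true positive rate
    sp = tn / (tn + fp)        # true negative rate
    return se, sp, (se + sp) / 2.0

se, sp, ba = se_sp_ba(tp=86, fn=14, tn=88, fp=12)
print(round(se, 2), round(sp, 2), round(ba, 2))  # 0.86 0.88 0.87
```

A small |SE - SP| difference, as reported above, means the model errs roughly equally often on both classes, which BA summarizes in a single number.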
The catmos_vt results presented in Table 2 show the graceful handling of the class imbalance in this dataset when using Mondrian conformal prediction, with an absolute average difference between SP and SE of 0.019 for the four models in this investigation (cddd not valid for class 1). Furthermore, this balanced performance was the result of running the Mondrian CP framework in default mode, without the use of over- and/or under-sampling, weighting of classes or cost-sensitive measures. The more balanced results with respect to SE and SP are due to the independent handling of the two classes as part of the Mondrian conformal prediction calibration procedure; see Section 2.5 for more details.
The well-balanced SE/SP performance is of importance from the point of safety, i.e., not to err on the false negative side, when predicting a toxic compound to be non-toxic for the catmos_vt model. It is less of a problem for the catmos_nt model if a few more non-toxic compounds are predicted to be toxic from a safety perspective.
All methods in Table 2 perform equally well on the balanced catmos_nt dataset, while MolBERT based on the larger pre-trained model (method "molbert" in Table 2) seems to perform somewhat better than the other methods on the catmos_vt dataset with respect to SE, SP and the balance between them, based on the BA results of the 10 individual models constituting the molbert ensemble (95% confidence, Mann-Whitney U test with Bonferroni correction for multiple testing).
Advantages of the BERT approaches over the RDKit descriptor-based approach are that they are descriptor-free, in that SMILES are used as input without the need for explicit descriptor generation prior to modeling, and that the results from the models can be projected back onto the atoms of the molecules through their attention mechanisms. The advantages of the RDKit and CDDD approaches are lower computational cost and that smaller datasets can be modeled with acceptable outcomes, compared to deep learning in the form of BERT, which usually requires larger training sets.

Conclusions
The results show, for the binary CATMoS non-toxic, almost equally balanced dataset, that all methods perform equally well. They also show that the MolBERT model based on a larger pre-trained network performs somewhat better than the rest on the binary CATMoS very-toxic dataset, which has an 11-fold difference between the two classes. The descriptor-free, SMILES-based deep learning BERT architectures seem capable of producing well-balanced predictive models with defined applicability domains. This work also demonstrates that the class imbalance problem is gracefully handled through the use of Mondrian conformal prediction, without over- and/or under-sampling, weighting of classes or cost-sensitive methods.