TranScreen: Transfer Learning on Graph-Based Anti-Cancer Virtual Screening Model

Abstract: Deep learning's automatic feature extraction has proven its superior performance over traditional fingerprint-based features in the implementation of virtual screening models. However, these models face multiple challenges in the field of early drug discovery, such as over-training and generalization to unseen data, due to the inherently unbalanced and small datasets. In this work, the TranScreen pipeline is proposed, which utilizes transfer learning and a collection of weight initializations to overcome these challenges. A total of 182 graph convolutional neural networks are trained on molecular source datasets and the learned knowledge is transferred to the target task for fine-tuning. The target task of p53-based bioactivity prediction, an important factor for anti-cancer discovery, is chosen to showcase the capability of the pipeline. Having trained a collection of source models, three different approaches are implemented to compare and rank them for a given task before fine-tuning. The results show improvement in performance of the model in multiple cases, with the best model increasing the area under the receiver operating characteristic curve (ROC-AUC) from 0.75 to 0.91 and the recall from 0.25 to 1. This improvement is vital for practical virtual screening because it lowers the number of false negatives, and it demonstrates the potential of transfer learning. The code and pre-trained models are made accessible online.


Introduction
Drug development is a long and costly process during which a drug candidate is discovered and widely tested to be both effective and safe. This process can take an average of 12 years with billions of dollars spent per drug [1,2]. The early stages of this process involve discovery of a drug candidate which is bio-active towards the targeted disease and is non-toxic for humans. Since High Throughput Screening (HTS) of large molecule libraries for discovery of a potent scaffold is very inefficient, scientists have for decades been working on modeling bio-activity in silico and virtually screening the molecules. Since the screening takes place in simulation with no wet-lab effort, the cost and time of early drug discovery can be drastically decreased.
Traditionally, molecular descriptors and fingerprints are used to extract features from the input molecules, which are then passed to a machine learning model for training. This pipeline has been used for many virtual screening tasks such as kinase inhibition prediction [3], side-effect prediction [4], cytotoxicity prediction [5], and anti-cancer agent prediction [6]. In recent years, deep learning models have proven to be capable, and in some cases superior [7], virtual screening tools for predicting the bio-activity of given molecules. The automatic feature extraction offered by deep learning models has been demonstrated to enable de novo drug design [8], pharmacokinetic profile prediction [9], and bio-activity prediction [10]. Since the performance and accuracy of the screening models have a direct effect on the outcome of drug development pipelines [11], deep learning offers practical virtual screening. However, deep learning models are over-parameterized and data-hungry, and thus face challenges in the virtual screening domain. These challenges are at the heart of what this work examines and aims to address.
One of the main challenges of virtual screening is over-fitting on the imbalanced and small training datasets [12]. In most molecular training datasets, the active molecules are rare and make up the minority distribution in the dataset, with inactive molecules outnumbering them heavily. Moreover, the number of total data points within available datasets is low due to the cost of screening in vitro. The significance of this challenge becomes palpable when a virtual screening model is trained on a non-diverse training dataset and tested on a large and diverse dataset. This scenario is often the case in many virtual screenings for drug discovery [13], and needs to be addressed for models to be practical in real-world applications.
A handful of solutions have been adapted to virtual screening from other domains of deep learning to battle this challenge. For virtual screening in drug discovery, the problem of small and imbalanced data has traditionally been handled using expert-made features [14], and in more recent years a few works have applied multitask learning [15], few-shot learning [16], and unsupervised pre-training [17], with results showing performance improvement or deterioration in various cases. Transfer learning, which is the focus of this work, allows better initialization of the models and alleviates the problems caused by over-parameterization and imbalanced datasets. A wide-scale study of transfer learning, a collection of models to transfer from, and a study of the models' behavior are lacking in the virtual screening domain.
The main aim of this work is to apply transfer learning to virtual drug screening on a wide scale. A p53-based dataset is chosen as the virtual screening task due to its importance for anticancer discovery, the imbalanced property of the dataset, and the fact that high sensitivity to weight initialization is observed in its baseline model. The results are compared to a related work, which uses reinforced multi-task learning to classify the same task [18]. The behaviors of the models are analyzed by ranking the models' predictive capability before they are trained on the target data. The main contributions of this work are:

• TranScreen pipeline: A practical pipeline is developed, which enables the usage of graph convolutional neural networks (GCNNs) for virtual screening and transferring the learned knowledge between multiple molecular datasets.
• Creation of a collection of weights, which can be used as network initializations.
• Comparing three methods of ranking models before fine-tuning takes place to select the model for future tasks.

Overview of the TranScreen Pipeline
The pipeline implemented in this work aims to apply transfer learning to a graph-based virtual screening model in a practical manner. The source datasets (from MoleculeNet) used for transfer learning and the target dataset (related to cancer) offer simplified molecular-input line-entry system (SMILES) strings as input, and bio-activity or inhibition class as output to the models. The datasets are preprocessed and partitioned based on scaffold into training, validation, and test splits. As seen in Figure 1, one common network architecture and set of hyper-parameters are chosen. The source datasets are used to train multiple GCNN models with the common architecture in their related task. The pre-trained networks are then transferred to the target task and the models are fine-tuned. The models are ranked based on how well they perform on the target test dataset. Three different approaches are implemented which use the source data, the deep features of the target validation dataset, or zero-shot inference to predict the rank of the pre-trained models before fine-tuning.
Big Data Cogn. Comput. 2020, 4, 16 3 of 20 Figure 1. The overview of the TranScreen Pipeline. A common network architecture is determined and used to train multiple source models. The pre-trained models are fine-tuned on the target dataset. Three approaches are implemented to predict how well these pre-trained models can perform on the target task before fine-tuning.

Source Data
MoleculeNet [19] is a large-scale molecular database designed to enable machine learning model creation for molecular tasks. This database offers a unique collection of multiple tasks and diverse molecules, which makes it an ideal choice for transfer learning. The source datasets used in the TranScreen pipeline are chosen from MoleculeNet datasets that have SMILES information. These source datasets originate from six different datasets, namely PCBA, MUV, HIV, BACE, Tox21, and SIDER, consisting in total of 182 tasks (assays) and 582,914 molecules. The datasets are not combined; instead, each task is taken as an independent dataset, creating 182 different source datasets which can be used to train the same number of source models. Detailed statistics for each dataset are provided in Table 1. The names of the tasks are listed in Appendix B.

Target Data
The p53 gene is mutated in roughly 50-60% of all human cancers [20], making it an important target for the understanding and treatment of cell abnormality. As mentioned in Appendix A, it is not entirely clear which pathways become involved after p53 mutation and loss of function. Consequently, not all molecular targets of the resultant cancers are known, and prediction of candidate drugs is not always feasible. We therefore identify four p53-based datasets (PCBA-aid902/aid903/aid904/aid924) in which high throughput screening assays were performed to discover potential anticancer compounds. Since the molecules need to be potent against a cell line with no p53 expression (complete loss of function), which is still mutated and cancerous, PCBA-aid904 [21], which used a p53 null cell in a non-permissive temperature assay, is chosen as the target dataset. By doing so, scaffolds selectively inhibiting cancer cells with loss of p53 functionality can be predicted. Since the target data is from the PCBA dataset of MoleculeNet, it is removed from the source data. The three other tasks, namely PCBA-902, PCBA-903, and PCBA-924, are also removed due to their close relation to the target dataset. The information regarding the target dataset is presented in Table 2. Due to the low number of active compounds, this dataset is highly imbalanced and only 0.12% of the molecules show bio-activity.
During preprocessing, SMILES with bio-activity data are read from each dataset. The input SMILES do not need to be canonical since the model will rearrange the input atoms in a preset order. Chirality information can be included in each input in order to distinguish between isomers. For each task, 80% of the compounds are selected as training samples, while the rest are partitioned into test and validation sets.
Dataset partitioning is done with respect to the scaffolds of the molecules, which ensures that similar molecules are put in the same data split and increases the variation between the training, test, and validation splits.
The validation set is used to tune the training process and to find the optimum model. If this validation set is chosen at random from a non-diverse dataset, the model may be prone to memorizing specific features that represent the homogeneous training distribution. Random partitioning therefore decreases the trained model's ability to be applied to unseen test datasets. For this reason, dataset partitioning is implemented based on the scaffolds of the molecules to improve the generalization abilities of the learned representations and to increase the practicality of the model in real-world scenarios.
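The grouping logic behind a scaffold-based partition can be illustrated with a minimal sketch. It assumes that a scaffold key (for example, a scaffold SMILES string) has already been computed for every molecule; the function name `scaffold_split`, the largest-group-first ordering, and the equal division of the remaining 20% between validation and test are illustrative assumptions, not details taken from the pipeline.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that each scaffold lands in exactly one split.

    `scaffolds` holds one precomputed scaffold key per molecule (hypothetical
    input; computing the keys is outside this sketch).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Assumption: place the largest scaffold groups first, so that rare
    # scaffolds tend to end up in the validation and test splits.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole scaffold groups are assigned at once, no scaffold is shared between splits, which is the property the partitioning relies on.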

Graph Convolutional Neural Networks
Traditionally, virtual screening models take fingerprints of the molecules as input representation [14]. One of the most popular fingerprint creation techniques is Extended Connectivity Fingerprint [22], which encodes the existence of specific sub-structures of the molecule into a binary array. In recent years, this technique has been improved with the addition of GCNN models [10]. These deep learning models learn from the data to extract useful feature representations during training, while building on the same concepts of circular fingerprints.
GCNNs have been successfully applied to many tasks in drug discovery such as drug-target interaction prediction [23], physicochemical properties prediction [15], and chemical reaction prediction [24]. Part of this success is owed to the fact that molecules inherently resemble graphs, with nodes representing atoms and edges representing the bonds between the atoms. In this work, the DeepChem [25] implementation of GCNNs with a TensorFlow backend is used. This framework also handles the conversion of molecules to graphs with featurization of atom types, atom bonds, number of hydrogens, and formal charge.

Common Network Architecture
One common model architecture and set of parameters need to be defined so that all models within the pipeline can be transferred to the target task. An architecture that has proven to perform well on the molecular dataset or task can be chosen as the common architecture. Alternatively, hyper-parameter optimization can be done using a grid search over various parameters and using the validation data to find a best-performing model. However, in many cases in virtual screening, the validation data is highly imbalanced and the effects of active molecules on the result of the grid search are diminished. Therefore, we suggest using data augmentation in the form of SMILES Enumeration [26], in order to make more copies of the active molecules and make the validation set balanced. Thus, more importance is put on finding active molecules when searching for optimum hyper-parameters. In this work we use an architecture that has proven to perform well on the PCBA dataset as the common architecture. This architecture is adapted from our previous work [27] which implements the aforementioned data augmentation and hyper-parameter optimization. The details of the common architecture are given in Table A1.

Baseline Model and Internal Validation
After the data is preprocessed and the model is defined, training on the target task can begin. The training set from the target data is used to train the GCNN with random initialization for 30 epochs. Over-training is avoided using internal validation: the model is saved at each epoch during training, and its performance on the training set and the validation set is compared. In a healthy training period, the model's performance improves over time on both sets. However, when over-training occurs, a decline in performance is observed on the validation set. The model from the last epoch before over-training, which is the second epoch, is taken as the baseline model.
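The selection rule used in internal validation can be sketched as follows. This is a minimal illustration that assumes one validation score per epoch (higher is better) and treats the first drop in that score as the onset of over-training; the function name and the example scores are hypothetical and are not measurements from this work.

```python
def last_epoch_before_overtraining(valid_scores):
    """Return the 1-based epoch just before the validation score first drops.

    `valid_scores[0]` is the score after epoch 1. If the score never drops,
    the final epoch is returned.
    """
    for i in range(1, len(valid_scores)):
        if valid_scores[i] < valid_scores[i - 1]:
            # The drop happens at epoch i + 1, so epoch i is the last good one.
            return i
    return len(valid_scores)


# Hypothetical validation ROC-AUC values for four epochs.
example_scores = [0.70, 0.75, 0.72, 0.71]
chosen_epoch = last_epoch_before_overtraining(example_scores)
```

With these illustrative scores the rule picks the second epoch, since the validation score declines from the third epoch onward.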

Transfer Learning and Fine-Tuning
Machine learning has been successfully applied to many fields such as natural language processing [28], speech recognition [29], structured data [30], and arguably most predominantly to the image domain [31]. The current solutions to small and imbalanced data in the image domain include data augmentation [32], multitask learning [33], few-shot learning [32], and general transfer learning [34]. The training procedure of deep learning models can be highly sensitive to the weight initialization. The authors in [35] demonstrated that in most initializations there is a winning set of weights that becomes dominant during training. The stochastic gradient descent procedure will focus on this sub-structure of the network, making the rest of the network susceptible to removal during pruning. This sheds new light on how network initialization affects the training and performance of a deep learning model. Unfortunately, the situation is exacerbated when deep learning models are used as virtual screening models, since they are over-parameterized, data-hungry models that are faced with imbalanced, non-diverse, or small datasets.
The state-of-the-art solution for dealing with the initialization challenge is transfer learning [34]. This process can include the transfer of weights from a model pre-trained on one domain (source) to a model on another domain (target). These weights represent what the source model has learned from the source data and the patterns used to extract features from the data. Transfer learning has been shown to improve performance in many cases, but also to hurt performance in some [36]. Transfer learning in virtual screening has also been sparsely implemented in the forms of multitask learning [15] and unsupervised pre-training [17].
In this work, a separate GCNN model is trained for each source task, making a total of 182 source models. These tasks are of different data size, data diversity, and biological origin. Each model is trained for 30 epochs and the weights are saved and transferred to the target task. The models are then fine-tuned on the target training set for one epoch.

Model Rank Prediction Before Fine-Tuning
After the models are trained, fine-tuned, and evaluated, they can be ranked based on how well they performed on the target test dataset. An interesting question arises: can this ranking be predicted before fine-tuning on the target dataset is initiated? If such a rank prediction method exists, useful recommendations can be made for future tasks and there would be no need to apply large-scale transfer learning from all 182 models. This question is a derivative of domain shift [37] and requires the comparison of the nature of datasets and models from the source and target domains. There are two main approaches in the literature for ranking pre-trained models, either via training an alternative model [38,39] or via statistical methods [36,40,41]. In this work we implement two of the statistical methods and offer a third solution in order to take a step forward in model rank prediction.

Inter-Dataset Similarity Comparison
The intuitive solution for model ranking prediction is to examine the source and target datasets and find the similarity between them [42]. If the model has seen similar data during training on the source dataset, it might have learned useful representations for the target task. This intuition can also be seen in traditional virtual screening models, which are based on the concept that molecules that share common sub-structures (i.e., fingerprints) have similar bio-activity. In this work, this method is adopted from the time-series domain in the form of Inter-Dataset Similarity (IDS) comparison [36] and is implemented using ECFP molecular fingerprints. The Jaccard (Tanimoto) Coefficient is used to find the pair-wise similarity between molecules from the source dataset and molecules from the target dataset. Due to the large number of calculations, the results are averaged over a maximum of 10,000 molecules. A higher Jaccard Index represents more similarity between the datasets.
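The IDS computation can be illustrated with a short sketch. It assumes that each fingerprint is represented as a Python set of on-bit indices and that datasets larger than the cap are subsampled uniformly at random; both the representation and the subsampling scheme are assumptions made here for illustration, as are the function names.

```python
import random


def tanimoto(a, b):
    """Jaccard (Tanimoto) coefficient between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def inter_dataset_similarity(source_fps, target_fps, max_molecules=10_000, seed=0):
    """Mean pair-wise Tanimoto similarity between two fingerprint collections.

    Assumption: collections above `max_molecules` are randomly subsampled
    to keep the number of pair-wise comparisons manageable.
    """
    rng = random.Random(seed)
    if len(source_fps) > max_molecules:
        source_fps = rng.sample(source_fps, max_molecules)
    if len(target_fps) > max_molecules:
        target_fps = rng.sample(target_fps, max_molecules)
    total = sum(tanimoto(s, t) for s in source_fps for t in target_fps)
    return total / (len(source_fps) * len(target_fps))
```

The returned value lies between 0 and 1, and source datasets can then be sorted by it to obtain the IDS-based ranking.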

Mean Silhouette Coefficient on Deep Features
The second solution for ranking prediction is to understand how well the models distinguish between active or inactive target molecules. If the model's inner representations are able to discriminate based on the activity of target molecules, it might be easier to fine-tune the model and thus perform better on the target task. In order to do so, the Mean Silhouette Coefficient (MSC) calculation is adopted from the time-series domain [41] as the ranking prediction metric. This metric is used to evaluate the efficiency of a clustering algorithm based on how distinguished the clusters are from each other. This metric is applied to features extracted from the second to last layer of the model, in order to judge how well these features distinguish between active and inactive molecules from the target validation dataset. Higher MSC represents better discrimination between the clusters.
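To make the metric concrete, the following is a minimal pure-Python sketch of the mean silhouette coefficient over two labelled groups of feature vectors. It assumes Euclidean distance and assigns a score of zero to singleton clusters; neither choice is confirmed by the text, and in practice a library routine would normally be used in place of this quadratic-time loop.

```python
import math


def mean_silhouette(features, labels):
    """Mean silhouette coefficient with Euclidean distance (assumed metric)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(features[i], features[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            scores.append(0.0)  # convention chosen here for singleton clusters
            continue
        a = mean_dist(i, clusters[lab])  # cohesion within the own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Here `features` would be the deep features of the target validation molecules and `labels` their active or inactive class; values close to 1 indicate well-separated classes, while values near or below 0 indicate overlapping ones.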

Zero-Shot Inference
One final approach to ranking prediction, which is proposed in this work, is to simply let the pre-trained model classify the target validation data without any fine-tuning, i.e., zero-shot inference. The intuition behind this approach closely follows that of the last two approaches. First, if the source dataset is similar to that of the target dataset, the model might perform better after fine-tuning. This similarity was tested in regards to the molecules in IDS, but is applied to the bio-activity labels in zero-shot inference. Second, if the model has learned to distinguish between active and inactive molecules of the target dataset, it might have a better performance after fine-tuning. The main difference to MSC is that now the ROC-AUC of predictions on the target validation dataset is used as a metric. This difference forces the knowledge learned within the last layer of the model to also be incorporated into the ranking metric, which was absent from the MSC method.
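A minimal sketch of this ranking step is given below. It assumes that the zero-shot prediction scores of every pre-trained model on the target validation set are already available in a dictionary; the names `roc_auc`, `rank_by_zero_shot`, and the dictionary layout are hypothetical. ROC-AUC is computed through its pair-wise interpretation, i.e., the probability that a randomly chosen active molecule receives a higher score than a randomly chosen inactive one.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pair-wise comparison of active and inactive scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    # Ties between an active and an inactive score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_by_zero_shot(predictions, labels):
    """Order source-model names by zero-shot ROC-AUC, best first.

    `predictions` maps a model name to its scores on the target validation
    set, obtained without any fine-tuning (hypothetical input format).
    """
    aucs = {name: roc_auc(labels, scores) for name, scores in predictions.items()}
    return sorted(aucs, key=aucs.get, reverse=True), aucs
```

The pair-wise formulation is quadratic in the worst case, but with very few active molecules the number of active/inactive pairs stays small.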

Evaluation
Performance of the models is evaluated using three different evaluation metrics: accuracy, recall, and the area under the Receiver Operating Characteristic curve (ROC-AUC). While accuracy is easily interpretable, it is not a good metric for a highly imbalanced dataset. On the other hand, ROC-AUC can demonstrate how well the model performs on both the majority and minority data distributions. Furthermore, the Boltzmann-Enhanced Discrimination of the Receiver Operating Characteristic (BEDROC) is used as a performance metric [43]. This metric is often used in the molecular domain, where datasets are commonly class imbalanced. Recall is used since it reflects how well the model is able to predict active molecules, and misclassifying the few active molecules in the dataset is a costly mistake in the field of drug discovery. For reproducibility purposes, all 182 trained models are provided.
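The reason accuracy is a poor guide on a highly imbalanced dataset can be seen in a small worked example. The confusion-matrix counts below are hypothetical and chosen only for illustration; they are not results from this work.

```python
def accuracy_and_recall(tp, fp, tn, fn):
    """Compute accuracy and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical screen: 4 active molecules among 1000, only 1 of them found.
acc, rec = accuracy_and_recall(tp=1, fp=0, tn=996, fn=3)
```

In this illustration the accuracy is 0.997 even though three of the four active molecules are missed (recall of 0.25), which is why recall and ranking-based metrics are reported alongside accuracy.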
Having acquired the three metrics for ranking models before fine-tuning, they are ranked in three different manners and compared to the ground truth ranking attained from fine-tuning on the target test set. In order to evaluate the ranking prediction, similar to [44], the correlation between the metrics and the improvement in ROC-AUC after transfer learning is calculated. Moreover, the number of accurate predictions between the top 10 models is recorded. Lastly, similar to [41], the Mean Reciprocal Ranking (MRR) is calculated for the predicted ranks, averaging on the top 10 predictions.
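Two of these ranking measures can be sketched as follows. The exact MRR formulation is not spelled out in the text, so the sketch assumes one plausible reading: for each of the top-k predicted models, the reciprocal of its ground-truth rank is taken, and the values are averaged. Function names and inputs are hypothetical.

```python
def top_k_hits(predicted, actual, k=10):
    """Number of models that appear in both the predicted and the true top k."""
    return len(set(predicted[:k]) & set(actual[:k]))


def mean_reciprocal_rank(predicted, actual, k=10):
    """Average reciprocal ground-truth rank of the top-k predicted models.

    Assumed definition; `predicted` and `actual` are lists of model names
    ordered from best to worst, and `predicted` must not be empty.
    """
    true_rank = {name: r for r, name in enumerate(actual, start=1)}
    top = predicted[:k]
    return sum(1.0 / true_rank[name] for name in top) / len(top)
```

Under this reading, a perfect prediction of the top 10 yields 10 hits, and the MRR grows as the recommended models sit closer to the top of the ground-truth ranking.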

Baseline Model Results
The baseline model is trained on the target dataset for 30 epochs. The progress of the model is shown in Figure 2.

Transfer Learning Results
A total of 182 source models from 6 datasets are trained for 30 epochs and then fine-tuned using the target dataset. The change in the ROC-AUC of the model on the target test set is depicted in Figure 3. These results can also be seen in further detail in Appendix B (Figure A1). As is visible from Figure 3, models within the same dataset can either improve or worsen the performance of the target model. The average outcome of these models with regard to the source datasets is shown in Figure 4. The histogram of these results can be found in Appendix B (Figure A3). Figure 4 demonstrates that, on average, models transferred from the Tox21 dataset tend to perform well on the target task (highest average ROC-AUC), while the best and worst performing models originate from the PCBA dataset. The best performing models from each source task are shown in Table 4.
The overall best performing model is the model pre-trained on the PCBA-651635 dataset and fine-tuned on the target dataset. This model also outperforms the state-of-the-art [18], which uses reinforcement learning from a related task to learn the target task. The best model's confusion matrix is compared to that of the baseline model in Figure 5, showing noticeable improvement in correctly predicting bio-active molecules after transfer learning. The ROC curves for these two models can be viewed in Appendix B (Figure A2).

Inter-Dataset Similarity Results
The molecules within each source dataset are compared with those of the target dataset using the Jaccard Index. The results are illustrated in Figure 6, showing that the Tox21 and SIDER datasets are the most dissimilar to the target dataset, while PCBA and MUV have high similarity to the target dataset.

Figure 6. Similarity between each source dataset and the target dataset. Higher Jaccard Index indicates higher degree of similarity between the molecules.

Mean Silhouette Coefficient Results
The target validation dataset is fed to the pre-trained models and deep features are extracted from the second to last layer. The MSC of these features between active and inactive clusters is shown in Figure 7, demonstrating that on average MUV has a higher capability of distinguishing between the target molecules. Moreover, PCBA contains tasks that possess the best and worst MSC scores.

Zero-Shot Inference Results
The target validation set is given to pre-trained models in order to be classified solely based on the knowledge gained from the source data and with no fine-tuning on the target data. The results are shown in Figure 8, demonstrating that MUV, Tox21, and PCBA are able to perform well on average through zero-shot inference.

Model Rank Prediction Results
After the results are acquired from three ranking approaches, the correlations between the results and the improvements in ROC-AUC of the test set are calculated. Furthermore, the number of correct top 10 predictions and their respective MRRs are calculated and reported in Table 5. IDS and MSC provided ranking predictions that were impractical for our target dataset. Zero-shot inference offers an improvement over previous approaches and can recommend two of the top ten models without performing fine-tuning.

Discussion and Results Interpretations
The baseline model for the target task shows clear signs of over-fitting at early stages of training, making it a prime candidate for performance improvement via better initialization. After transfer learning, different initializations deliver varying performance. Looking at three datasets in particular enables better interpretation of the results:
• PCBA: This dataset is one of the closest and most similar (in terms of fingerprint similarity) to the target dataset. It has the highest MSC, indicating that deep features learned from this data source can distinguish between active and inactive molecules. The best performing data source belongs to this dataset. However, it also possesses the tasks that yielded the lowest MSC and the worst performance.
• MUV: This dataset is also very similar to the target dataset. On average the models trained on this dataset delivered the highest MSC. However, on average these models yielded the lowest performance improvement.
• Tox21: This dataset is the most dissimilar to the target dataset. It does not perform well when tested with the MSC measurement. However, the models from this dataset deliver the highest average improvement after transfer learning.
From the cheminformatics point of view, these results demonstrate that molecular similarity and bio-activity clustering are insufficient for predicting performance. In a non-structural, non-target-based virtual screening approach, the molecular data itself plays a very important role, since there is no information regarding the target and its 3D structure. Similarity search and clustering would therefore seem to be a good way to analyze the data and improve performance. However, the first interpretation is that similarity can even be misleading: judging by the training dataset or deep feature discrimination alone is not enough to understand the behavior of the model. A source dataset that is similar to the target dataset can perform poorly, while dissimilar source datasets can deliver high performance on average, which refutes the traditional intuition that the source and target datasets should necessarily be similar. Deep feature discrimination did not prove fully capable of explaining the models' behavior either, since models with the highest MSC could still perform rather poorly on the target dataset. This is aligned with the literature, since ranking prediction is still an unsolved task and the prediction accuracy of these approaches varies between datasets, averaging 7% [41].
The second interpretation is that zero-shot inference reveals more information about the model and the underlying training data, which in turn delivers a better understanding of the model's behavior. The main difference between the proposed approach and the previous ones is that it examines how well the model can predict the bio-activity of the molecules. Zero-shot inference incorporates information about the target labels and the non-linearity of the last layer of the network into the ranking process, thus examining the model from more aspects. This can be seen in the fact that zero-shot inference offered a better understanding of the Tox21 models' behavior and a better perspective for inspecting the model.
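As a rough illustration of zero-shot ranking, the sketch below scores each pre-trained source model by the ROC-AUC of its raw predictions on the target validation set, with no fine-tuning, and flips scores below 0.5 as described in the caption of Figure A7. The model names, labels, and prediction values are hypothetical, and the pairwise ROC-AUC routine is a plain-Python stand-in for whatever library the original pipeline used.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive molecule count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one active and one inactive")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def zero_shot_score(labels, scores):
    """Flip scores below 0.5 so that an inverted but informative model counts."""
    auc = roc_auc(labels, scores)
    return max(auc, 1.0 - auc)


def rank_models(labels, predictions):
    """Sort source models by their zero-shot score, best first."""
    scored = {name: zero_shot_score(labels, s) for name, s in predictions.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical target validation labels (1 = active) and the untuned outputs
# of three pre-trained source models on the same five molecules.
labels = [1, 0, 1, 0, 0]
predictions = {
    "src_a": [0.9, 0.1, 0.8, 0.3, 0.2],
    "src_b": [0.2, 0.8, 0.6, 0.7, 0.1],
    "src_c": [0.5, 0.5, 0.5, 0.5, 0.5],
}

ranking = rank_models(labels, predictions)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Flipping treats a model whose outputs are anti-correlated with the target labels as informative, on the assumption that fine-tuning can invert the final decision; a model near 0.5 carries no usable signal either way.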
It is also worth noting that none of the three approaches implemented in this work could explain why the MUV dataset, which has a high IDS, MSC, and zero-shot inference performance, still delivers the lowest average performance improvement after transfer learning. This demonstrates room for improvement in ranking prediction models for future research. A further limitation of this work is that it views molecules as static inputs and does not consider the flexibility of certain compounds.
The collection of pre-trained source models created in this work can be used as diverse initializations for future virtual screening tasks. The zero-shot inference approach provided in this work offers a better understanding of the pre-trained models and a more practical way to rank them before fine-tuning. Future directions for this research include applying transfer learning across all datasets instead of a single target dataset, investigating learning-based approaches for ranking prediction, and comparing the weight distributions of the pre-trained models to that of the baseline model.

Conclusions
Graph convolutional neural networks have improved the accuracy of virtual screening models, yet they face the challenge of imbalanced, non-diverse, and small training datasets. In this work, the TranScreen pipeline is designed and implemented to alleviate these challenges with the help of diverse weight initializations. Transfer learning is utilized from 182 source models trained on the MoleculeNet database. The models are then fine-tuned on an anticancer prediction task. The results show that some source models can significantly improve the performance of the baseline target model, with the best model achieving 0.92 ROC-AUC and 100% recall. A collection of the pre-trained models is curated and made available to be used as weight initializations for future virtual screening tasks. Moreover, three approaches are implemented to rank and recommend pre-trained models for a given task, which also gave insight into how the models behave with regard to the training data and feature representations. The code and pre-trained models created in this work are made accessible online through www.transilico.com.

Conflicts of Interest:
The authors declare no conflicts of interest.

Cancer and p53
Cancer is a leading cause of death globally, ranking first or second for deaths in ages below 70 in the majority of countries [45]. This predominant contribution to global mortality, in addition to a significant economic burden [46], places cancer research in a place of paramount importance. In essence, the term "cancer" refers to a family of diseases that arise from abnormal cell growth; this abnormal growth occurs as a result of several cellular changes, usually triggered by mutations in the genome. At their root, many mutations and epigenetic changes can be traced to lifestyle and environmental factors, such as the use of tobacco products, alcohol intake, diet, exercise, and exposure to carcinogens and radiation; still other cancers are the result of inherited mutations and infections. The formation of tumors is a multistep process, characterized by several cellular "hallmarks of cancer", including sustained proliferative signaling, evasion of growth suppressors, resistance to cell death, enabled replicative immortality, induction of angiogenesis, and activation of invasion and metastasis; these characteristics are fueled by both genomic instability and inflammation [47]. The body employs several mechanisms to protect against cancer formation, known as "immunosurveillance", while tumors also evolve to avoid detection and clearance, via immune evasion [48]. The threshold between benign and malignant tumors is defined by migration of the tumor cells to a different location in the body, known as metastasis; this transition involves dedifferentiation of the cells into a stem-like migratory phenotype, and is associated with complication of treatment [49]. Traditional cancer treatment strategies involve surgical removal of tumors, radiotherapy, hormone therapy, and chemotherapy, while newer approaches include immunotherapy and targeted therapies. 
Targeted therapy differs from traditional chemotherapy in that it focuses on cancer-specific molecules, rather than acting on general cellular processes [50].
One of the ways to view cancer on a cellular level is as an imbalance between oncogenes, which can promote cancer, and tumor-suppressor genes (TSGs), which work to prevent it [51]. Mutations in TSGs can lead to inhibition of their normal cancer-surveilling activity, allowing tumorigenesis to go unchecked [52]. One of the most important TSGs is TP53, which encodes the p53 protein, sometimes referred to as the "guardian of the genome" [53]. P53 is activated in response to stressors like DNA damage and deregulated growth, which can lead to cancer if unaddressed; such signals activate sensors like ATM/ATR and ARF, respectively, which then activate p53 via phosphorylation [54]. Once activated, p53 acts as a transcription factor, inducing transcription of genes that facilitate DNA damage repair, entrance into senescence (dormancy), or cell-mediated death (apoptosis), removing the potential for tumor formation [55]. P53's large role in tumor prevention can also be a weakness; cells with mutations in p53 are extremely vulnerable to transformation to a cancer state. Mutations in TP53 frequently interfere with p53's DNA-binding activity in its role as a transcription factor [56]. Current approaches to cancer therapy that target p53 focus on restoration of wild-type p53 functionality, removal of mutant p53, and inhibition of downstream pathways of mutant p53 [57]. Loss of normal function allows damaged cells to proliferate and mutate further, contributing to tumor formation and metastasis [58]. Not all biomolecules that enhance carcinogenicity after p53 loss of function have been discovered yet. Consequently, it has been challenging to discover anticancer molecules whose target of interest is unknown. One way to do so is prediction using non-target-based models [27], which is the main approach taken in this work.

Figure A7. Detailed performance of pre-trained models inferring on the target validation set without fine-tuning; results smaller than 0.5 are flipped to be larger than 0.5. (Correlation to improvement: 0.08).