Fine-tuning of BERT Model to Accurately Predict Drug–Target Interactions

The identification of optimal drug candidates is very important in drug discovery. Researchers in biology and computational sciences have sought to use machine learning (ML) to efficiently predict drug–target interactions (DTIs). In recent years, according to the emerging usefulness of pretrained models in natural language process (NLPs), pretrained models are being developed for chemical compounds and target proteins. This study sought to improve DTI predictive models using a Bidirectional Encoder Representations from the Transformers (BERT)-pretrained model, ChemBERTa, for chemical compounds. Pretraining features the use of a simplified molecular-input line-entry system (SMILES). We also employ the pretrained ProBERT for target proteins (pretraining employed the amino acid sequences). The BIOSNAP, DAVIS, and BindingDB databases (DBs) were used (alone or together) for learning. The final model, taught by both ChemBERTa and ProtBert and the integrated DBs, afforded the best DTI predictive performance to date based on the receiver operating characteristic area under the curve (AUC) and precision-recall-AUC values compared with previous models. The performance of the final model was verified using a specific case study on 13 pairs of subtrates and the metabolic enzyme cytochrome P450 (CYP). The final model afforded excellent DTI prediction. As the real-world interactions between drugs and target proteins are expected to exhibit specific patterns, pretraining with ChemBERTa and ProtBert could teach such patterns. Learning the patterns of such interactions would enhance DTI accuracy if learning employs large, well-balanced datasets that cover all relationships between drugs and target proteins.


Introduction
Drug discovery is typically expensive, labor-intensive, and inefficient. To overcome these limitations, there is interest in methods that could increase the efficiency of finding compounds that may biochemically interact with specific targets. As knowledge on drugs, their targets, and their interactions accumulate, various computational methods have been developed to predict possible drug-target interactions (DTIs) to aid in drug discovery. In particular, machine learning (ML) methods have demonstrated an encouraging performance in DTI prediction, which used to take drug and protein data as inputs; the drugs are small molecules with a low molecular weight (≤1000 daltons) and may regulate a biological process. Proteins are mostly receptors that receive and transduce signals that may be integrated into biological systems. The model that receives the input data approaches the DTIs as a classification problem and makes predictions using deep learning models, such as Deep Neural Networks (DNNs) [1], Deep Belief Networks (DBNs) [2], Convolutional Neural Networks (CNNs) [3][4][5], and transformer networks [6].
Recently, pretrained language models were found to be optimal for the completion of many natural language processing (NLP) tasks [7]; Transformer-based models have performed impressively in many applications. Inspired by these successes, various transformerbased representation learning approaches have been introduced into the fields of chemical analysis and drug discovery. ChemBERTa [8] and ProtBert [9] are representative pretrained language models for chemical compounds and proteins, respectively; both were constructed using Bidirectional Encoder Representations from Transformers (BERT). ChemBERTa uses simplified molecular-input line-entry system (SMILES)-formatted data to learn chemical structural information, whereas ProtBert attempts to learn protein representations based on amino acid sequences.
In addition to the optimal choice of a pretrained model for DTI prediction, the dataset used to train the ML model requires attention. Basically, datasets that analyze proteinprotein, drug-protein, or drug-drug interactions all have unique characteristics due to differences in the data. For instance, although BIOSNAP, DAVIS, and Binding DB have been widely used, BIOSNAP is a network database that uses a graph structure to describe not only the affinity between drugs and targets but also the activation and inhibition relationships between them; DAVIS is focused on kinases and their ligands. The interaction between 72 kinase inhibitors and 442 kinases covers >80% of the human catalytic protein kinome; binding BD is dedicated to drug-target interaction. The latter DB thus considers not only binding but also Ki, IC50, and Kd values. As model performance depends on the training dataset, an optimal dataset must be chosen for a DTI prediction model.
As no previous effort has been made to verify the utility of a pretrained model in terms of DTI prediction, or the dataset dependency, the main aims of this study were the development of a DTI prediction model pretrained with ChemBERTa and ProtBERT, which analyzes information on chemical compounds and proteins, respectively, and the effective merging of the datasets to ensure accurate DTI prediction; we then compare our model to previous models.

Dataset Configuration
A number of machine-learning-based DTI prediction methods have been proposed [10]. Various DTI datasets have also been introduced for training and testing, including Drug Bank [11], KEGG [12], BIOSNAP [13], DAVIS [14], BindingDB [15], and others. Here, we used the BIOSNAP, DAVIS, and BindingDB datasets after careful consideration of their characteristics. BIOSNAP is a drug-target interaction dataset that contains information about the genes (i.e., those encoding proteins) that are targeted by drugs that are available on the US market, and DAVIS is a dataset focusing on kinases and their inhibitors, which are important proteins in intracellular signaling. Finally, BindingDB is an open dataset that mainly focuses on the interactions between proteins considered candidate drug targets and small druglike ligands and has a wider scope than the other two datasets. BIOSNAP, DAVIS, and BindingDB were used as datasets for the training and evaluation of our method. The BIOSNAP dataset consists of 4510 drug nodes and 2181 protein nodes. From DrugBank, 13,741 DTI pairs were used. Similarly, DAVIS includes information on 68 drugs and 379 proteins, and BindingDB includes 10,665 drugs and 1413 proteins. DAVIS and BindingDB regard an interaction as positive when the K d value of the drug and protein is less than 30 units. The dataset is structured as shown in Table 1.

Model Configuration
Pretraining language models have been confirmed to be effective in many natural language processing tasks. Among the various pretrained models used in natural language processing, the transformer [6] model learns context and meaning by tracking relationships within sequential data, such as words in a sentence. In addition to natural language, this is used to learn the relationships between tokens in the data that are expressed as text combinations. For example, a model can learn chemical information in the form of sentences (SMILES is one such model).
Among the transformer models, BERT [16] has performed impressively in a wide range of applications. Transformer is a model with an encoder-decoder structure, in which attention is used but not recurrence or convolution. It consists of multiple encoder and decoder layers; one encoder layer is composed of two lower layers (self-attention and feed forward layers), and the self-attention layer learns token dependency in the input sequence with the scaled dot-product attention of multihead attention. The feedforward layer is used to transmit the vector output of all tokens that were obtained from the self-attention layer to the next encoder layer as a vector input of the same position.
BERT has a model structure in which multiple encoder layers are stacked. It is pretrained using a Masked Language Model (MLM) and Next Sentence Prediction (NSP). MLM involves the random masking of several tokens in the input sequence and prediction of the masked token from the contextual information provided by the surrounding tokens. General language models predict masked tokens by analyzing previous tokens, while BERT predicts masked tokens by considering all tokens. Therefore, token dependencies are learned from the surrounding tokens. NSP determines whether two input sentences are continuous and learns the dependency between them.
In BERT or BERT-related transformer settings, the vector from a specially prepared token-[CLS]-contains the information of the input sequence obtained from pretraining; [CLS] is a special classification token, which is added in front of every input sequence to express the aggregated sequence information in the classification and regression problem [16]. As with natural language, transformer-based feature embedding and processing techniques were recently introduced for chemical compounds and amino acid sequences [17]. ChemBERTa [8] is based on a RoBERTa [18] model pretrained with large-scale chemical compound data expressed in the SMILES format. ProtBert [9] was developed by pretraining a BERT model using an amino acid sequence dataset to capture useful protein features in vector form.
Here, we used two transformer-based encoders for chemical compounds and amino acid sequences to solve the DTI prediction problem ( Figure 1). One transformer encodes a SMILES input sequence to represent chemical compound information, and the other captures useful protein-related information in vector form from amino acid sequences. The sequencewise information of chemical compounds and protein is captured as a [CLS] vector of the top layer of each transformer, and these vectors are concatenated for further processing to predict DTIs. Arbitrary pretrained SMILES and protein transformers can be plugged into our framework. In this study, we used ChemBERTa and ProtBert to encode chemical compounds and amino acid sequences, respectively.
Pretrained transformer for drugs. To encode drug information in SMILES format, we used ChemBERTa, which is a pretrained RoBERTa model using PubChem [19] subsets, except for the 10 M subset of [8]. The maximum sequence length was 512 tokens with subword-level tokenization. The vocabulary size was 52 K.
Pretrained transformer for proteins. Protein information was encoded using ProtBert, which is a pretrained BERT model using datasets from UniRef [20] and BFD [21] containing up to 393 billion amino acids. In this model, the sequence length varies from 512 to 2000 with character-level tokenization. However, for computational efficiency, we set the maximum sequence length to 545, which covered 95% of the amino acid sequence length distribution. With our settings, an input amino acid sequence longer than 545 will be truncated. The vocabulary size was 30.
Interaction layer. The [CLS] obtained from the output of the last hidden layer is a vector representing the features of the sequence input to the BERT model. In the natural language model, this is used as a token for sequence classification. In our model, it was used to obtain features for a drug and features for a protein. To calculate the interaction prediction probability from the two [CLS] tokens containing this feature information, we first concatenated the two [CLS] tokens to create a single vector, v. Then, a Fully Connected Layer (FCN) was used to calculate the probability of drug and protein binding from the concatenated vector, v. FCN uses a vector as input and performs a linear operation to reduce the vector's dimensions. The vector value of the dimension reduced through three FCNs is finally expressed as output, v out , between 0 and 1 through the activation functions Relu and Tanh. After that, the interaction is indicated by 0 and 1 based on the threshold.
Loss function. The mean squared error (MSE) was used for the regression loss function. More specifically, the SmoothL1Loss function [22] was used, which has a similar calculation to MSE but is less sensitive to outliers.
Our model configuration. The sequence information of compounds and proteins is captured as a [CLS] vector of the last hidden layer in each pretrained transformer. Each piece of captured sequence information is concatenated and input to the interaction layer, and the DTI prediction value is output.

Implementation Settings
Dataset Settings. Table 2 summarizes the training, validation, and test data from BIOSNAP, DAVIS, and BindingDB. The training and test data proportions followed the settings in [23]. The same data were used for accurate comparison with that paper; thus, the model was trained without processing the data further. In addition to MolTrans settings, in this study, we conducted novel experiments with an integration dataset constructed as described in Figure 2. In our integration dataset settings, all the training and validation DTI pairs from BIOSNAP, DAVIS, and BindingDB were put in one basket, and the pairs were fed to the training pipeline as well as being used for validation. However, individual DTI pairs from each dataset were tested in actual experiments. Model Parameters. Our models were implemented with PyTorch. ChemBERTa and ProtBert in huggingface [7] were used for SMILES and amino acid sequence encoding, respectively. ChemBERTa used six layers and ProtBert used only the lower 18 layers. The maximum length of the sequence input to the encoder was 510 for drugs and 545 for proteins, accounting for 95% of the entire dataset. The dropout rate of the interaction layer was 0.1, and the batch size was 32. The optimizer used Adam, and the learning rate was set to 5 × 10 −6 .
Hardware. A single server with an AMD EPYC 7742 64-Core Processor CPU, 1024 GB RAM, and 4 × NVIDIA A100 Tensor Core GPUs was used for training and testing.

Model Performance Evaluation
A performance evaluation was carried out by comparing the proposed model with MolTrans, a DTI-prediction modality using a transformer-based model. The MolTrans dataset was employed to ensure accurate comparison. To confirm the effect of data augmentation on model performance, the training and validation data of three datasets were integrated and used for training. To check the effect of pretraining on performance, the proposed model was divided into four different submodels using the pretrained weights assigned to ChemBERTa and ProtBert. The models were trained on different datasets and the performances were compared; the optimally trained model is indicated in Table 3. The evaluation metrics were the receiver operating characteristic area under the curve (ROC-AUC) and precision-recall-area under the curve (PR-AUC), and sensitivity was used as the main evaluation index. The predictive result is a continuous output expressed in binary form (using a threshold) that is then applied to the metric. To screen for candidate drug-target pairs in the DTI experiment, even if there were some pairs of negative interactions, a higher proportion of positive interaction pairs was better, so sensitivity is an important metric. sensitivity = true positive true positive + f alse positive Table 4 provides the statistics. ROC indicates sensitivity to 1 specificity. A model with a high ROC-AUC will be less likely to rule out pairs with DTI when the threshold for determining whether pairs have DTI is high. The higher the ROC-AUC, the more likely it is to not rule out candidates for pairs with DTI. The PR-Curve is a graph showing the recall precision (sensitivity). Recall refers to the percentage of true positives. A higher PR-AUC means that there are fewer pairs without DTI among those that were initially judged to have DTI when the threshold used to classify them as having DTI is low. In actual experiments with pairs that are likely to have DTI, the higher the PR-AUC, the less probable it is that a pair without DTI will appear. Model performance was compared by evaluating the predictions using the Concordance Index(CI) [24] and r 2 m coefficient [25] employed to evaluate the performance of the continuous output value.
b i and b j are the predictive values for a large affinity δ i and a small affinity δ j , respectively, and δ i > δ j represents the case where the label of the j-th dataset sample is smaller than that of the i-th sample. j is an index from 0 to i − 1. Z is the normalization constant, and b(x) is expressed as a step function.
r 2 m is a metric used for the validation of regression-based quantitative structure-activity relationship (QSAR) models, proposed by [25]

Additional Experiments: Prediction Dissociation Constant
Additional experiments were conducted to visualize the performance of the model. A drug (D) and a target (T) combine to form a drug-target complex (DT). Drugs and targets are partially bound in vivo. The dissociation constant (K d ) is the product of the D and T concentrations, defined using the following chemical equilibrium, and the K d . pK d is the negative logarithm of K d . The higher the pK d , the higher the drug-target affinity.
The model constructed above was modified to directly predict the pK d value; this yielded the dissociation constant. DAVIS and BindingDB datasets were used. A modified MolTrans (with the last sigmoid function removed) and a modified version of our model (with the last function, tanh, removed) were trained on both datasets. The predictive values were graphed.

Model Evaluation Using an External Dataset
To evaluate the model using an external dataset, 13 drug-protein pairs for which the protein/substrate relationships are well-known were selected based on the FDA guidelines on in vitro drug interactions [26]. No report on the interactions between drugs and targets has appeared; we thus explored drug-target interactions using public data, and we employed 13 drug-protein pairs featuring CYP protein and the drugs used by the FDA to evaluate drug interactions. The drugs and proteins are listed in Table 5. For model evaluation, the binding probabilities were predicted for each model (the MolTrans was trained with BindingDB and our models with BindingDB and the integrated dataset that featured pretrained weights) for all 13 drug-protein pairs. The DAVIS and BIOSNAP MolTrans datasets were excluded given their heterogeneity.

DTI prediction performance.
To determine the effectiveness of the proposed models, we measured the ROC-AUC, PR-AUC, sensitivity, and specificity for models with the datasets described above. We compared the proposed method with another deep-learning DTI prediction model [23]. Table 4 shows the effectiveness of the proposed model and the baseline model.
The ROC-AUC and sensitivity scores of our models were high with all types of datasets. With the DAVIS dataset, there was a performance improvement of about 10% in sensitivity. In terms of thePR-AUC, the best performance was observed with the DAVIS and Bind-ingDB datasets, and a comparable performance was achieved using the BIOSNAP dataset and MolTrans.
For CI and r 2 m , the proposed model achieved the highest performance with all datasets. Unlike the binary output metric of BindingDB, the model trained with the BindingDB dataset afforded the highest performance. This appears to be because the model used was based on ROC-AUC, and it is likely that the results are the same as those of the binary output metrics that can be obtained after hyperparameter tuning. After consideration of all the metrics used for performance comparisons, the FP-Model performed the best in most settings.
The DTI prediction performance using the integrated dataset was superior, particularly with DAVIS and BindingDB data. In these experiments, we simply put all training examples from the three different datasets into a single bucket and performed test predictions with each different dataset. Although the data imbalance could not be completely resolved, model performance was improved by increasing the amount of data that were used for training. With the existing statistical method, balanced data are essential; however, in deep learning models, performance can be improved by data augmentation.

Additional Experimental Visualization of the Prediction Dissociation Constant
To further evaluate the performance of the proposed model (as shown in Figure 3), we compared the predicted and actual binding affinity values for the DAVIS and BindingDB datasets. The models used for comparison were the MolTrans and FP-Model trained on each dataset. MolTrans divided the distribution of the predictions into two sets; the BindingDB dataset yielded prediction outputs that exceeded the actual values. The FP-Model distribution was linear by p = m within the range of actual values, indicating that prediction was better than that of MolTrans.
Effects of pretraining. To verify the effectiveness of pretraining, two different experiments were conducted to examine the learning time effect and data size effect. First, we compared the learning time and ROC-AUC across four different models (NP-Model-BDB, CP-Model-BDB, PP-Model-BDB, and FP-Model-BDB) with a fixed dataset (Bind-ingDB dataset).

•
Learning Time. Figure 4a shows that FP-Model-BDB converged much faster and performed better than NP-Model-BDB which did not undergo any pretraining. The NP-Model-BDB took nine epochs (1170 s) to reach a ROC-AUC of 0.8326, whereas the FP-Model-BDB achieved the same performance after only one epoch. The figure indicates that chemical compound encoding pretraining affects the initial training stage, but protein encoding pretraining had more powerful effects in determining model performance at the final stage. • Data-Size. Figure 4b shows that FP-Model-BDB achieved the same performance with fewer data than the NP-Model-BDB. FP-Model-BDB only needed 60% of the total data, whereas the NP-Model-BDB required 80% of the total training data.
Attention Visualization. Another advantage of using a transformer model is that the visualization of internal causal behavior is more feasible than it is with other black-box-like neural network patterns. In this visualization process, attention weights, which play a role in determining the internal causal behavior of the transformer layer, are used. As the name suggests, the attention weight plays a role in determining how much the token value of the previous layer is reflected in the next layer. To visualize these attention weights, attention rollout [27] was used, and, as shown in Figure 5, an input token that had a high influence on the prediction of the model was identified.
Attention rollout is calculated by recursively performing matrix multiplication on the attention weights of layers in the transformer-based model. Through this process, when token embedding information is transmitted from the input layer to the upper layer, the bias of the weight applied to each token increases. By expressing this biased weight as an attention score for input tokens, you can intuitively check the token that had the highest influence on the output calculation.
Based on the visualized attention weights in Figure 5, it is possible to determine which input token has the greatest influence on prediction decisions with SMILES and amino acid sequence inputs into the model. From a biochemical viewpoint, drugs are extremely interactive with proteins at highly electronegative atoms, such as nitrogen (N) or sulfur (S), and the amino acids that interact with drugs also contain highly electronegative atoms. Figure 5 shows that the attention score learned by the model was consistent with human intuition. Knowing which tokens had a greater impact on DTI can help to speed up the drug discovery phase and researchers could find out which functional groups need to be improved.

Discussion
In a study of deep learning models that are useful in biochemistry, Alphafold and Evoformer [28] were used to determine the protein patterns produced by evolution. Evoformer trains to predict the protein structure by receiving input, such as similar amino acid sequences that appear in the process of evolution and the sequences that are to be predicted. From the above case, it can be inferred that there is an amino acid sequence pattern that exists in nature, and it can be learned through self-supervised learning using a transformer. In drug discovery, a molecule that binds to a protein is found by modifying the molecule through a lead optimization process. This is a pattern recognition process that has been performed by humans in the past, and ChemBERTa attempts to do this using a transformer model.
As shown in Figure 4, protein pretraining yielded the best performance. Proteins that exist in nature exhibit a specific pattern because they have survived the process of evolution, and it is assumed that this result was due to the influence of ProtBert, which had learned this pattern.
We confirmed that the type of dataset used to train the model affected the prediction performance. The BIOSNAP dataset had the greatest number of proteins among the three. In addition, the numbers of proteins and drugs were the most similar in this dataset. Therefore, the best PR-AUC results were obtained with this dataset. The DAVIS dataset is biased toward DTI values for tyrosinase (a protein involved in intracellular signaling) and its inhibitors. As a result, the poorest PR-AUC results were obtained with this dataset. There was a high likelihood of overfitting because only the binding probability for a specific protein and drug was learned. BindingDB contained the greatest number of interaction pairs among the three datasets. However, compared with DAVIS, the number of proteins was insufficient. Therefore, the PR-AUC values obtained with this dataset were lower than those obtained with the DAVIS dataset.
In this experiment, as for the model trained with the DAVIS dataset, two methods were used to prevent overfitting (given the small number of datasets). We confirmed that model performance improved on data augmentation; we integrated the pretrained models. In particular, the pretrained model was trained with large amounts of the SMILES and amino acid sequence datasets; it was thus possible to supplement training (which might have been inadequate using the drug-target pairs dataset alone). We employed the Masked Language Model (MLM) for pretraining. This pretraining differed from DTI prediction. When a model that has undergone pretraining engages in transfer learning, training is faster (fewer epochs) than that of a model trained without pretraining employing the same data; we confirmed that the final performance of the pretrained model was higher than that of the model lacking pretraining (Figure 4a). Thus, a language model pretrained with a rich dataset can increase the performance of a model trained using a small dataset. One of the most important effects of pretraining is generalization. An overfitted model exhibits high performance using a specific dataset but a poor performance on other datasets. After pretraining, we confirmed that our model performed better on datasets other than the reference dataset. Given the generalization effect, our model performed better on BindingDB (which contains the most diverse data) than other models. As a model affording good generalization is more robust to external data, this model more accurately selects the priority of future DTI experiments.
We compared the binding probabilities (using an external validation dataset) between well-understood drug-protein pairs. We found that pretraining and learned dataset expansion improved drug-target interaction predictions and could also find applications in other fields, such as natural language processes. Although the results are not shown, the MolTrans outputs varied by the learned dataset; this was not true of our model (which employed integrated data). Despite the promising results acquired with the external dataset, we used only 13 drug/protein pairs; it is difficult to acquire data that might validate the DB. More external verification data are required.
In terms of the visualization afforded by the additional experiment (Prediction Dissociation Constant), further training of the pretrained model rendered the predictions more continuous, confirming the effect of the above-mentioned pretraining. Our model was pretrained using a large dataset that lacked pK d information. After pretraining, knowledge of SMILES and amino acid sequences is included in the model; we presume that this previously learned knowledge enabled the continuous prediction of even pK d .

Conclusions
When identifying screening hits during drug discovery, this model can determine the drugs that should be prioritized for drug-target interaction experiments. During the lead optimization of a drug, it is possible to predict affinity before synthesizing the drug, although high affinity does not always indicate that a drug is effective. This method can be used to find drugs with similar structures but slightly different affinities. Learning the patterns of various proteins is thought to be important in the screening process for drug discovery. It was also confirmed that the PR-AUC would improve as the number of drugs approached the number of proteins. The PR-AUC plays an important role in filtering out the drug-target pairs that will fail during screening. Therefore, this is an important metric for reducing cost when performing large-scale DTI experiments.