Computational Prediction and Analysis of Associations between Small Molecules and Binding-Associated S-Nitrosylation Sites

Interactions between drugs and proteins occupy a central position during the process of drug discovery and development. Numerous methods have recently been developed for identifying drug–target interactions, but few have been devoted to finding interactions between post-translationally modified proteins and drugs. We presented a machine learning-based method for identifying associations between small molecules and binding-associated S-nitrosylated (SNO-) proteins. Namely, small molecules were encoded by molecular fingerprint, SNO-proteins were encoded by the information entropy-based method, and the random forest was used to train a classifier. Ten-fold and leave-one-out cross validations achieved, respectively, 0.7235 and 0.7490 of the area under a receiver operating characteristic curve. Computational analysis of similarity suggested that SNO-proteins associated with the same drug shared statistically significant similarity, and vice versa. This method and finding are useful to identify drug–SNO associations and further facilitate the discovery and development of SNO-associated drugs.


Introduction
Disease is one of the most serious threats to the safety of lives. In 2006 alone, for example, 1,685,210 new cancer cases and 595,690 cancer deaths took place in the United States, and cancer was rising to become the second-leading cause of death [1]. Although very complicated, the cause for this phenomenon was grouped into two main factors. For one thing, the disease's pathological process is very complex. Environmental, epigenetic, and lifestyle factors jointly determine the development and progress of disease [2]. The cause of the poorly understood Alzheimer's disease, for example, is believed to be a combination of genetics and history of head injuries, depression, or hypertension [3]. For another, the discovery and development of new drugs could not match the requirement of treating the disease at all, because it is time-consuming and labor-intensive. The cost and the time of developing a new drug was conservatively estimated at an average of more than $ 800 million and at an average of 12 to 15 years, respectively [4]. For drug discovery and development, the first important thing is to find drug targets related to diseases.
Although a large number of computational methods have been proposed to predict drug-target interactions in the past decades [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19], few were dedicated to such specific types of proteins as post-translationally modified proteins and membrane proteins. Because most of the identified targets were not associated with specific diseases, this limited the discovery and development of drugs to a certain extent.
S-nitrosylation (SNO) is a type of post-translational modification where a nitric oxide is covalently attached to a cysteine residue in proteins. In addition to regulation of protein structure and functions [20], SNO plays an important role both in normal health and in a wide range of human diseases [21,22]. For example, aberrant protein SNO was implicated especially in neurodegenerative diseases, including Alzheimer's and Parkinson's diseases [23][24][25][26]. Protein SNO was also reported to be associated with some cancers, including lung cancer [27,28]. Accumulating evidence suggested that SNO-protein is a therapeutic target for some degenerative diseases and cancers [27,29]. This provides a new idea to treat these diseases: to interfere with diseases through the specific drugs to mediate relevant SNO proteins. In the paper, we propose a computational method to identify drug-SNO associations.

Experimental Data
Associations between drugs and binding-associated SNO sites were downloaded from the dbPTM database (http://dbPTM.mbc.nctu.edu.tw/), an experimentally validated post-translational modification (PTM) database [30,31]. The PTM sites whose side chains are located within 10 Å of a drug-binding SNO site were viewed as drug binding-associated PTMs [30]. We collected 147 pairs of associations between drugs and binding-associated SNO sites, including 38 such SNO sites located in 36 proteins and 125 drugs. All these 147 pairs of associations were regarded as positive samples. We randomly paired 38 SNO sites and 125 drugs. Any pairs that were not identical to positive samples served as negative ones, where SNO sites were supposed not to be associated with the corresponding drug. We yielded 478 negative samples, of which 147, along with the 147 positive samples, composed the training set (see Supplementary Material S1). 36 SNO-protein sequences were retrieved from the Uniprot database, a comprehensive repository of protein sequences and function annotation [32][33][34][35][36]. SNO-protein sequences were cut into peptides with 21 amino acid residues and centering the SNO-modified cysteine. If peptides with the SNO-modified cysteine were of less than 21 residues, we added one "X" character or more to reach it.

Method
As shown in Figure 1, the proposed approach consisted mainly of three steps: encoding, training, and predicting. Peptides and small molecules of drugs were encoded into features by information entropy and molecular fingerprints, respectively. Then, the random forest algorithm [37] was used to train the classifier over the training set. For unknown drug-SNO associations, the trained classifier predicted and outputted the answers "yes" or "no".

Encoding Small Molecules
Molecular fingerprinting is an important concept in the area of chemoinformatics and bioinformatics, which has been initiated for chemical substructure search [38], and later, widely used for molecular similarity comparison [39], clustering [40], and so forth. Fingerprinting of small molecules is generally a binary string representation. There are generally two main categories: structure-based and hash-based fingerprints. For the former, each bit represents the presence or absence of a specific substructure; namely, 1 standing for presence and 0 for absence, as illustrated in Figure 2A. Different structure-based fingerprint systems depict different aspects of molecular substructure. Electrotopological state (E-state) fingerprints represent the presence/absence of the 79 E-state substructures defined by Kier and Hall in a molecule [41]. The pubchem fingerprint is an 881-bit binary string, covering a wide range of different substructures and features [42,43]. The MACCS fingerprint [44] depicts 166 key substructures including ISOTOPE, ON(C)C, C$=C($A)$A and C=C(C)C (https://list.indiana.edu/sympa/arc/chminf-l/2007-11/msg00058.html), commonly involving most key chemical features for drug discovery and virtual screening [45]. The chemistry development kit (CDK) substructure fingerprint represents information on the presence of 307 substructures. The CDK extended fingerprint [46] extends the CDK fingerprint with additional bits describing ring features. The CDK graph-only fingerprint is a specialized replacement of the CDK fingerprint which does not consider bond orders [46], and Klekota-Roth fingerprints represent the presence/absence of 4860 substructures. The Daylight fingerprint [45] is a type of hash-based fingerprint, which analyzed the molecular fragments of all paths of up to a user-defined number of bonds, and then hashed every one of these paths, as illustrated in Figure 2B. The CDK Daylight fingerprint hashed each atom type, all augmented atoms, and all paths of 2-7 atoms into a 1024-bit binary string. The CDK hybridization fingerprint is a combination of different fingerprints. All fingerprints mentioned above were calculated by the CDK software [47,48] in the webserver ChemDes, a platform providing multiple software in an online manner for computing molecular descriptors and fingerprints [49]. Table 1 listed these fingerprints and the numbers of bits.

Encoding Protein Peptides
In the field of bioinformatics, there are numerous ways of encoding biological sequences [50][51][52][53][54][55][56] such as the widely used amino acid composition [52], feature vector [53], pseudo amino acid composition [54], and amino acidphysicochemical properties [55], etc. We used the information entropy to encode each peptide. The information entropy of amino acid (IEA) was computed by: where λ represents one of 21 kinds of amino acid and P i (λ) denotes the probability of occurring at the ith position for the amino acid λ. Equation (1) measures the information on the distribution of position for the specific amino acid. Similarly, the information entropy of the position (IEP)was denoted by: where Ω denotes the set of 21 amino acids. The IEP measures information on the probability of an amino acid at a specific position. All these 38 protein peptides in the training set were used to estimate the probabilities both of positions for a specific amino acid and of an amino acid at a specific position. For each peptide, the information entropy over all the peptides minus that over the whole set subtracting this peptide was used to encode it. Because all the residues at the 11th position were always cysteine, its information entropy was 0 and was removed. Finally, we obtained 41 features to depict the peptides, each feature r i,j normalized by: The 41 features represent 21 and 20 information entropies of amino acids and of position, respectively.

Random Forest
Random forest, which combines bagging and random selection of features, is a type of ensemble learning algorithm [37]. The random forest was described by three key steps: (1) sample with replacement n training examples; (2) randomly select k features; (3) construct a decision tree using n training examples with k features. Repeating the above three steps m times generated m decision trees. For a classification task, the output of testing sample x is majority vote of m decision trees. Due to advantages such as cheap computation time, ability to deal with high-dimensional data, and better performance, the random forest has been widely applied to classification and regression [57][58][59][60].

Cross Validation and Metrics
We used ten-fold and leave-one-out cross validations to test the proposed method. In the ten-fold cross validation, the training set was randomly grouped into 10 parts of equal or almost-equal size. Each part was in turn taken as testing samples by the trained classifier over the other nine parts.
This procedure was repeated ten times. The leave-one-out test is an extreme case of the cross-validation test, where each sample is viewed as an independent part. We used the receiver operating characteristic curve (also known as the ROC curve) to depict experimental performances. The ROC curve could be drawn by plotting the true positive rate against the false positive rate at various thresholds. The area under the ROC curve (AUC) was used to quantitatively assess the experimental performance. The AUC ranges from 0 to 1, with 1 and 0.5 representing the best and uninformative performances, respectively.

Computational Environment
All computations were performed on a Microsoft Windows operating system with a 64-bit version (Windows 10) on an Acer personal computer with an Intel ® Core™ i5-3210M CPU and 6.0 G RAM.

Results and Discussion
The ROC curves of the ten-fold cross validation for nine types of fingerprints were drawn in Figure 3. The MACCS fingerprints performed best, followed by substructures, and then by PubChem fingerprints. Their AUCs were 0.7312, 0.6943, and 0.6508, respectively. The combination of nine fingerprints and the hybridization fingerprint performed worst and second-worst, with their AUCs of 0.3012 and 0.4767, respectively. We repeated the ten-fold cross validation ten times. The mean value is 0.7235. Supposing that AUC was the normal distribution with unknown variance values, the 95% confidence interval of the mean AUC was estimated at [0.7160, 0.7311]. To answer the question of if the random forest is a better algorithm for predicting drug-SNO associations, we further compared it with state-of-the-art learning algorithms: C4.5 [61], the naive Bayes classifier [62], the radial basis function network [63], and Bagging [64]. C4.5 is a type of decision tree algorithm [61]. In 2008, C4.5 ranked first in the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining [65]. The naive Bayes classifier [62] is a Bayes' theorem-based statistical learning model, where variables are supposed to be independent of each other and which makes decisions commonly by computing post-probabilities of samples, given specific classes. The radial basis function network [63] is a specific artificial neural network with radial basis functions as the specific activation function. Bagging [64] is an ensemble learning algorithm, which generally makes decisions by voting over some constituent subclassifiers. Due to excellent performance, these four algorithms have widely been applied to such fields as function approximation and classification. Here, they were used as baseline algorithms for comparison. As shown in Figure 4, the random forest performed best in the leave-one-out cross validation, being 0.1 more than Bagging in terms of AUC.

Computational Analysis of Associations between Drugs and SNO-Proteins
The common hypothesis in the field of molecular biology is the guide by association (GBA) principle [66]. In order to test whether this hypothesis was applicable to associations between drugs and SNO-proteins, we computed the drug-drug and the protein-protein similarities and compared them. Sequence similarity is a basic concept in the field of molecular biology, on which many hypotheses are based. For example, similar sequences are assumed to possess similar structures, which are in turn assumed to have similar functions. We used sequence similarity to measure protein-protein similarity. Each protein sequence was aligned against the whole 38 protein peptides by the PSI-BLAST program [67]. The matrix (P i,j ) 38 × 38 denoted the alignment score. Each column was normalized by: , i = 1, 2, · · · , 38, j = 1, 2, · · · , 38 To keep the similarity matrix symmetrical, let: Drug-drug similarity was computed by the Tanimoto coefficient [68], namely: where |A| denotes the number of shared fingerprints, and d 1 and d 2 represent fingerprints of the drugs D 1 and D 2 , respectively. To keep drug-drug similarity sensible, Tc of less than 0.6 was set to 0. To test the hypothesis that similar proteins would be associated with similar drugs, we computed similarity PPS d among proteins {p 1 , p 2 , · · · p k } associated with the same drug by: The similarity PPS d c among other proteins not being associated with the drug d was used as the control. We used a permutation test to examine the above hypothesis. The p-value was 0.0020, statistically suggesting that similar SNO-proteins would be associated with the same drugs. Figure 5A demonstrated such a case: two samples P15121-304 and P15121-299 in the protein aldose reductase shared most amino acid peptides, and they were associated with the same drugs: DB02383, DB07030, DB07093, DB07139, DB08084, DB08449, and DB08772 (DBXXXXX is the identifier of a drug in the Drugbank database, which is a unique bioinformatics and cheminformatics resource [69,70]). Similarly, to test hypothesis that similar drugs would be associated with similar SNO-proteins, we computed similarity DDS p among drugs {d 1 , d 2 , · · · d k } being associated with the same protein p by: The similarity DDS p c among other drugs not being linked to the SNO-protein p was used as the control. The p-value in the permutation test is 0.0077, statistically implying that similar drugs would be associated with the same SNO-proteins. As shown in Figure 5B, both drugs phosphoaminophosphonic acid guanylate ester (DB02082) and guanosine-5 -diphosphate (DB04315), which were associated with the same SNO-protein, P63000-178, were very similar in two-dimensional (2D) structure.

Large-Scale Prediction of Unknown Associations of Drugs and SNO-Proteins
In total, 147 of the randomly yielded negative samples in the section "Experimental data" served as negative training samples, and the remaining 331 were used as the testing samples (see Supplementary Materials S2). Of the 331 random associations, 213 were predicted to be negative by the trained classifier over the training set; its accuracy being viewed as 213/331 = 0.6435. Of 118 predicted positive samples, 12 were with the output of probability of 1, implying that they were potential positive samples. We intended to look for evidence to support the prediction by similarity analysis. Next, we investigated SNO-protein targets of small molecules similar to the studied small molecules (i.e., drugs in the column in Table 2). All the small molecules, except DB07905, have similar drugs which all respectively shared the same SNO-protein as the analogue, as shown in Table 2. For example, the association between the small molecule 3-[[N-[4-methyl-piperazinyl]carbonyl]-phenylalaninyl-amino]-5-phenyl-pentane-1-sulfonic acid benzyloxy-amide (DB04427) and the tyrosine-protein phosphatase nonreceptor type 1 (P18031) protein was predicted. The small molecule DB04427 was similar to these small molecules: DB06887, DB07719, DB08003, DB08549, DB08591, DB08593, and DB02827, which all targeted the SNO-protein P18031. Thus, it seemed a rational to identify the association of DB04427 with P18031. The protein aldose reductase (P15121) catalyzes the NADPH-dependent reduction of a wide variety of carbonyl-containing compounds to their corresponding alcohols with a broad range of catalytic efficiencies [32,33]. The small molecule 7N-methyl-8-hydroguanosine-5 -diphosphate (DB01960) belongs to a type of organic compounds known as purine ribonucleoside diphosphates. The fact that the small molecule DB01960 was similar to the molecules DB02338 and DB03461, which target the SNO-protein P15121, supported predicted DB01960-P15121 associations to a certain extent. The SNO-protein cathepsin K (P43235) was closely involved in osteoclastic bone resorption and may participate partially in the disorder of bone remodeling. The small molecule L-citrulline (DB00155) was predicted to bind the SNO-protein cathepsin K. L-citrulline shares structural similarity with the small molecules DB04276, DB04523, and DB07592, which were associated with P43235. Thus, it seemed a natural inference to predict the DB00155-P43235 association. We indirectly explained the rationality of the above predicted SNO-drug associations in terms of molecular similarity. Due to limitations of conditions, we did not conduct wet experiments to validate these associations, which will be left for experimental biologists to validate in the coming future.

Discussion
For drug discovery and development, it is one of most important things to find protein targets of drugs. Although a large number of approaches have been proposed to detect new targets of drugs in the past decades [5][6][7]10,11,14,[71][72][73][74][75][76][77][78], few have dealt with specific proteins such as post-translationally modified proteins which play key regulating roles in cellular activities [79,80]. Protein SNO is involved in the pathological progression of some diseases, especially neurodegenerative diseases. Therefore, identifying associations between SNO-proteins and small molecules is helpful, especially to develop and discover new drugs treating SNO-mediated diseases. For the first time, we have developed a computational method to predict associations between SNO-proteins and small molecules. This method achieved the expected performances on the experimental dataset. We compared the contributions of various fingerprints of molecules to the recognition of associations between drugs and SNO-proteins. Of all the compared fingerprints, the MACCS obtained the best performances (Figure 3), while the combined fingerprints and the hybridization performed worst and second-worst, respectively ( Figure 3). Different fingerprints contributed differently, suggesting specificity concerning protein-binding structure. The reason why MACCS fingerprints perform better than the others is unknown, but it is clear that there would be a certain key substructure associated closely with SNO-proteins. Using this fingerprint would promote the performance of predicting associations between drugs and SNO-proteins. Flooding of the informative fingerprints by a large number of irrelative fingerprints would explain why the combination of all fingerprints performed the worst.
In the areas of bioinformatics, there is a common hypothesis called the GBA principle [66]. The computational analysis of drug-drug similarities and protein-protein similarities supported this hypothesis. This provides an idea of finding an SNO-protein-associated drug or a drug-associated SNO-protein from its analogues. However, there are some exceptions observed in the training set. For example, the compounds phosphoaminophosphonic acid guanylate ester (DB02082) and 7N-methyl-8-hydroguanosine-5 -diphosphate (DB01960) have a similarity value of 0.8358, but they were associated with different SNO-proteins. Namely, DB01960 is associated with the proteins eukaryotic translation initiation factor 4E type 3 (Q9DBB5), and DB02082 with Ras-related C3 botulinum toxin substrate 1 (P63000). The binding of the drug to the target might be subject to key substructures. Although highly similar, the small molecules DB01960 and DB02082 might lack common key substructures to bind proteins.

Conclusions
SNO-proteins are involved in the pathological processes of some diseases. There is no doubt that the identification of drug-SNO associations is helpful to discover and develop new drugs to treat SNO-mediated disease. We, for the first time, present a machine learning-based computational method to predict associations between drugs and SNO-proteins. The method is simple to implement but quite efficient. We statistically showed that SNO-proteins associated with the same drug would share high similarity, and vice versa. The finding would be useful in detecting drug-associated SNO-proteins and SNO-protein-associated drugs by the similarity search.