Identification of Essential Proteins Based on Improved HITS Algorithm

Essential proteins are critical to the development and survival of cells. Identifying and analyzing essential proteins is vital for understanding the molecular mechanisms of living cells and for designing new drugs. With the development of high-throughput technologies, a large amount of protein–protein interaction (PPI) data has become available, which facilitates the study of essential proteins at the network level. Although various computational methods have been proposed, the prediction precision still needs to be improved. In this paper, we propose a novel method, named HSEP, that applies Hyperlink-Induced Topic Search (HITS) to weighted PPI networks to detect essential proteins. First, the original undirected PPI network is transformed into a bidirectional PPI network. Then, both biological information and network topological characteristics are used to weight the PPI network. The biological information includes gene expression data, Gene Ontology (GO) annotation and subcellular localization. The edge clustering coefficient serves as the network topological characteristic and measures the closeness of two connected nodes. We conducted experiments on two species, namely Saccharomyces cerevisiae and Drosophila melanogaster, and the experimental results show that HSEP outperformed some state-of-the-art essential protein detection techniques.


Introduction
It is well known that proteins are important for living organisms and are the main components of cellular physiological metabolic pathways. Proteins are involved in various biological processes and carry out almost all cellular functions by interacting with other proteins or DNA. With the development of proteomics in the post-genomic era, several protein-related topics have become the subject of many studies, including the discovery of protein structures and functions and the identification of essential proteins, protein complexes and functional modules. Notably, removing even one essential protein will cause fatal defects in the organism [1]. In addition, recent studies have shown that essential proteins are related to human disease genes and play significant roles in predicting drug targets [2,3]. Therefore, it is important to identify essential proteins, which will help us to understand the minimum requirements of cell life and find new ways to treat diseases.
To date, much work has been done on predicting essential proteins, using both biological experiment-based methods and network-based discovery methods. Although traditional experimental methods, such as gene knockouts [4], RNA interference [5] and conditional knockouts [6], can provide an accurate prediction of essential proteins, they are time-consuming and expensive. With the development of high-throughput technologies, such as the yeast two-hybrid system [7], mass spectrometry analysis [8] and tandem affinity purification [9], various protein-protein interaction (PPI) data are available. To break

Hypertext Induced Topic Search Algorithm
The Hypertext Induced Topic Search (HITS) algorithm is an iterative algorithm that was originally proposed to analyze the importance of web pages. HITS is a query-dependent algorithm that ranks a web page by processing all of its in-links and out-links. In the HITS algorithm, each page is given two attributes: the hub and the authority. They are defined as follows:
Definition 1. Authority. A high-quality authority page is pointed to by many high-quality hub pages. The authority value of a page is the sum of the hub values of all the pages that point to it.
Definition 2. Hub. A high-quality hub page points to many high-quality authority pages. The hub value of a page is the sum of the authority values of all the pages it points to.
An example of calculating the hub and authority values is shown in Figure 1. Let a(p) and h(p) represent the authority and hub scores of page p, respectively. B(p) and F(p) denote the set of referrer pages (pages that point to p) and the set of reference pages (pages that p points to), respectively. The HITS algorithm can be divided into the following steps:
(1) Compute a(p) and h(p) in a mutually reinforcing way as follows:
$$a(p) = \sum_{q \in B(p)} h(q), \qquad h(p) = \sum_{q \in F(p)} a(q)$$
(2) Divide the authority of every web page by the highest authority to normalize it:
$$a(p) = \frac{a(p)}{\max_{q} a(q)}$$
Divide the hub of every web page by the highest hub to normalize it:
$$h(p) = \frac{h(p)}{\max_{q} h(q)}$$
(3) Repeat Steps 1 and 2 until the difference between the weights in the previous iteration and the weights in the current iteration is less than the set threshold; the system has then entered a stable state, and a(p) and h(p) have converged.
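The iteration described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `hits`, the dictionary-based graph representation, and the defaults `max_iter` and `tol` are assumptions made for the example, and the hub update uses the freshly updated authority values, which is one common ordering.

```python
def hits(out_links, max_iter=100, tol=1e-8):
    """Max-normalized HITS on a directed graph {page: set of pages it points to}.

    Names and defaults are illustrative assumptions, not taken from the paper.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages |= set(targets)
    if not pages:
        return {}, {}

    # F(p): pages that p points to; B(p): pages that point to p.
    F = {p: set(out_links.get(p, ())) for p in pages}
    B = {p: set() for p in pages}
    for p, targets in F.items():
        for q in targets:
            B[q].add(p)

    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        # Step 1: mutually reinforcing update.
        new_a = {p: sum(h[q] for q in B[p]) for p in pages}
        new_h = {p: sum(new_a[q] for q in F[p]) for p in pages}
        # Step 2: divide by the largest value (guarding against an all-zero vector).
        max_a = max(new_a.values()) or 1.0
        max_h = max(new_h.values()) or 1.0
        new_a = {p: v / max_a for p, v in new_a.items()}
        new_h = {p: v / max_h for p, v in new_h.items()}
        # Step 3: stop once the scores no longer change noticeably.
        delta = max(max(abs(new_a[p] - a[p]), abs(new_h[p] - h[p])) for p in pages)
        a, h = new_a, new_h
        if delta < tol:
            break
    return a, h
```

For a toy graph in which pages A and B both point to page C, `hits({"A": {"C"}, "B": {"C"}})` assigns all of the authority to C and equal hub scores to A and B.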

Constructing Weighted Protein-Protein Interaction Network
A protein-protein interaction network can usually be expressed as an undirected graph G = (V, E), where the set of vertices V represents proteins and E represents the interactions between pairs of proteins. Departing from this traditional view, we assume that protein interactions are mutual and convert the undirected PPI network G = (V, E) into an equivalent bidirectional network G' = (V, E'). It is worth noting that the transformation from an undirected graph to a directed graph is a mathematical process, which is not applicable to all biological networks, such as kinase networks. As there are many false positives and false negatives in high-throughput PPI networks, the prediction accuracy will be affected. To address this problem, we use biological information and network topological features to weight the edges separately. Following the HITS algorithm, we assume that nodes with high-quality biological information will be pointed to by high-quality topological nodes, and that high-quality topological nodes will point to nodes with high-quality biological information. Figure 2 shows an example of the weighted PPI network construction.
Network topology weighted edge. The Edge Clustering Coefficient (ECC) is commonly used to evaluate the closeness of two connected proteins. ECC(u, v) can be defined as follows [28]:
$$ECC(u, v) = \frac{|N_u \cap N_v|}{\min(d_u - 1, d_v - 1)}$$
where N_u and N_v denote the sets of all neighbors of proteins u and v, respectively, and d_u and d_v denote the degrees of proteins u and v, respectively. The weight from node u to node v is the topological feature ECC.
Biological information weighted edge. Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. Here, we utilize the Pearson Correlation Coefficient (PCC), derived from gene expression data, to measure how strongly two interacting proteins are related. For gene expression profiles g(u, i) = {g(u, 1), g(u, 2), ..., g(u, T)} of protein u and g(v, i) = {g(v, 1), g(v, 2), ..., g(v, T)} of protein v, the PCC is defined as follows:
$$PCC(u, v) = \frac{\sum_{i=1}^{T} \left(g(u, i) - \bar{g}(u)\right)\left(g(v, i) - \bar{g}(v)\right)}{\sqrt{\sum_{i=1}^{T} \left(g(u, i) - \bar{g}(u)\right)^2}\,\sqrt{\sum_{i=1}^{T} \left(g(v, i) - \bar{g}(v)\right)^2}}$$
where \bar{g}(u) and \bar{g}(v) represent the average gene expression values of profiles u and v, respectively.
Next, we consider protein functional similarity: if two interacting proteins share common GO annotations, they are likely to have the same function, and the interaction between them is considered stronger. GO [29] is widely used to represent genes and gene products across different species. To evaluate the semantic similarity between the GO terms annotating the proteins in a PPI network, we adopt the method introduced by Wang et al. [30]; the higher the value, the stronger the interaction between the proteins:
$$GO\_sim(u, v) = \frac{\sum_{t \in T_u \cap T_v} \left(S_u(t) + S_v(t)\right)}{\sum_{t \in T_u} S_u(t) + \sum_{t \in T_v} S_v(t)}$$
where T_u and T_v are the annotations of proteins u and v, respectively; S_u(t) is the S-value of GO term t related to u; and S_v(t) is the S-value of GO term t related to v.
For most eukaryotes, subcellular compartments produce specific environments that regulate protein biological processes within cells. Subcellular location is divided into 11 different compartments: cytoskeleton, Golgi apparatus, cytosol, endosome, mitochondrion, plasma membrane, nucleus, extracellular space, vacuole, endoplasmic reticulum, and peroxisome. Some studies have shown that, if the two proteins joined by an interaction edge are located in the same compartment, the interaction between them is more reliable [31]. Therefore, we define SL(u, v) as follows to evaluate connected proteins using subcellular location information:
$$SL(u, v) = \frac{C(u, v)}{C_{max}}$$
where C(u, v) denotes the number of times that edge (u, v) appears in subcellular locations, and C_max denotes the maximum number of times that any edge appears in subcellular locations.
The weight from node v to node u is the combination of the biological information, namely PCC, GO_sim and SL, which is defined as follows:
$$W(v, u) = PCC(v, u) + GO\_sim(v, u) + SL(v, u)$$
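To make the edge-weighting scheme concrete, the following Python sketch computes ECC and PCC and assembles the two directed weights for every interaction. It rests on several assumptions that are not stated in the text: ECC is taken as the number of shared neighbors divided by min(d_u - 1, d_v - 1) and is set to 0 when that denominator is 0; PCC is set to 0 for a constant expression profile; the GO similarity and subcellular localization scores are passed in as precomputed dictionaries (`go_sim`, `sl`) keyed by unordered protein pairs; and the orientation of each pair is fixed by sorting the protein names. All function and parameter names are hypothetical.

```python
import math


def ecc(adj, u, v):
    """Edge clustering coefficient: shared neighbours / min(d_u - 1, d_v - 1)."""
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    if denom <= 0:  # assumption: an endpoint of degree 1 gives ECC = 0
        return 0.0
    return len(adj[u] & adj[v]) / denom


def pcc(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:  # assumption: a constant profile gives PCC = 0
        return 0.0
    return cov / (sx * sy)


def weight_edges(adj, expr, go_sim, sl):
    """Return {(source, target): weight} for both directions of each interaction.

    adj    : {protein: set of neighbours} for the undirected PPI network
    expr   : {protein: list of expression values}
    go_sim : {frozenset({u, v}): GO similarity}, assumed precomputed
    sl     : {frozenset({u, v}): subcellular localization score}, assumed precomputed
    """
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once; orientation is an assumption
                pair = frozenset((u, v))
                weights[(u, v)] = ecc(adj, u, v)  # topological direction
                weights[(v, u)] = (pcc(expr[v], expr[u])
                                   + go_sim.get(pair, 0.0)
                                   + sl.get(pair, 0.0))  # biological direction
    return weights
```

Keeping the two kinds of evidence on opposite directions of the same interaction is what lets a HITS-style iteration separate them into hub and authority scores.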

Identifying Essential Proteins Based on HSEP Algorithm
Our proposed algorithm HSEP applies the HITS algorithm to the weighted PPI networks constructed in Section 2.2. By iterating the HITS algorithm on the weighted networks, we obtain for each protein an authority value that represents its biological information and a hub value that represents its topological features. To comprehensively evaluate the importance of each protein, we combine the authority value and the hub value into a final score, which is defined as follows:
$$HSEP(v) = \alpha \cdot a(v) + (1 - \alpha) \cdot h(v)$$
where α ∈ [0, 1] is used to adjust the proportion of these two scores. If α is equal to 0, the ranking score depends only on the topological information. If α is equal to 1, the ranking score depends only on the biological information. If α is between 0 and 1, the ranking score is computed from both the biological information and the topological features. Given the definition of HSEP(v), we expect its performance to vary with the value of α.
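As a small illustration of how α trades off the two scores, the sketch below assumes that the final score is the convex combination α · authority + (1 − α) · hub, which is consistent with the statement that α = 0 leaves only the topological (hub) term. The function names `hsep_score` and `rank_proteins` and the toy values are hypothetical.

```python
def hsep_score(authority, hub, alpha):
    """Combine authority (biological) and hub (topological) values per protein."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {v: alpha * authority[v] + (1.0 - alpha) * hub[v] for v in authority}


def rank_proteins(scores):
    """Proteins sorted by descending score; ties are broken by name."""
    return sorted(scores, key=lambda v: (-scores[v], v))


# Toy values chosen only to show that the ranking can flip with alpha.
authority = {"P1": 1.0, "P2": 0.2}
hub = {"P1": 0.1, "P2": 1.0}
ranking_topological = rank_proteins(hsep_score(authority, hub, 0.0))
ranking_biological = rank_proteins(hsep_score(authority, hub, 1.0))
```

With these toy values the topological-only ranking places P2 first and the biological-only ranking places P1 first, which shows why the choice of α matters.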
To facilitate the application of HSEP to different organisms and to reduce the burden of selecting the parameter α, we adopt an ensemble method introduced by Zhang et al. [32]. For each α_i ∈ [0, 1] (i = 1, 2, ..., k), we obtain a score HSEP_i(v) for each protein v and its corresponding rank. The k values of α therefore yield k rankings of the proteins. From each ranking HSEP_i, we select the top n ranked proteins, denoted as X_i, and combine them into the total candidate set X. Then, we use the ensemble method with a majority voting strategy to predict essential proteins from X. Let EM(v) denote the number of sets X_i in which protein v appears. If EM(v) is greater than the threshold T (k/2), then protein v is considered to be an essential protein. EM is defined as follows:
$$EM(v) = \sum_{i=1}^{k} I(v \in X_i)$$
where I(·) equals 1 if protein v belongs to X_i and 0 otherwise.
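The majority-voting step can be sketched as follows. The sketch assumes that each of the k rankings is available as a list of protein identifiers sorted from most to least important and that EM(v) simply counts the top-n lists containing v; the function name `ensemble_vote` and its parameters are illustrative.

```python
def ensemble_vote(rankings, n, threshold):
    """Select proteins whose vote count EM exceeds the threshold.

    rankings  : list of k ranked protein lists, one per value of alpha
    n         : number of top-ranked proteins taken from each ranking
    threshold : vote threshold T; a protein is kept if EM > T
    """
    em = {}
    for ranking in rankings:
        for protein in ranking[:n]:  # the candidate set X_i for this alpha
            em[protein] = em.get(protein, 0) + 1
    selected = {protein for protein, votes in em.items() if votes > threshold}
    return selected, em


# Three toy rankings (k = 3), top n = 2, threshold k / 2 = 1.5.
rankings = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
selected, em = ensemble_vote(rankings, n=2, threshold=len(rankings) / 2)
```

In this toy example A receives three votes, B two and C one, so A and B pass the threshold of 1.5 while C does not.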

Pseudocode of HSEP
The pseudocode of the HSEP algorithm is divided into two steps, as shown in Algorithm 1. The first step weights the PPI network using gene expression data, GO annotation, subcellular localization data, and the edge clustering coefficient as the topological feature. The second step applies the HITS algorithm to the weighted PPI network.

Algorithm 1 HSEP essential proteins identification.
Require: A PPI network G = (V, E), gene expression data, subcellular location data, Gene Ontology (GO) annotations.
Ensure: Essential protein set.
Step 1
Convert G to bidirectional digraph G' = (V, E')
for each interacting protein pair (a, b) in PPI do
    Calculate ECC /* the closeness of the two nodes */
    Calculate PCC /* the importance of the two nodes based on gene expression */
    Calculate GO_sim /* the functional similarity of the two nodes based on GO annotation */
    Calculate SL /* the reliability of the two nodes based on subcellular localization */
end for
for each interacting protein pair (a, b) in G' do
    edge(a, b) = ECC(a, b)
    edge(b, a) = PCC(b, a) + GO_sim(b, a) + SL(b, a)
end for
Step 2
for m in [1, maxiter] do
    for each node v in V do

Results and Discussion
To verify whether our proposed method HSEP is effective for identifying essential proteins, we performed experiments on Saccharomyces cerevisiae and Drosophila melanogaster data and analyzed the influence of the parameter α on the experimental results. To demonstrate the performance of HSEP, we compared it with a number of existing methods, including DC, EC, IC, SC, NC, LAC, WDC, PeC and UDoNC. To further evaluate the performance of HSEP, we also compared it with the other methods using several statistical measures. In addition, precision-recall curves were used to analyze the influence of different values of α on the experimental results. Finally, we analyzed the essential proteins identified by HSEP to further assess the method.

Experimental Data
To demonstrate the effectiveness of our proposed method, we performed experiments on two species: Saccharomyces cerevisiae and Drosophila melanogaster. Saccharomyces cerevisiae data are currently widely used for studying essential proteins. We used two Saccharomyces cerevisiae PPI networks, taken from the DIP database [33] and the Gavin database [34]. The PPI network of Drosophila melanogaster was constructed using the HINT database [35], which is a curated compilation of high-quality PPIs from eight interactome resources (BioGRID, MINT, iRefWeb, DIP, IntAct, HPRD, MIPS and the PDB). After removing repeated interactions and self-interactions, the detailed information of the networks is listed in Table 1. The subcellular localization information of the proteins was retrieved from the knowledge channel of the COMPARTMENTS database [36]. There are 5974 proteins and 238,620 subcellular locations, which can be classified into 11 localizations. The gene expression data of Saccharomyces cerevisiae and Drosophila melanogaster were downloaded from the GEO database with accession numbers GSE3431 [37] and GSE7763 [38], respectively. The GO database is one of the most comprehensive ontology databases in bioinformatics. The GO annotation data of Saccharomyces cerevisiae were obtained from the GO Consortium [39], and the Drosophila melanogaster GO annotation data were extracted from the COMPARTMENTS database [36]. The list of known essential proteins covers 1285 and 408 essential proteins of Saccharomyces cerevisiae and Drosophila melanogaster, respectively, which were collected from MIPS [40], SGD [41], DEG [42], and SGDP [1].

Comparison with Other Identification Measures
To evaluate the performance of HSEP, we compared it with other competing methods (DC, EC, IC, SC, NC, LAC, WDC, PeC and UDoNC) and selected the top 1%, 5%, 10%, 15%, 20% and 25% of ranked proteins as the candidate sets. We set α = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1) and T = 5. First, to demonstrate that the HITS algorithm itself is effective for identifying essential proteins, we built a variant, named HSP, in which gene expression data are the only biological information used to weight the protein network. The prediction results were then compared with the known essential proteins and are presented as histograms in Figures 3-5, where we can see that HSP is superior to PeC. Since both methods use gene expression information and ECC to weight the PPI network, this indicates that the HITS algorithm is effective in identifying essential proteins. At the same time, HSEP performed better than HSP, which shows that GO annotation and subcellular localization play a significant role in identifying essential proteins.
For the DIP dataset shown in Figure 3, our proposed method HSEP clearly performed better than the other methods, which indicates that HSEP is effective in identifying essential proteins. HSEP had a particularly clear advantage at the top 1%, 20% and 25% levels. Taking the top 1% (51 proteins) as an example, 50 essential proteins were correctly identified by HSEP, while IC, SC and EC correctly predicted 24. At the top 25%, HSEP correctly identified 597 essential proteins, 130 more than SC and EC.
For the Gavin dataset shown in Figure 4, HSEP was slightly better than the other eight methods from the top 1% to the top 25% of ranked proteins. At the top 1% (14 proteins) level, HSEP, LAC and PeC each correctly identified all 14 true essential proteins. The results predicted by HSEP were similar to those obtained using LAC at the top 1%, 10%, 20% and 25% levels. Overall, as shown in Figures 3 and 4, HSEP had a more obvious advantage on the DIP dataset. Table 1 shows that the density of the Gavin dataset is 3.4 times higher than that of the DIP dataset. Since the advantage of HSEP was larger on the sparser DIP network, we can conclude that the HSEP algorithm is more suitable for sparse protein networks of Saccharomyces cerevisiae.
For the HINT dataset shown in Figure 5, HSEP exhibited superior performance compared with the other methods from the top 1% to the top 25% of ranked proteins, and it increased the prediction precision by more than 100%, 26%, 31%, 39%, 26%, and 20% at the six levels compared with IC. Comparing Figure 5 with Figures 3 and 4, we can see that the advantage is more pronounced in Figure 5, which demonstrates that our proposed method performed better on Drosophila melanogaster.

Validation Using Six Statistical Measures
To further evaluate the performance of our proposed HSEP, we adopted several statistical measures, namely sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (F), and accuracy (ACC), to determine how effectively the essential proteins were identified by the different methods. These statistical measures are defined as follows:
$$SN = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP}$$
$$PPV = \frac{TP}{TP + FP}, \qquad NPV = \frac{TN}{TN + FN}$$
$$F = \frac{2 \cdot SN \cdot PPV}{SN + PPV}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of essential proteins correctly identified as essential proteins, FP is the number of nonessential proteins mistakenly identified as essential proteins, TN is the number of nonessential proteins correctly identified as nonessential proteins, and FN is the number of essential proteins mistakenly identified as nonessential proteins. The SN, SP, PPV, NPV, F-measure and ACC of HSEP and the other methods are compared in Table 2. As shown in Table 2, HSEP achieved better values than the other methods, which leads to conclusions similar to those drawn from Figures 3-5.
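For readers who want to reproduce such a comparison, the six measures can be computed from the four confusion-matrix counts as in the sketch below. It uses the standard definitions of these measures; the function name, the zero-denominator convention (return 0.0) and the example counts are assumptions made for illustration and are not taken from Table 2.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV, F-measure and accuracy from counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumption: undefined ratios become 0.0

    sn = ratio(tp, tp + fn)
    sp = ratio(tn, tn + fp)
    ppv = ratio(tp, tp + fp)
    npv = ratio(tn, tn + fn)
    f = ratio(2 * sn * ppv, sn + ppv)  # harmonic mean of SN and PPV
    acc = ratio(tp + tn, tp + tn + fp + fn)
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv, "F": f, "ACC": acc}


# Hypothetical counts, used only to exercise the function.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Because the F-measure is the harmonic mean of SN and PPV, it is high only when a method both recovers many essential proteins and keeps false positives low.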

Influence of Parameter α on HSEP Based on Precision-Recall Curves
To investigate the influence of parameter α on HSEP, precision-recall (PR) curves were used to assess the generality of our method. The precision and recall of the top n ranked proteins are defined as follows:
$$Precision(n) = \frac{TP(n)}{TP(n) + FP(n)}, \qquad Recall(n) = \frac{TP(n)}{P}$$
where TP(n) is the number of true essential proteins among the top n proteins, FP(n) is the number of non-essential proteins among the top n proteins, and P is the total number of true essential proteins. Figure 6 shows the PR curves of HSEP with different values of α on the DIP database. The higher the curve, the better the corresponding setting distinguishes essential proteins from non-essential proteins. As shown in Figure 6, the results were best when α = 0.7 and α = 0.8. When α = 0, namely when only topological information was used, the result was the worst. Overall, biological information played a more important role than topological properties in identifying essential proteins.
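A precision-recall curve of this kind can be traced by walking down a ranked list and counting hits, as in the following sketch. It assumes that precision at n is the fraction of the top n proteins that are truly essential and that recall at n is the fraction of all true essential proteins found in the top n; the function name and the toy data are hypothetical.

```python
def precision_recall_points(ranked, essential):
    """Return [(precision, recall)] for the top 1, 2, ..., len(ranked) proteins."""
    essential = set(essential)
    total = len(essential)
    points = []
    tp = 0
    for n, protein in enumerate(ranked, start=1):
        if protein in essential:
            tp += 1  # TP(n); FP(n) is then n - tp
        precision = tp / n
        recall = tp / total if total else 0.0
        points.append((precision, recall))
    return points


# Toy ranking with two known essential proteins, A and C.
points = precision_recall_points(["A", "B", "C", "D"], {"A", "C"})
```

Computing one such list of points for each value of α and plotting precision against recall gives curves comparable to those described for Figure 6.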

The Analysis of Essential Proteins
We analyzed the essential proteins identified on the DIP database to further substantiate the performance of our proposed HSEP. Figure 7 shows the distribution of known essential proteins in the PPI network (Figure 7a), the top 1% of proteins identified by DC (Figure 7b) and the top 1% of proteins identified by HSEP (Figure 7c). In Figure 7b, we can see that the number of essential proteins correctly identified by DC was 22, shown as yellow circles. Here, we mainly analyze the top 1% of proteins identified by HSEP. In Figure 7c, we can see that all of the top 1% proteins are connected and form one subnetwork, which shows good topological features and indicates that essential proteins perform biological functions as a module; this is significant for identifying protein complexes. In addition, the protein "YHR066W" has a large degree but is the only one that was mistakenly identified as an essential protein, indicating that degree alone cannot fully reflect the essentiality of proteins.

Conclusions
Identifying essential proteins is of great importance for understanding the molecular mechanisms of cellular life. In this study, we have presented a new computational method that applies the HITS algorithm to weighted PPI networks to predict essential proteins. Both biological information and network topology are used to weight the PPI networks, which plays an important role in identifying essential proteins. In addition, we apply an ensemble method to reduce the influence of the parameter α. To investigate the performance of our proposed algorithm, we carried out a group of experiments on PPI data from two species: Saccharomyces cerevisiae and Drosophila melanogaster. The experimental results show that HSEP achieved better performance than the other methods: DC, EC, IC, SC, NC, LAC, WDC, PeC and UDoNC. To further assess our method, we compared it with the other methods using six statistical measures. In addition, we analyzed the identified essential proteins and found that they have good topological properties. In future work, our proposed HSEP may be helpful to other studies, such as gene and disease prediction.
Figure 7. (b) The identified top 1% essential proteins by DC, where yellow circles are the essential proteins that DC predicted as essential, while aqua circles are the non-essential proteins that DC predicted as essential ones; and (c) the identified top 1% essential proteins of HSEP, where the larger the degree of the protein, the bigger the size of the protein. The color key indicates that the degree of protein gradually increases from top to bottom.
Author Contributions: All authors worked on this manuscript together. All authors read and approved the final manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (61672334, 61502290, and 61401263).