Rule-Based Pruning and In Silico Identification of Essential Proteins in Yeast PPIN

Proteins are vital for the significant cellular activities of living organisms. However, not all of them are essential. Identifying essential proteins through biological experiments is more laborious and time-consuming than the computational approaches used in recent times. However, practical use of conventional methods is sometimes challenging because they perform poorly in specific scenarios. Thus, more developed and efficient computational prediction models are required for essential protein identification. An effective methodology is proposed in this research, capable of predicting essential proteins in a refined yeast protein–protein interaction network (PPIN). The rule-based refinement is done using protein complex and local interaction density information derived from the neighborhood properties of proteins in the network. Identification and pruning of non-essential proteins are equally crucial here. In the initial phase, node and edge weights are applied to identify and discard the non-essential proteins from the interaction network. Three cut-off levels are considered for each of the node and edge weights when pruning the non-essential proteins. Once the PPIN has been filtered, the second phase applies two centrality-based approaches in succession, (1) local interaction density (LID) and (2) local interaction density with protein complex (LIDC), to identify the essential proteins in the yeast PPIN. Our proposed methodology achieves better performance than the existing state-of-the-art techniques.


Introduction
Various research areas, such as protein structure prediction [1,2]; protein function prediction using protein sequences [3,4], protein domains [5,6], and protein-protein interaction networks (PPIN) [7–11]; protein subcellular localization identification [12,13]; and detection of essential proteins [14–16], have been explored extensively owing to the growing availability of proteins and protein sequences in the post-genomic era. In general, essential proteins are the highly connected modules in a PPIN [17]. So, removing any essential protein from the existing network would be fatal, resulting in various functional disorders of living organisms. Most research works [18–20] note that deeper analyses of essential proteins in a PPIN will lead to better assimilation of
Table 1. Computational studies based on essential protein prediction.

Features used | Study | Dataset | Ref.
Subcellular localization | An efficient method to identify essential proteins for different species by integrating protein subcellular localization information. | PPIN of Saccharomyces cerevisiae, Homo sapiens, Mus musculus and Drosophila melanogaster | [36]
Protein complex, degree, subgraph | A new method for predicting essential proteins based on participation degree in protein complex and subgraph density. | PPIN of Saccharomyces cerevisiae | [39]
CC and orthology | United neighborhood closeness centrality and orthology for predicting essential proteins. | PPIN of Saccharomyces cerevisiae | [63]
Node, edge clustering coefficient | Identification of essential proteins using improved node and edge clustering coefficient. | PPIN of Saccharomyces cerevisiae and Drosophila melanogaster | [22]
Centrality scores | CytoNCA: a cytoscape plugin for centrality analysis and evaluation of protein interaction networks. | _ | [24]
Protein complex | Identification of essential proteins based on a new combination of local interaction density and protein complexes. | PPIN of Saccharomyces cerevisiae | [23]
PPIN, protein complex | Prediction of essential proteins by integration of PPI network topology and protein complex information. | PPIN of Saccharomyces cerevisiae | [33]

Though the existing computational approaches can identify essential proteins efficiently, these methods also produce many false positives. To overcome this, a new methodology for essential protein identification is proposed in this work. This method works in two phases: (1) the first phase deals with the non-essential proteins present in the PPIN using two topological features, node and edge weight [64], which ensure the presence of only the reliable nodes and edges in the PPIN; in other words, they focus only on the densely connected modules in the PPIN [7]. (2) In the next phase, local interaction density (LID) [23] and local interaction density with protein complex (LIDC) [23] are used for the identification of essential proteins in the PPIN. All the required data supporting the proposed methodology, including basic terminologies like node weight, edge weight, LID, and LIDC centralities, are given in the Supplementary Materials, available online: https://drive.google.com/drive/folders/1nH3bjxTscorRunDOEAnZT2BXzHXWmRKd?usp=sharing, accessed on 18 August 2022.
In the next section, the yeast PPIN dataset used for the proposed methodology is discussed. Following that, the detailed implementation of our rule-based pruning and the application of LID and LIDC are described, along with a pictorial representation of PPIN-related terminologies. Finally, the paper ends with a results and discussion section, followed by the conclusion.

Dataset
For the proposed work, the PPIN of yeast, i.e., Saccharomyces cerevisiae, is used. It was downloaded from the DIP database [65,66] (named YDIP_5093 in the work of Luo et al. [23]) and includes 5093 proteins and 24,743 interactions. The PPIN of yeast is shown in Figure S1 in the Supplementary Materials. Moreover, a protein complex set, marked as Complex_745 [23], is also used along with LIDC [23] in the second phase of our proposed methodology. It contains about 745 protein complexes involving 2167 proteins. This set is a combination of four natural protein complex datasets: (1) CM270 is obtained from the MIPS database [67]; (2) CM425 [68] is obtained from MIPS (Mewes 2005), Aloy et al. [69], and the SGD database [70]; (3) the last two, CYC408 and CYC428, are obtained from CYC2008 of the Wodak Laboratory [71,72].
Cells 2022, 11, 2648

Methodology
This section proposes a methodology that identifies topologically well-connected proteins by applying a network-based scoring technique to the rule-based pruned network. The network is pruned by removing nodes and edges whose node weight or edge weight is below the specified cut-off value. Thus, less interconnected proteins are identified based on their degree and other parameters and removed, as they are not topologically significant. The entire working mechanism of the proposed methodology is given in Algorithm 1.
The PPIN of yeast contains some topologically less important proteins, i.e., proteins having degree 0 or 1, or fewer interconnections between their neighbors than the rest of the proteins, which indicates their non-essentiality. Edge reliability is another factor that must be considered for identifying essential proteins. Thus, the reliability of every node and edge is investigated by calculating node and edge weights [64] in the first phase of the proposed methodology. The node weight $W_v$ of a node $v \in V$ in PPI networks [64] is the average degree of all nodes in $G_v$, a sub-graph of the network $G$. It is represented by

$$W_v = \frac{\sum_{u \in V_v} \deg(u)}{|V_v|}$$

where $V_v$ is the set of nodes of $G_v$ (the neighbors of $v$, denoted $V$ in Algorithm 1) and $\deg(u)$ is the degree of a node $u \in V_v$. The edge weight $W_{uv}$ [64] of nodes $u$ and $v$ is represented by

$$W_{uv} = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$$

where $\Gamma(u)$ and $\Gamma(v)$ are the neighbors of $u$ and $v$, respectively. $\Gamma(u) \cap \Gamma(v)$ represents all common neighbors of $u$ and $v$, and $\Gamma(u) \cup \Gamma(v)$ means all distinct neighbors of $u$ and $v$. Less reliable nodes and interconnections are pruned. Thus, in an interaction network, a protein's interconnectivity with other proteins and the reliability of those interactions make the pruning strategy stronger. Moreover, setting various cut-off levels for node and edge weights is integral to this phase. So, three cut-off levels, i.e., high, medium, and low [73] (see Algorithm 1), are evaluated to see the changes in the prediction accuracy level in the second phase of essential protein identification. The cut-off ($\theta_k$) is calculated by the following equation:

$$\theta_k = \alpha + k \times \sigma \times \left(1 - \frac{1}{1 + \sigma^2}\right)$$

where $k \in \{1, 2, 3\}$ defines low, medium, and high cut-offs, respectively. $\alpha$ is the mean of the node weight/edge weight values, while $\sigma$ is the standard deviation of the node weight/edge weight values.
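As a worked illustration of the cut-off rule, consider a hypothetical weight distribution with mean $\alpha = 2.0$ and standard deviation $\sigma = 1.0$. These values are assumptions chosen for easy arithmetic and are not taken from the yeast network.

```latex
\begin{align*}
\theta_k &= \alpha + k \times \sigma \times \left(1 - \frac{1}{1 + \sigma^2}\right)
          = 2.0 + k \times 1.0 \times \left(1 - \frac{1}{2}\right)
          = 2.0 + 0.5\,k \\
\theta_1 &= 2.5 \ \text{(low)}, \qquad
\theta_2 = 3.0 \ \text{(medium)}, \qquad
\theta_3 = 3.5 \ \text{(high)}
\end{align*}
```

Since $1 - \frac{1}{1+\sigma^2} = \frac{\sigma^2}{1+\sigma^2}$, the step between successive cut-offs shrinks quickly when the weights are tightly clustered (small $\sigma$), so all three levels then lie close to the mean.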
This approach yields a refined PPIN of yeast containing denser sub-modules [7]. Moreover, as discussed in the introduction, essential proteins tend to lie in the denser sub-modules or protein complexes of a PPIN. Thus, the first phase plays a significant role in this research. The computation of the node and edge weights of two different synthetic networks is illustrated in Figures 1 and 2, respectively.
As discussed in the introduction, computational approaches to essential protein prediction can be of two types: (1) topological centrality-based approaches and (2) heterogeneous feature-based approaches. Experimental data [23] show that the topological centrality-based scoring technique, LID [23], and the heterogeneous feature-based approach, LIDC [23], perform better than the other existing approaches to essential protein identification. So, for each node and edge weight cut-off level in the second phase, LID [23] and LIDC [23] are computed for each protein. LIDC combines heterogeneous values obtained from LID, in-degree centrality of complex (IDC) derived from protein complex Complex_745 [23], and ranking of an individual protein. The procedure for computing LIDC is shown in Figure 3. Finally, the proteins are sorted in descending order according to their computed LIDC values. Protein sets are selected as essential in two different ranking ranges (top 100-200 proteins). This selection strategy is the same as in Luo et al.'s work [23].
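The final ranking step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `top_ranked_sets`, the example scores, and the alphabetical tie-breaking are hypothetical, and the LIDC computation itself is not reproduced here (scores are taken as precomputed input).

```python
def top_ranked_sets(scores, ks=(100, 200)):
    """Sort proteins by descending score and return the top-k list for each k.

    scores: dict mapping protein id -> centrality score (e.g. precomputed LIDC).
    ks: ranking cut-points; the defaults are only illustrative.
    Ties are broken alphabetically by protein id (an assumption, made so the
    output is deterministic).
    """
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return {k: ranked[:k] for k in ks}


if __name__ == "__main__":
    # Hypothetical scores for four proteins.
    demo = {"P1": 0.9, "P2": 0.4, "P3": 0.7, "P4": 0.7}
    print(top_ranked_sets(demo, ks=(1, 3)))
```

If fewer than k proteins remain after pruning, the slice simply returns all of them.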

Algorithm 1 (Essential Protein Prediction)
Input: PPIN of yeast
Output: List of essential and non-essential proteins
Begin
//calculating node weight
for every node P in the network
    Calculate the node weight, W_p = (∑_{u∈V} deg(u)) / |V|
    //V is the set of neighbors of node P, and |V| is the number of proteins in V
    //deg(u) is the degree of a node u ∈ V
//end of calculating node weight
Compute θ_k = α + k × σ × (1 − 1/(1 + σ²)) //cut-off calculation of node weight
//α is the mean of node weight, σ is the standard deviation of node weight, k ∈ {1, 2, 3} denotes three different
//cut-offs, i.e., low, medium, and high, respectively.
//reduction of network based on θ_k of node weights
for every node P in the network
    if node weight of P < θ_k
        remove P from the network
//end of reduction of network based on θ_k of node weights
//edge weight calculation
for every edge E in the network

//end of calculation of LIDC
Choose proteins in six ranking ranges (top 100-600) as essential protein sets.
End
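The pruning phase of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The network is assumed to be a dict mapping each protein to the set of its neighbors; degrees are counted in the full network; the edge weight is assumed to be the ratio of common to distinct neighbors; the population standard deviation is used; and edge pruning is assumed to mirror node pruning and to run on the node-pruned network, because that part of the algorithm is truncated in the listing above. The function names are hypothetical.

```python
from statistics import mean, pstdev


def cutoff(values, k):
    """theta_k = alpha + k * sigma * (1 - 1 / (1 + sigma^2)), k in {1, 2, 3}."""
    alpha, sigma = mean(values), pstdev(values)  # population std dev (assumption)
    return alpha + k * sigma * (1 - 1 / (1 + sigma ** 2))


def prune_nodes(adj, k):
    """Remove every node whose node weight is below theta_k."""
    # Node weight: average degree of a node's neighbors (0.0 for isolated nodes).
    weights = {
        p: (sum(len(adj[u]) for u in adj[p]) / len(adj[p]) if adj[p] else 0.0)
        for p in adj
    }
    if not weights:
        return {}
    theta = cutoff(list(weights.values()), k)
    keep = {p for p, w in weights.items() if w >= theta}
    return {p: adj[p] & keep for p in keep}


def prune_edges(adj, k):
    """Remove every edge whose edge weight is below theta_k (nodes are kept)."""
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                # Assumed edge weight: common neighbors / distinct neighbors.
                weights[(u, v)] = len(adj[u] & adj[v]) / len(adj[u] | adj[v])
    pruned = {p: set() for p in adj}
    if not weights:
        return pruned
    theta = cutoff(list(weights.values()), k)
    for (u, v), w in weights.items():
        if w >= theta:
            pruned[u].add(v)
            pruned[v].add(u)
    return pruned


def prune_network(adj, k):
    """Phase one: node pruning followed by edge pruning (order is an assumption)."""
    return prune_edges(prune_nodes(adj, k), k)


if __name__ == "__main__":
    # Toy network: a 4-clique (A, B, C, D) with a chain A-E-F attached.
    toy = {
        "A": {"B", "C", "D", "E"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
        "D": {"A", "B", "C"}, "E": {"A", "F"}, "F": {"E"},
    }
    print(prune_network(toy, k=1))
```

On this toy network the low cut-off keeps only nodes whose neighbors are, on average, highly connected, which illustrates how the rule favors dense sub-modules over peripheral chains.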

Results and Discussion
As mentioned earlier, an LIDC-based scoring technique [23] is used to mark proteins as essential in the topologically processed PPIN, and six different ranking ranges (top 100-600 proteins) are considered. The PPIN of yeast after predicting essential and non-essential proteins at ranking 100 is shown in Figure 4. The essentiality of the protein sets in the different ranking ranges (top 100-600) at three different cut-offs, i.e., low, medium, and high node and edge weight, is validated against the essential protein set [23] (containing 1285 essential and 4394 non-essential proteins) formed from different databases like MIPS [67], SGD [70], DEG [74], and SGDP [75]. The number of essential proteins predicted by our proposed method is compared with that of several existing methods, among them DC [17], BC [26], NC [14], LID [23], PeC [37], CoEWC [59], WDC [61], ION [35], LIDC [23], and UC [34], at the three cut-off levels in the supplementary figures, i.e., Figures S2, S3, and S5-S8. From these figures, it is clear that our method identifies an almost equal or greater number of essential proteins compared to LIDC [23] at most cut-off levels. This number is also higher than that of the other methods except for ION. The same observation holds when the jackknife methodology is used to evaluate the proposed method against the others (see Figure 5). Even though only 20 percent of proteins are considered for evaluating precision, recall, and F-score, our proposed methodology surpasses the others (see Table 2).
To compare and validate the performance of the proposed method, the top 20 percent of proteins [23] from the ranking result are selected as essential, while the remaining proteins are designated as non-essential. This selection strategy is the same as in Luo et al.'s work [23]. Precision, recall, and F-score are used as performance evaluation metrics. The performance analysis is given in Table 2. It can be seen from Table 2 that our proposed method performs better than the others in terms of precision, recall, and F-score. This signifies that it succeeds in returning most of the relevant proteins compared to the training set of essential proteins. High precision also indicates few false positives among the predicted essential proteins. Removing less important nodes and edges and working on the pruned network gives our proposed method an advantage over the methods listed in Table 2 and yields high precision, recall, and F-score values.
Our proposed method's satisfactory performance is achieved using node and edge weights with three levels of cut-offs. The pruned PPIN of yeast at ranking 100 is shown in Figure S4 in the Supplementary Materials. It should also be noted that although the working mechanisms of LIDC [23] and our proposed method are almost the same, LIDC [23] is applied to the entire PPIN of yeast, while our proposed method works on a filtered PPIN generated by using three levels of cut-offs on both node and edge weights. The statistics of predicted essential proteins in the filtered PPIN of yeast at the three cut-off levels (low, medium, and high node and edge weight) are displayed in Table 3. The overall precision, recall, and F-score at the three cut-off levels are shown in Table 4.
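The top-20-percent evaluation described above can be sketched in Python as follows. This is an illustrative sketch only: the function name `evaluate_top_fraction`, the rounding of the 20 percent cut, and the choice to count every reference essential protein that is not predicted as a false negative (including proteins removed during pruning) are assumptions, not details confirmed by the text.

```python
def evaluate_top_fraction(ranked, essential, fraction=0.2):
    """Return (precision, recall, f_score) for the top `fraction` of a ranking.

    ranked: protein ids sorted by descending score.
    essential: set of reference essential proteins.
    """
    # Size of the predicted-essential set; rounding is an assumption.
    n_pred = max(1, round(len(ranked) * fraction))
    predicted = set(ranked[:n_pred])

    tp = len(predicted & essential)   # predicted essential, truly essential
    fp = len(predicted) - tp          # predicted essential, not essential
    fn = len(essential - predicted)   # essential but not predicted (assumption)

    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical ranking of ten proteins with three reference essentials.
    ranking = [f"P{i}" for i in range(1, 11)]
    reference = {"P1", "P3", "P9"}
    print(evaluate_top_fraction(ranking, reference))
```

In this toy case the top 20 percent contains two proteins, one of which is essential, so precision is 0.5 while recall is only one third because two reference essentials fall outside the selected set.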

Conclusions
Identifying essential proteins is considered one of the most challenging research areas. It helps us identify the significant proteins that are biologically active and play a crucial part in performing vital specific functions of the human body. These proteins might also be essential in transmitting disease or infection when the body is exposed to pathogens. Thus, the computational methods developed for identifying essential proteins should be very effective. The PPIN is one of the resources through which this can be done. However, it should be borne in mind that all the network features must be adequately assessed, and the presence of reliable nodes and edges must be ensured. The proposed methodology efficiently identifies essential proteins from a pruned network using local interaction density and local interaction density with protein complex. The rule-based network pruning is based on specific cut-off values of node and edge weights. A detailed comparative study of the performance of the proposed method and other methods indicates that the proposed method performs better. Because this method depends solely on topological attributes, care should be taken to use a noise-free protein-protein interaction network. In future work, this approach may be extended to the protein interaction network of any other organism. However, it should be kept in mind that the essentiality of genes is dynamic and depends upon the surrounding environment. So, even if several PPIN datasets of yeast are used for the computational identification of essential proteins/genes, it cannot be assured that the genetic backgrounds set as the experimental environment for all the yeast strains are similar [76].
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cells11172648/s1. Figure S1: PPIN of yeast (YDIP_5093). It contains 5093 proteins and 24,743 interactions; Figure S2: Prediction comparison. Comparison of number of predicted essential proteins for low node and edge weight threshold; Figure S3: Prediction comparison. Comparison of top 100 and top 200 predicted essential proteins for medium node and edge weight threshold; Figure S4: Pruned PPIN of yeast at low threshold. Yellow nodes are non-essential proteins, while green nodes are the essential ones; Figure S5
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The source code of the work is available on GitHub at the following link (https://github.com/SovanSaha/Rule-based-pruning-and-In-Silico-identification-of-essentialproteins-in-Yeast-PPIN.git, accessed on 18 August 2022) for free academic use.