Predicting Essential Proteins Based on Integration of Local Fuzzy Fractal Dimension and Subcellular Location Information
Genes 2022, 13, 173

Essential proteins are indispensable to cells’ survival and development. Prediction and analysis of essential proteins are crucial for uncovering the mechanisms of cells. With the help of computer science and high-throughput technologies, predicting essential proteins from protein–protein interaction (PPI) networks has become more efficient than traditional approaches, which generally rely on expensive experimental methods. Many computational algorithms have been employed to predict essential proteins; however, each has its limitations. To improve the prediction accuracy, we introduce the Local Fuzzy Fractal Dimension (LFFD) of complex networks into the analysis of the PPI network and propose a novel algorithm named LDS, which combines the LFFD of the PPI network with protein subcellular location information. Tests of the proposed LDS algorithm on three different yeast PPI networks show that LDS outperforms some state-of-the-art essential protein prediction techniques.


Introduction
As one of the important gene products, proteins play a critical role in the lifespan of cells for all living organisms. Essential proteins are those whose removal, even of a single one, causes lethality or infertility of the cell [1]. Organisms cannot survive without essential proteins [2,3]. Therefore, the prediction of essential proteins is a meaningful task due to its theoretical interest and practical significance.
Up to now, two kinds of methods have generally been used to predict essential proteins. The first is traditional biological experimental techniques, such as gene knockouts [4], RNA interference [5], and conditional knockouts [6]. All of them are expensive and time-consuming. The second is computational approaches, which are efficient and low-cost owing to the data produced by high-throughput technologies such as mass spectrometry analysis [7], the yeast two-hybrid system [8,9], and tandem affinity purification [10]. Many computational approaches have been proposed from the network perspective to capture the relations between network features and protein essentiality. If each protein is regarded as a node, the protein-protein interaction (PPI) network can be treated as a complex network. Complex network-related methods have long been used in studies of PPI networks [11–15].
Studies of PPI networks have found that highly connected proteins are more likely to be essential, an observation known as the centrality-lethality rule. Accordingly, more and more research efforts focus on the correlation between the topological centrality of PPI networks and protein essentiality. A wealth of centrality methods have emerged, such as Degree Centrality (DC) [16,17], Subgraph Centrality (SC) [18], Betweenness Centrality (BC) [19], Closeness Centrality (CloseC) [20], Clustering Coefficient (ClusterC) [21], and Information Centrality (IC) [22]. Li et al. [23] proposed local average connectivity (LAC) to identify essential proteins. Qi et al. [24] utilized the local interaction density (LID) of the PPI network to predict essential proteins. These methods provide new ideas for predicting essential proteins. However, because PPI networks contain a high proportion of false positives and false negatives, they also have certain shortcomings. To compensate for this defect of PPI networks, biological information about proteins should also be considered, including protein complex information, gene expression data, orthologous protein information, and subcellular localization information. Li et al. [25] developed the PeC method, which integrates PPI information (edge clustering coefficient) and gene expression profiles (Pearson's correlation coefficient of two interacting proteins) for the discovery of essential proteins. Lei et al. [26] designed a weighted PPI network by applying Hyperlink-Induced Topic Search (HITS) for essential protein mining. Ren et al. [27] predicted essential proteins by incorporating PPI networks and protein-complex information. Because essential proteins are usually interconnected, Peng et al. [28] introduced an iterative method for identifying essential proteins based on orthology and PPI networks.
Recently, plenty of research has demonstrated that subcellular localization plays a key role in predicting essential proteins. Accordingly, Tang et al. [29] proposed a new method that combines subcellular localization information and PPI data. The experimental results show that it raises the recognition accuracy of essential proteins.
In Ref. [30], Song et al. reported that PPI networks are fractal networks and therefore possess topological self-similarity [31]. This provides a theoretical basis for predicting essential proteins according to the fractal dimension of the PPI network. A large number of fractal dimension algorithms have been put forward to analyze various real-world complex networks, for instance, the box-covering algorithm [32], the ball-covering algorithm [33], and the edge-covering box-counting algorithm [34]. However, these algorithms are all aimed at the global fractal structure of complex networks and ignore the characterization of individual nodes. To make up for this defect, Filipi et al. [35] proposed the local fractal dimension (LFD) of complex networks and applied it to analyze two power grid networks. They found that nodes with high LFD are mostly the topological centers of the networks.
In this paper, we first develop a new LFD by combining it with the idea of the fuzzy set, which we call the local fuzzy fractal dimension (LFFD). Compared with the LFD, the LFFD reflects the role of nodes in the network more accurately. Next, we obtain the subcellular location information of essential and non-essential proteins of Saccharomyces cerevisiae and determine the subcellular compartment score using the Bayes formula. Then, combining the LFFD and the subcellular compartment score, we present the LDS algorithm to predict essential proteins. Three PPI datasets are employed to test our algorithm, and nine existing methods are run on the same datasets for comparison. The results show that LDS achieves the best performance.

Local Fractal Dimension
A protein-protein interaction (PPI) network is generally denoted as an undirected network G = (V, E), which is composed of node set V and an edge set E. Each node v ∈ V represents a protein, each edge (u, v) ∈ E represents an interaction between protein u and protein v.
It is widely known that most real-world networks obey a power-law distribution. In Ref. [31], the authors show that the PPI network also follows a power law. Accordingly, Equation (1) holds for the PPI network:

$$B_v(r) \sim C\, r^{D_v} \qquad (1)$$

where $B_v(r)$ is the total number of nodes in the sphere (including the boundary) with center node $v$ and topological radius $r$; $r$ is taken from 1 to the farthest distance from node $v$ to any other node; $D_v$ is the local fractal dimension (LFD) of node $v$; and $C$ is a constant. The fractal dimension $D_v$ can be calculated as the derivative of the logarithm of $B_v(r)$ with respect to the logarithm of $r$, as follows:

$$D_v = \frac{d \ln B_v(r)}{d \ln r} \qquad (2)$$

In general, one can obtain $D_v$ as the slope of the straight line fitted to the double-logarithmic plot of $B_v(r)$ against $r$.
To visualize this process, we give an example in Figure 1. The center node (red circle) is $v$. The nodes at distance $r = 1$ from $v$ (dark yellow diamonds) give $B_v(1) = 6$ (= 1 + 5); the nodes at $r = 2$ (green rectangles) give $B_v(2) = 11$ (= 6 + 5); the nodes at $r = 3$ (blue triangles) give $B_v(3) = 15$ (= 11 + 4); and the nodes at $r = 4$ (black pentagons) give $B_v(4) = 19$ (= 15 + 4). As calculated by Equation (2), the value of $D_v$ is 0.8295.
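The log-log fit described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names (`shortest_distances`, `ball_sizes`, `loglog_slope`, `local_fractal_dimension`) and the adjacency-dictionary input format are assumptions made for the example, and returning 0.0 when fewer than two radii are available is likewise an assumed convention.

```python
import math
from collections import deque


def shortest_distances(adj, v):
    """Breadth-first search: hop distance from v to every reachable node."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def ball_sizes(adj, v):
    """Return [B_v(1), ..., B_v(r_max)], counting the center node itself."""
    dist = shortest_distances(adj, v)
    r_max = max(dist.values())
    return [sum(1 for d in dist.values() if d <= r) for r in range(1, r_max + 1)]


def loglog_slope(values):
    """Least-squares slope of ln(values[r-1]) against ln(r), r = 1..len(values)."""
    if len(values) < 2:
        return 0.0  # assumed convention: a single radius gives no slope to fit
    xs = [math.log(r) for r in range(1, len(values) + 1)]
    ys = [math.log(b) for b in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def local_fractal_dimension(adj, v):
    """LFD of node v: slope of the log-log plot of B_v(r) against r."""
    return loglog_slope(ball_sizes(adj, v))


# Ball sizes quoted in the Figure 1 example: B_v(1..4) = 6, 11, 15, 19.
print(round(loglog_slope([6, 11, 15, 19]), 4))
```

Fitting the four ball sizes quoted in the Figure 1 example gives a slope of about 0.8295, in agreement with the value stated in the text.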

Local Fuzzy Fractal Dimension
In the calculation of the local fractal dimension, all nodes with a topological distance less than or equal to $r$ are considered equally important. However, these nodes lie at different distances from the center node and should not be treated equally: the closer a node is to the center node, the greater its contribution to the center node. For this reason, the local fractal dimension $D_v$ cannot truly describe the self-similarity of the PPI network. Here, we propose a method to calculate the local fuzzy fractal dimension (LFFD), inspired by the concept of the fuzzy set. In this method, the Gaussian membership function is employed to distinguish the contributions of different nodes to the center node. The LFFD is defined as

$$D_f(v) = \frac{d \ln N_v(r)}{d \ln r} \qquad (3)$$

where $D_f(v)$ denotes the LFFD of node $v$, $N_v(r)$ is the fuzzy value of the center node $v$, and $r$ is the topological radius. $N_v(r)$ is determined by

$$N_v(r) = \frac{1}{N} \sum_{j=1}^{N} A_{vj}(r) \qquad (4)$$

where $d_{vj}$ is the shortest distance between node $v$ and node $j$, $A_{vj}(r)$ is the value of the Gaussian membership function (Equation (5)) when $d_{vj}$ is less than or equal to $r$, and $N$ is the total number of nodes whose shortest distance to the central node $v$ is less than or equal to $r$. Taking $r$ from 1 to the farthest distance from node $v$ to any other node in the PPI network, the corresponding $N_v(r)$ is determined by averaging the membership value over the $N$ nodes. As in the calculation of $D_v$, $D_f(v)$ can be obtained as the slope of the straight line fitted to the log-log plot of $N_v(r)$ against $r$.
To illustrate this method, we take the well-known kite network as an example. In Figure 2, node 7 is the selected central node, and $r$ ranges from 1 to 4. $N_v(r)$ is calculated for each of these radii as described above. According to Equation (3), the LFFD of node 7 is then 0.2312.
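The LFFD procedure of this section (average the membership values of the nodes within radius r, then fit a log-log slope) can be sketched in Python as follows. Two points are assumptions of the sketch and are not taken from the text: the Gaussian membership is written as exp(-d²/r²), which is one common form, and the center node itself is left out of the average. All function names are hypothetical. Because of these assumptions, the sketch is not expected to reproduce the value 0.2312 quoted for the kite network.

```python
import math
from collections import deque


def shortest_distances(adj, v):
    """Breadth-first search: hop distance from v to every reachable node."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def gaussian_membership(d, r):
    # ASSUMPTION: exp(-d^2 / r^2) is used as the Gaussian membership function;
    # the exact expression used in the paper may differ.
    return math.exp(-(d * d) / (r * r))


def fuzzy_values(adj, v):
    """Return [N_v(1), ..., N_v(r_max)] as averages of membership values."""
    dist = shortest_distances(adj, v)
    # ASSUMPTION: the center node (distance 0) is excluded from the average.
    others = [d for node, d in dist.items() if node != v]
    if not others:
        return []
    values = []
    for r in range(1, max(others) + 1):
        inside = [gaussian_membership(d, r) for d in others if d <= r]
        values.append(sum(inside) / len(inside))
    return values


def loglog_slope(values):
    """Least-squares slope of ln(values[r-1]) against ln(r), r = 1..len(values)."""
    if len(values) < 2:
        return 0.0  # assumed convention: a single radius gives no slope to fit
    xs = [math.log(r) for r in range(1, len(values) + 1)]
    ys = [math.log(x) for x in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def local_fuzzy_fractal_dimension(adj, v):
    """LFFD of node v: slope of the log-log plot of N_v(r) against r."""
    return loglog_slope(fuzzy_values(adj, v))


# Toy path graph 0 - 1 - 2 (hypothetical data, not from the paper).
path = {0: [1], 1: [0, 2], 2: [1]}
print(fuzzy_values(path, 0), local_fuzzy_fractal_dimension(path, 0))
```

Under these assumptions, a node at distance d contributes less than a nearer one for every radius r, which is the weighting behaviour that motivates the LFFD over the plain node count.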

Subcellular Compartment Score
Subcellular location information has been widely exploited in the prediction of essential proteins [36]. We downloaded the subcellular location data of Saccharomyces cerevisiae from the COMPARTMENTS database [37], which classifies proteins into 11 subcellular compartments, namely Cytoskeleton, Cytosol, Endoplasmic Reticulum, Endosome, Extracellular space, Golgi apparatus, Mitochondrion, Nucleus, Peroxisome, Plasma membrane, and Vacuole. By collecting data from MIPS [38], SGD [39], DEG [40], and SGDP, we obtained a list of 1285 known essential proteins and 4394 non-essential proteins of Saccharomyces cerevisiae.
By analyzing the subcellular location data of the identified essential and non-essential proteins, we develop a new evaluation strategy to obtain the subcellular compartment score, which is the probability that a protein located in a given subcellular compartment is essential. Firstly, we calculate the probability that a protein appears at each subcellular compartment among all 5679 (= 1285 + 4394) proteins, which is defined as follows:

$$P(C_i) = P(E)\,P(C_i \mid E) + P(NE)\,P(C_i \mid NE) \qquad (6)$$

where $C_i$ is the subcellular compartment with $i$ from 0 to 10 and $P(C_i)$ is the probability that a protein appears at $C_i$. $P(E)$ is the proportion of essential proteins among the 5679 proteins, and $P(C_i \mid E)$ is the conditional probability that a protein appears at $C_i$ among the 1285 essential proteins. $P(NE)$ is the proportion of non-essential proteins among the 5679 proteins, and $P(C_i \mid NE)$ is the probability that a protein appears at $C_i$ among the 4394 non-essential proteins. Then, the Bayes formula is employed to obtain the subcellular compartment score:

$$P(E \mid C_i) = \frac{P(E)\,P(C_i \mid E)}{P(C_i)} \qquad (7)$$

where $P(E \mid C_i)$ is the score of compartment $C_i$, indicating the probability that a protein appearing at $C_i$ is an essential protein. In this way, the scores of the 11 subcellular compartments can be calculated. Finally, we compute the subcellular compartment score of each protein in the PPI network. When the subcellular location information of a protein contains multiple compartments, we take the average value, which is determined by

$$\mathrm{SCS}(v) = \frac{1}{N} \sum_{C_i \ni v} P(E \mid C_i) \qquad (8)$$

where the sum runs over the compartments $C_i$ in which node $v$ is located, $N$ is the number of these compartments, and $\mathrm{SCS}(v)$ is the final subcellular compartment score of node $v$. $\mathrm{SCS}(v)$ is set to 0 when no subcellular compartment is recorded for node $v$.
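The scoring scheme of this section can be sketched in Python as below. The function names and the dictionary-based input format are assumptions made for illustration, and the per-compartment counts in the usage lines are invented toy numbers, not values from the COMPARTMENTS data; only the totals 1285 and 4394 come from the text.

```python
def compartment_scores(essential_counts, nonessential_counts,
                       n_essential, n_nonessential):
    """Score each compartment C_i by P(E | C_i) using the Bayes formula.

    essential_counts[c]    : number of essential proteins located in c
    nonessential_counts[c] : number of non-essential proteins located in c
    """
    total = n_essential + n_nonessential
    p_e = n_essential / total        # P(E)
    p_ne = n_nonessential / total    # P(NE)
    scores = {}
    for c in set(essential_counts) | set(nonessential_counts):
        p_c_given_e = essential_counts.get(c, 0) / n_essential
        p_c_given_ne = nonessential_counts.get(c, 0) / n_nonessential
        # Total probability of compartment c, as in Equation (6).
        p_c = p_e * p_c_given_e + p_ne * p_c_given_ne
        scores[c] = p_e * p_c_given_e / p_c if p_c > 0 else 0.0
    return scores


def protein_score(compartments, scores):
    """SCS(v): mean compartment score; 0 if no compartment is recorded."""
    if not compartments:
        return 0.0
    return sum(scores[c] for c in compartments) / len(compartments)


# Toy per-compartment counts (hypothetical); the totals are those of the text.
scores = compartment_scores({"Nucleus": 600, "Cytosol": 300},
                            {"Nucleus": 1000, "Cytosol": 900},
                            n_essential=1285, n_nonessential=4394)
print(scores["Nucleus"], protein_score(["Nucleus", "Cytosol"], scores))
```

With counts of this kind the Bayes formula reduces to the fraction of essential proteins among all proteins observed in the compartment, which gives a quick sanity check on any implementation.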

LDS Algorithm
The local fuzzy fractal dimension describes the topological feature of the PPI network, while the subcellular location information characterizes the biological information of the PPI network. To comprehensively assess the essentiality of every protein, we combine these two characteristics in the LDS algorithm. The final value of protein $v$, denoted $\mathrm{LDS}(v)$, is defined by

$$\mathrm{LDS}(v) = \alpha \cdot ND_f(v) + (1 - \alpha) \cdot \mathrm{SCS}(v) \qquad (9)$$

where $ND_f(v)$ is the Min-Max normalization result of $D_f(v)$, and $\alpha$ is a parameter within the range [0, 1]. If $\alpha$ is equal to 1, $\mathrm{LDS}(v)$ depends only on the topological feature, whereas in the case of $\alpha = 0$ it is determined only by the biological information. All proteins in the PPI network are ranked in descending order of LDS value.
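The weighted combination and ranking can be sketched in Python as follows. The names `min_max_normalize`, `lds_scores`, and `rank_proteins` are hypothetical, the toy inputs are invented, and two conventions are assumptions of the sketch: a constant LFFD vector is normalized to all zeros, and proteins without a compartment score receive 0.

```python
def min_max_normalize(values):
    """Min-Max normalization of a {protein: value} mapping to [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}  # assumed convention for constant input
    return {k: (x - lo) / (hi - lo) for k, x in values.items()}


def lds_scores(lffd, scs, alpha):
    """Combine normalized LFFD and SCS with weight alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    nd = min_max_normalize(lffd)
    # Proteins missing from scs are given a score of 0 (assumption).
    return {v: alpha * nd[v] + (1.0 - alpha) * scs.get(v, 0.0) for v in lffd}


def rank_proteins(scores):
    """Proteins sorted in descending order of their LDS value."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy values (hypothetical, for illustration only).
lffd = {"a": 0.2, "b": 0.6, "c": 1.0}
scs = {"a": 0.5, "b": 0.1, "c": 0.3}
print(rank_proteins(lds_scores(lffd, scs, alpha=0.5)))
```

Normalizing the LFFD before mixing keeps both terms on a comparable [0, 1] scale, so that the weight alpha has a consistent meaning across networks.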

Experimental Data
As mentioned above, the PPI network of Saccharomyces cerevisiae (yeast) has been widely used in studying essential proteins, and we also use it in our experiments. Our PPI datasets were downloaded from the DIP database [41] and the MIPS database. After removing self-interactions and repeated interactions, we constructed three PPI datasets: DIP4746, with 4746 proteins and 15,166 interactions from the DIP database; DIP5093, with 5093 proteins and 24,743 interactions from the DIP database; and MIPS4546, with 4546 proteins and 12,319 interactions from the MIPS database. In addition, we queried the essential and non-essential proteins and the subcellular location information in each dataset. For simplicity, proteins of unknown essentiality are treated as non-essential. More details are listed in Table 1.
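The cleaning step (dropping self-interactions and repeated interactions) can be sketched in Python as follows. The function names, the pair-list input format, and the protein identifiers are all hypothetical; the sketch treats an interaction listed in both directions as a repeat, since the network is undirected.

```python
def clean_interactions(pairs):
    """Drop self-interactions and repeats; return a set of undirected edges."""
    edges = set()
    for a, b in pairs:
        if a == b:
            continue  # self-interaction
        # Sorting makes (a, b) and (b, a) the same undirected edge.
        edges.add(tuple(sorted((a, b))))
    return edges


def build_adjacency(edges):
    """Adjacency sets of the undirected PPI network."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Hypothetical raw interaction list.
raw = [("P1", "P2"), ("P2", "P1"), ("P1", "P1"), ("P2", "P3")]
edges = clean_interactions(raw)
adj = build_adjacency(edges)
print(len(adj), "proteins,", len(edges), "interactions")
```

Note that a protein whose only recorded interaction is with itself does not appear in the resulting network.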

Performance of the LDS Algorithm
To demonstrate the performance of the LDS algorithm, we ranked proteins in descending order of LDS value and selected the top 1000 to top 1500, with a step size of 100, as essential protein candidates. Then, we checked the candidates against the collection of essential proteins mentioned in Section 2.3. For comparison, the results obtained from LDS and nine other prediction methods, namely, DC, SC, BC, CloseC, ClusterC, IC, LAC, PeC, and LID, are shown in Figures 3-5. From these figures, some findings can be drawn: (1) The nine compared methods perform differently on different datasets. For example, LAC and LID outperform the other compared methods on DIP4746 and DIP5093; however, their performance on MIPS4546 is mediocre. PeC has the upper hand on MIPS4546 but is inferior to most methods on the former two datasets. In contrast, the performance of the proposed LDS algorithm is quite stable: it shows the best performance on all three datasets. (2) LDS performs only slightly better than the other methods on DIP4746 but shows a clearer advantage on the latter two datasets, especially MIPS4546. These findings suggest that LDS is well suited to predicting essential proteins owing to its accuracy and robustness.
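To further evaluate the performance of the proposed LDS algorithm comprehensively, six evaluation indexes, namely sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure, and accuracy (ACC), are adopted here, defined as in Equations (10)-(15):

$$\mathrm{SN} = \frac{TP}{TP + FN} \qquad (10)$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \qquad (11)$$

$$\mathrm{PPV} = \frac{TP}{TP + FP} \qquad (12)$$

$$\mathrm{NPV} = \frac{TN}{TN + FN} \qquad (13)$$

$$F\text{-measure} = \frac{2 \times \mathrm{SN} \times \mathrm{PPV}}{\mathrm{SN} + \mathrm{PPV}} \qquad (14)$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (15)$$

where TP is the number of essential proteins correctly predicted as essential proteins and TN is the number of non-essential proteins correctly predicted as non-essential proteins. FP is the number of non-essential proteins incorrectly predicted as essential proteins, and FN is the number of essential proteins incorrectly predicted as non-essential proteins.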
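As an illustration, the six indexes can be computed from a ranked protein list and a set of known essential proteins using the standard confusion-matrix definitions. The function name `evaluate_top_k`, its arguments, and the toy data are hypothetical; returning 0.0 for a ratio with a zero denominator is an assumed convention.

```python
def evaluate_top_k(ranked, essential, k):
    """Treat the top k ranked proteins as predicted essential, the rest as not."""
    predicted = set(ranked[:k])
    rest = set(ranked[k:])
    tp = len(predicted & essential)   # essential, predicted essential
    fp = len(predicted) - tp          # non-essential, predicted essential
    fn = len(rest & essential)        # essential, predicted non-essential
    tn = len(rest) - fn               # non-essential, predicted non-essential

    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty classes

    sn = ratio(tp, tp + fn)
    ppv = ratio(tp, tp + fp)
    return {
        "SN": sn,
        "SP": ratio(tn, tn + fp),
        "PPV": ppv,
        "NPV": ratio(tn, tn + fn),
        "F": ratio(2 * sn * ppv, sn + ppv),
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
    }


# Toy ranking and essential set (hypothetical).
ranked = ["p1", "p2", "p3", "p4", "p5", "p6"]
essential = {"p1", "p3", "p6"}
print(evaluate_top_k(ranked, essential, k=2))
```

In this toy case TP = 1, FP = 1, FN = 2, and TN = 2, so for example SN = 1/3 and ACC = 1/2.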
To assess the effectiveness of the LDS algorithm and the other methods, we take the top 1500 proteins of each ranking as the essential protein candidate set, while the rest are categorized as the non-essential protein candidate set. The results obtained with the LDS algorithm and the other nine methods on the three datasets are listed in Table 2, where the best result for each dataset is highlighted. All the highlighted results come from the LDS algorithm, which confirms again that LDS has a distinct advantage over the other methods.

Influence of the Parameter α
As shown in Equation (9), the parameter α (∈ [0, 1]) is a weight in the proposed LDS algorithm that balances the topological structure against the biological information. A larger α gives a greater weight to the fractal structure. To illustrate how α affects the prediction of essential proteins, we varied α over the range [0, 1] with a step size of 0.1 and repeated the experiment reported in Section 3.2. The results are shown in Figure 6. We find that the prediction results depend greatly on α. Specifically, for the datasets DIP4746 and DIP5093, the best results are obtained with α between 0.4 and 0.5, which suggests that topological features and biological information are almost equally important for predicting essential proteins in those two datasets. However, for the dataset MIPS4546, the best results lie on the plateau from α = 0 to α = 0.2, indicating that biological information is the main factor affecting the prediction of essential proteins. A potential reason for this difference is that Saccharomyces cerevisiae (yeast) datasets downloaded from different protein databases have distinct topological features.
Figure 6. Number of essential proteins predicted by LDS in the top 1000 to 1500 for the three datasets with different values of the parameter α.
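A parameter sweep of this kind can be sketched in Python as follows. The function name `sweep_alpha` and the toy data are hypothetical; the sketch assumes the LFFD values have already been Min-Max normalized and that proteins without a compartment score receive 0.

```python
def sweep_alpha(nd, scs, essential, k, steps=10):
    """Count essential proteins in the top k for alpha = 0, 1/steps, ..., 1.

    nd  : {protein: normalized LFFD in [0, 1]}
    scs : {protein: subcellular compartment score}
    """
    results = {}
    for i in range(steps + 1):
        alpha = round(i / steps, 10)
        score = {v: alpha * nd[v] + (1.0 - alpha) * scs.get(v, 0.0) for v in nd}
        top = sorted(score, key=score.get, reverse=True)[:k]
        results[alpha] = sum(1 for v in top if v in essential)
    return results


# Toy data (hypothetical): topology favours a and b, location favours c and d.
nd = {"a": 1.0, "b": 0.8, "c": 0.1, "d": 0.0}
scs = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.8}
hits = sweep_alpha(nd, scs, essential={"c", "d"}, k=2)
best_alpha = max(hits, key=hits.get)
print(hits, best_alpha)
```

In this toy example the essential proteins are exactly those favoured by the location score, so small values of α recover both of them while α = 1 recovers none.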

Conclusions
The prediction of essential proteins is an effective way to reveal the molecular mechanisms of cellular life. By combining the topological feature and the biological information of the PPI network, we developed a novel LDS algorithm to predict essential proteins. To investigate the performance of the proposed algorithm, we carried out several experiments on three PPI datasets of Saccharomyces cerevisiae. The experimental results confirm that LDS outperforms the other nine existing methods, namely DC, SC, BC, CloseC, ClusterC, IC, LAC, PeC, and LID. Six statistical indicators verify its advantage comprehensively.
In summary, this work is a preliminary attempt to apply the fractal nature of PPI networks to the prediction of essential proteins. The results suggest that feature fusion is a valuable approach to predicting essential proteins. In future work, we will focus on how to merge different features to improve prediction accuracy.