Article

A Special Structural Based Weighted Network Approach for the Analysis of Protein Complexes

1 Institute of Informatics, University of Szeged, 2 Árpád tér, H-6720 Szeged, Hungary
2 Bánki Donát Faculty of Mechanical and Safety Engineering, Óbuda University, Népszínház Street 8, H-1081 Budapest, Hungary
3 Department of Pediatrics and Pediatric Health Center, Albert Szent-Györgyi Health Centre, University of Szeged, H-6725 Szeged, Hungary
4 InnoRenew CoE, Livade 6a, 6310 Izola, Slovenia
5 Andrej Marušič Institute, University of Primorska, Muzejski trg 2, 6000 Koper, Slovenia
6 Department of Applied Informatics, University of Szeged, Boldogasszony sgt. 6, H-6725 Szeged, Hungary
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(11), 6388; https://doi.org/10.3390/app13116388
Submission received: 20 April 2023 / Revised: 10 May 2023 / Accepted: 18 May 2023 / Published: 23 May 2023
(This article belongs to the Special Issue Bioinformatics: From Gene to Networks)

Abstract

The detection and analysis of protein complexes is essential for understanding functional mechanisms and cellular integrity. Recently, several techniques for detecting and analysing protein complexes from Protein–Protein Interaction (PPI) datasets have been developed. Most of those techniques have shortcomings: they detect overlapping complexes inefficiently, exclude attachment proteins from the complex core, cannot detect the inherent structures of the underlying complexes, have high false-positive rates, and perform poorly in enrichment analysis. To address these limitations, we introduce a special structural-based weighted network approach for the analysis of protein complexes based on Weighted Edge, Core-Attachment and Local Modularity structures (WECALM). Experimental results indicate that WECALM performs relatively better than existing algorithms in terms of accuracy, computational time, and p-value. A functional enrichment analysis also shows that WECALM is able to identify a large number of biologically significant protein complexes. Overall, WECALM outperforms other approaches by striking a better balance of accuracy and efficiency in the detection of protein complexes.

1. Introduction

The detection of protein complexes in Protein–Protein Interaction (PPI) networks is an essential task in systems biology for deciphering cellular organization and functional mechanisms. Protein complexes perform the majority of a cell’s functional actions [1,2,3]. As a result, detecting protein complexes is a critical research topic in systems biology. Understanding biological processes is also important in a variety of cytoplasmic systems and helps in the diagnosis of complex diseases [4,5,6].
Though there are numerous laboratory techniques for detecting protein complexes, most of them tend to be expensive and time-consuming. This has led to the use of computational methods as an efficient approach to detecting protein complexes [7]. Computational methods for protein complex detection are generally classified into two broad classes depending on the information required during the complex detection procedure [8]. The first class is the topology-based approach, which uses only the topological information of the PPI network to detect protein complexes. The second class, which includes DPC [9], GMFTP [10], and IPC-BSS [11], uses both topological and biological data. Recently, a number of topology-based approaches have been developed. Examples include k-clique or clique-based methods such as CMC [12] and CFinder [13]; sub-network density-based methods such as MCL [14,15,16], DPClus [17,18], and SPICi [19]; modularity-based methods such as CALM [20] and ClusterONE [21]; core-attachment structure-based methods such as COACH [22] and Core [23]; and rank and spoke-based methods such as ProRank+ [24].
Nevertheless, these topology-based methods do not identify the state and structure of protein complexes in a PPI network. For instance, CFinder [13] detects protein complexes based on the clique percolation method (CPM) [25], an approach that is computationally expensive on large-scale PPI networks because requiring each protein complex to be a k-clique leads to an NP-complete problem [26,27]. Related studies have also applied sub-network density-based approaches such as Markov Clustering (MCL) [15,16], which detect protein complexes based on the interactions of proteins within a sub-network (protein complex) in a random-walk fashion [7,8,28,29]. Moreover, heuristic network clustering techniques such as SPICi [19] have been shown to be efficient at detecting protein complexes based on local density and a support measure. However, this technique is often unreliable when detecting protein complexes with overlapping structures, especially those with high functional similarity. This has led to the development of DPClus [17] as an efficient method for detecting such overlapping protein complexes. However, methods such as ClusterONE that utilize MMR for overlapping complex detection tend to miss some attachment proteins, which could result in false positives for protein complex detection [18,20]. Filtering methods such as ProRank+ [24] and PEWCC [30] have been adopted to increase the reliability of PPI networks. Recently, modularity-based clustering techniques such as PCR-FR [31], CALM [20], ClusterONE [21] and EPOF [32] have been proposed for detecting protein complexes in densely and sparsely connected network structures [13,33,34,35,36,37,38]. Generally, the core of a protein complex is a dense sub-network, and attachment proteins are closely linked to the complex’s core proteins and help them perform auxiliary functions [22]. Protein complexes have an inherent organization and a common architecture [39,40].
Several techniques for identifying protein complex cores based on core-attachment structure have been investigated to this point, including COACH [22], and Core [23].
Another popular technique for detecting protein complex cores is the co-attachment method, which is often based on the network core-network structure [41,42,43]. Generally, this technique has two steps: the identification of the complex core as a dense sub-network or maximal clique, followed by the characterization of the core of the protein complex. Although these two steps have been widely adopted in the detection of protein complexes, they tend to be inefficient when attempting to characterize the protein complex core of a dense sub-network [44]. Moreover, the majority of core-attachment-based methods select proteins whose neighbors interact with more than half of the proteins in the complex core in sparse PPI networks [22]. However, this may result in many false-positive interactions and lead to the inaccurate detection of protein complexes [45,46,47]. The core-attachment structure is still being investigated; no studies have provided a clear distinction between overlapping proteins, core proteins, and peripheral proteins in terms of the weighted network structure [41]. The majority of studies simply focus on a few structural concepts of these protein complexes [20,45,46,47,48,49].
Recently, methods such as CALM have been shown to be more efficient at detecting overlapping protein complexes on large-scale PPI networks. However, CALM focuses only on the detection of overlapping protein complexes and tends to ignore local attachment proteins to the complex core. It also does not consider the common neighborhood and high-order common neighborhood similarity measures when calculating the initial weights of the PPI network. All of these factors influence the reliability of the PPI network and the detection of protein complexes, and may result in false-positive predictions. To address these limitations, we propose a special structural-based weighted network approach for protein complex analysis, called Weighted Edge, Core-Attachment, and Local Modularity (WECALM). Our contributions are as follows. First, we introduce a high-order similarity measure based on the Jaccard measure to compute the edge weights, which ensures the reliability of the PPI network. Second, we extend protein complex identification by using a weighted connectivity algorithm to discriminate and detect local attachment proteins to complex cores. Third, we extend the detection of protein complexes using the structural similarity measure concept. Fourth, we perform functional enrichment analysis by calculating the p-value of the detected complexes to validate their associated functions.
This paper is organized as follows. In Section 2, we provide a preliminary overview of our approach. In Section 3, we give a detailed computational description of our approach to the detection of protein complexes. In Section 4, we describe our experimental PPI datasets and evaluation criteria of our proposed approach. In Section 5, we present the experimental results and discuss them. Lastly, in Section 6, we draw some conclusions and outline our future research plans.

2. Preliminaries

In this section, we introduce some fundamental concepts. Generally, a PPI network can be represented as an undirected graph, either unweighted or weighted, denoted by $G = (V, E)$, where $V$ is the set of nodes denoting the proteins and $E$ is the set of edges corresponding to the interactions between pairs of proteins. In our approach, we consider the PPI network to be an undirected edge-weighted graph $G = (V, E, W)$, where $W$ denotes the weights on the edges, representing confidence scores in the range $(0, 1]$, and the function $W: E \to \mathbb{R}^{+}$ quantifies the affinity of the interaction between each pair of nodes or proteins (i.e., it maps each edge in $E$ to a weight). For a node $v$, $N(v)$ denotes the set of all neighboring nodes of $v$. The nodes (proteins) of the PPI graph model can be classified into four major classes with respect to protein complexes (groups of two or more proteins that are physically linked together through non-covalent interactions) according to [21,50,51] (see also Figure 1 and Figure 2). The first class is core nodes: a node is considered a core node of a complex if it shows a high degree of physical interaction; it has a relatively high weighted degree of direct physical interaction with the other core nodes within the complex and less interaction with nodes outside the complex; and the set of core nodes is unique to each complex. The second class is peripheral nodes: a node is considered a peripheral node of a complex if it has a close interaction with the complex core, and it is stable and directly interacts with the complex core. The third class is overlapping nodes: a protein is considered an overlapping protein of a complex if it has a higher degree than its neighboring nodes and acts as a betweenness node; it interacts closely with the complex core; and it belongs to more than one complex. The remaining proteins are classified as interspersed nodes, which are probably just noise in the PPI network.
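As an illustration of the graph model $G = (V, E, W)$, the following minimal sketch stores an undirected weighted PPI network as a nested dictionary. The helper names (`build_ppi_graph`, `neighbours`) and the toy protein names and scores are hypothetical and are not taken from the WECALM implementation.

```python
from collections import defaultdict


def build_ppi_graph(weighted_edges):
    """Build an undirected weighted graph as {protein: {neighbour: weight}}."""
    adj = defaultdict(dict)
    for v, u, w in weighted_edges:
        if not 0.0 < w <= 1.0:
            raise ValueError(f"confidence score {w!r} is outside (0, 1]")
        adj[v][u] = w  # store both directions: the graph is undirected
        adj[u][v] = w
    return dict(adj)


def neighbours(adj, v):
    """N(v): the set of proteins that interact directly with v."""
    return set(adj[v])


# Hypothetical toy interactions (protein names and scores are made up).
toy_edges = [("P1", "P2", 0.9), ("P2", "P3", 0.4), ("P1", "P3", 0.7), ("P3", "P4", 0.2)]
toy_graph = build_ppi_graph(toy_edges)
```

A dictionary of dictionaries keeps neighbour lookup and edge-weight lookup both constant time, which is convenient for the neighbourhood-based measures used in the rest of the method.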

3. Methods

In this section, we describe the seven main steps of our proposed WECALM algorithm: building a weighted PPI network; identifying overlapping structures; identifying seed proteins; identifying local modularity structures; identifying complex core structures; detecting attachment proteins to the complex core; and protein core attachment and protein complex formation.

3.1. Building Weighted PPI Network

In general, PPI networks obtained through various experimental techniques are typically noisy, and many interactions are presumed to be false positives [52,53]. As a result, the rate of false positives should be reduced. To address this challenge, preprocessing strategies based on the topological properties of PPI networks have been proposed for evaluating and eliminating potential false positives [54,55,56,57,58]. According to some experimental findings [59,60,61], neighbor information-based methods, which are used to evaluate PPIs with high confidence scores, are typically more reliable than other methods. Thus, in this study, to build a reliable weighted PPI network, we use Jaccard’s coefficient similarity ($J_s$) [62] to compute the protein interaction scores. The similarity between two neighboring proteins $v$ and $u$ is defined by
$$J_s(v,u) = \frac{|N(v) \cap N(u)|}{|N(v) \cup N(u)|} = \frac{|CN(v,u)|}{|N(v) \cup N(u)|}, \tag{1}$$
where $0 \leq J_s(v,u) \leq 1$, $I$ is the interaction between proteins $v$ and $u$, $CN(v,u)$ represents the set of common neighbors of proteins $v$ and $u$, $|N(v) \cap N(u)|$ represents the number of common neighbors of proteins $v$ and $u$, and $N(v) \cup N(u)$ represents the union of all the neighboring proteins of $v$ and $u$. Thus, with Equation (1), we can calculate the weight between two neighbouring proteins $v$ and $u$ by
$$w(v,u) = \begin{cases} 1 & \text{if } |CN(u,v)| \geq 1, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$
Based on this computation, the similarity of two adjacent proteins is higher if the two proteins share more common neighbors. On this basis, we propose a high-order similarity metric based on Jaccard’s coefficient to calculate the connectivity between the adjacent proteins $v$ and $u$ through their common neighbors. We define the common neighbors’ support using the formula
$$\rho(v,u) = J_s(v,u) \sum_{u \in CN(v,u)} w(v,u), \tag{3}$$
where $\rho$ is the common neighbor support of the weighted edge $(v,u)$ and $w$ is the weight of the edge between proteins $v$ and $u$ stated in the preliminaries in Section 2. Thus, with Equations (2) and (3), we can define the high-order similarity score by the formula
$$\phi(v,u) = \frac{J_s(v,u) + \rho(v,u)}{1 + \rho(v,u)}, \tag{4}$$
where $\phi(v,u)$ is the high-order similarity score for the common neighbors of two adjacent proteins, and it takes values in the range $[0, 1)$. For the rest of the paper, $\phi$ defines the edge weights $W$.
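The edge-weighting step of Equations (1) to (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the comparison in Equation (2) is taken to be "at least one common neighbour", and the sum in Equation (3) is read literally, so that $w(v,u)$ is added once per common neighbour. The function names and the toy graph are hypothetical.

```python
def jaccard(adj, v, u):
    """Eq. (1): shared neighbours divided by the union of neighbours."""
    union = adj[v] | adj[u]
    return len(adj[v] & adj[u]) / len(union) if union else 0.0


def high_order_similarity(adj, v, u):
    """Eq. (4), built from Eqs. (1)-(3) under the assumptions stated above."""
    common = adj[v] & adj[u]
    js = jaccard(adj, v, u)
    w = 1.0 if len(common) >= 1 else 0.0  # Eq. (2), ">= 1" is an assumption
    rho = js * len(common) * w            # Eq. (3), literal reading of the sum
    return (js + rho) / (1.0 + rho)


def weight_network(adj):
    """Assign the high-order similarity score to every edge of the graph."""
    return {frozenset((v, u)): high_order_similarity(adj, v, u)
            for v in adj for u in adj[v]}


# Hypothetical toy graph: a triangle a-b-c with a pendant protein d on a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
weights = weight_network(toy)
```

In this toy graph the pendant edge between a and d receives weight 0 because the two proteins share no neighbour, which is the intended filtering effect on potentially unreliable interactions.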

3.2. Identifying Overlapping Structures

To identify the overlapping structure, let $v \in V$, let $N(v) = \{u \mid u \in V, (v,u) \in E\}$ be the set of neighbours of protein $v$, and let $\deg(v) = |N(v)|$ be the number of neighbours of protein $v$. Given a protein $v \in V$, we define the neighborhood network $GN_v = (V_v, E_v)$ as the sub-network of $G$ formed by protein $v$ and its direct neighbours. Hence $V_v = \{v\} \cup \{u \mid u \in V, (v,u) \in E\}$ and $E_v = \{(u_i, u_j) \mid (u_i, u_j) \in E,\ u_i, u_j \in V_v\}$. Thus, the weighted degree average of a local neighborhood sub-network $GN_v$ is defined by the equation
$$\mathrm{Avg}(\deg(GN_v)) = \frac{\sum_{u \in V_v} \deg(u)}{|V_v|}. \tag{5}$$
To calculate the global importance of a protein $v$, we consider the shortest paths between all protein pairs that pass through the target protein, and define the betweenness of node $v$ by
$$B(v) = \sum_{\substack{s \neq v,\ t \neq v \\ s, t, v \in V}} \frac{\delta_{s,t}(v)}{\delta_{s,t}}, \tag{6}$$
where $\delta_{s,t}$ is the number of shortest paths from protein $s$ to $t$ and $\delta_{s,t}(v)$ is the number of shortest paths from protein $s$ to $t$ that pass through the intermediate (bridge) protein $v$. Thus, using Equation (6), the average betweenness of the local neighborhood sub-network $GN_v$ is calculated by
$$\mathrm{Avg}(B(GN_v)) = \frac{\sum_{u \in V_v} B(u)}{|V_v|}, \tag{7}$$
where $\mathrm{Avg}(B(GN_v))$ is the average of $B(u)$ over all $u \in V_v$ in the local neighborhood sub-network $GN_v$, and $|V_v|$ denotes the number of nodes in $GN_v$. Then, using Equations (5) and (7), we define the overlapping protein structure in the PPI network by
$$\mathrm{OVP}(GN_v) = \begin{cases} 1 & \text{if } \deg(v) \geq \mathrm{Avg}(\deg(GN_v)) \,\wedge\, B(v) > \mathrm{Avg}(B(GN_v)), \\ 0 & \text{otherwise,} \end{cases} \tag{8}$$
where $GN_v$ is a candidate overlapping protein complex if $\mathrm{OVP}(GN_v) = 1$. To measure the degree of overlap between sets of candidate overlapping protein complexes, we calculate the overlapping score between two sets by
$$OS(\mathrm{OVP}_i, \mathrm{OVP}_j) = \frac{|\mathrm{OVP}_i \cap \mathrm{OVP}_j|}{|\mathrm{OVP}_i| + |\mathrm{OVP}_j| - |\mathrm{OVP}_i \cap \mathrm{OVP}_j|}, \tag{9}$$
where $OS(\mathrm{OVP}_i, \mathrm{OVP}_j)$ is the overlapping score between $\mathrm{OVP}_i$ and $\mathrm{OVP}_j$, ranging over $[0, 1]$, in which 0 indicates no overlap between the sets and 1 indicates that the sets are identical; $|\mathrm{OVP}_i|$ and $|\mathrm{OVP}_j|$ denote the sizes of the sets $\mathrm{OVP}_i$ and $\mathrm{OVP}_j$, respectively; and $\mathrm{OVP}_i \cap \mathrm{OVP}_j$ denotes the intersection of the two sets. In this paper, we identify a candidate overlapping protein complex when $OS(\mathrm{OVP}_i, \mathrm{OVP}_j) \geq \pi$, where $\pi$ is a predefined overlap threshold in the range $(0, 1]$.
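The overlap test of Equations (5) to (9) can be sketched as below. This is an illustrative reading rather than the authors' code: betweenness is computed with Brandes' algorithm on the unweighted graph and counts each unordered pair of proteins once, and the first comparison in Equation (8) is assumed to be "greater than or equal". All function names and the toy graph are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Eq. (6) via Brandes' algorithm for an unweighted, undirected graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1.0, 0
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2.0 for v, b in bc.items()}  # each pair was seen twice


def is_overlapping(adj, bc, v):
    """Eq. (8): compare v with the averages over its closed neighbourhood."""
    closed = {v} | adj[v]
    avg_deg = sum(len(adj[u]) for u in closed) / len(closed)  # Eq. (5)
    avg_bc = sum(bc[u] for u in closed) / len(closed)         # Eq. (7)
    return len(adj[v]) >= avg_deg and bc[v] > avg_bc


def overlap_score(a, b):
    """Eq. (9): shared members divided by the size of the union."""
    inter = len(a & b)
    denom = len(a) + len(b) - inter
    return inter / denom if denom else 0.0


# Hypothetical toy graph: two triangles that share the protein x.
toy = {"x": {"a", "b", "c", "d"}, "a": {"b", "x"}, "b": {"a", "x"},
       "c": {"d", "x"}, "d": {"c", "x"}}
toy_bc = betweenness(toy)
```

In the toy graph only the shared protein x passes the test, since it is the only node that both exceeds the local average degree and lies on shortest paths between the two triangles.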

3.3. Identifying Seed Proteins

The identification of seed proteins in the PPI network is essential for the detection of protein complexes. Here, we introduce the concepts of weighted degree and clustering coefficient as a strategy for identifying seed proteins. For this, we define the weighted node degree by
$$\deg_w(v) = \sum_{u \in N(v);\ (v,u) \in E} w(v,u), \tag{10}$$
where $\deg_w(v)$ is the weighted degree of protein $v$ and $w$ is the edge weight stated in the preliminaries in Section 2. To determine the seed proteins, we consider the small-world phenomenon model [63,64], which corresponds to the local weighted clustering coefficient $\lambda$. We define $\lambda_v$ of protein $v$ as a measure of the local connectivity among its immediate neighbors, and $\lambda_w(v)$ of protein $v$ on the weighted sub-network $GN_v$ formed by $N_v$ and the corresponding weighted edges. Thus, we can calculate the clustering coefficient of protein $v$ by
$$\lambda_w(v) = \frac{\sum_{u_i \in V_v} \sum_{u_j \in N(u_i) \cap V_v} w(u_i, u_j)}{|N_v| \times (|N_v| - 1)}, \tag{11}$$
where $\lambda_w(v)$ is the clustering coefficient of protein $v$ and $\lambda_w(v) \in (0, 1]$. Using Equation (11), we can calculate the average clustering coefficient of the sub-network $GN_v$ by
$$\mathrm{Avg}(\lambda_w(v)) = \frac{\sum_{u \in V_v} \lambda_w(u)}{|V_v|}, \tag{12}$$
where $\mathrm{Avg}(\lambda_w(v))$ is the average local weighted clustering coefficient over the sub-network of protein $v$, and $|V_v|$ is the number of proteins in that sub-network, that is, $v$ and all its local neighbours. With Equations (7) and (12), the seed protein ($S$) is defined as
$$S(v) = \begin{cases} 1 & \text{if } \lambda_w(v) \geq \mathrm{Avg}(\lambda_w(v)) \,\wedge\, B(v) \geq \mathrm{Avg}(B(GN_v)), \\ 0 & \text{otherwise,} \end{cases} \tag{13}$$
where node $v$ is a selected seed protein if $S(v) = 1$.
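A possible reading of the seed-selection step (Equations (10) to (13)) is sketched below. Several points are assumptions rather than facts from the paper: the normaliser of the clustering coefficient uses the size of the closed neighbourhood $V_v$ so that the value stays within $(0, 1]$, both comparisons in the seed test are taken as "greater than or equal", and betweenness values are passed in precomputed. The function names and the toy network are hypothetical.

```python
def weighted_degree(adj, v):
    """Eq. (10): sum of the weights of the edges incident to v."""
    return sum(adj[v].values())


def weighted_clustering(adj, v):
    """Eq. (11), normalised by |V_v| * (|V_v| - 1) (an assumption)."""
    closed = {v} | set(adj[v])
    n = len(closed)
    if n < 2:
        return 0.0
    # The double sum visits every internal edge once per direction.
    total = sum(w for ui in closed for uj, w in adj[ui].items() if uj in closed)
    return total / (n * (n - 1))


def is_seed(adj, bc, v):
    """Eq. (13) with both comparisons assumed to be '>='."""
    closed = {v} | set(adj[v])
    avg_lam = sum(weighted_clustering(adj, u) for u in closed) / len(closed)  # Eq. (12)
    avg_bc = sum(bc[u] for u in closed) / len(closed)                         # Eq. (7)
    return weighted_clustering(adj, v) >= avg_lam and bc[v] >= avg_bc


# Hypothetical toy network: a triangle whose edges all have weight 0.5.
tri = {"a": {"b": 0.5, "c": 0.5}, "b": {"a": 0.5, "c": 0.5}, "c": {"a": 0.5, "b": 0.5}}
tri_bc = {"a": 0.0, "b": 0.0, "c": 0.0}  # no node lies between two others
```

With this normalisation a fully connected neighbourhood whose edges all have weight 1 reaches the maximum value of 1, and lower edge weights scale the coefficient down proportionally.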

3.4. Identifying Local Modularity Structures

To identify local modularity structures, we take the seed proteins calculated by Equation (13) as initial nodes and generate clusters by first computing the support function, followed by the local modularity function. Hence, using Equation (4), we first calculate the similarity score between a seed protein $v$ and its immediate neighbors, and then gradually add neighboring proteins with the help of the support function and the local modularity function in order to generate a cluster $K$ as a sub-network. To prioritize each neighboring protein $u$, we first calculate the support function, which measures how close the protein $u$ is to the cluster $K$, using the formula
$$supp(u, K) = \frac{\sum_{u' \in K \cap N(u)} w(u, u')}{\sum_{u' \in N(u)} w(u, u')}, \tag{14}$$
where $supp(u, K)$ is the support function in the range $[0, 1)$, $u \notin K$, $\sum_{u' \in K \cap N(u)} w(u, u')$ is the sum of the weights of the edges linking protein $u$ to $K$, and $\sum_{u' \in N(u)} w(u, u')$ is the total weighted degree of protein $u$. This prioritizing approach can be extended iteratively to the neighbors of any initial cluster $K$. Thus, in each iteration step, according to the priority of the neighbors, the decision to join the cluster is made by the local modularity function.
Given a sub-network $K$ of $G$, we define the weighted in-degree $w_{in}(K)$ as the sum of the weights of the edges linking proteins within $K$, and the weighted out-degree $w_{out}(K)$ as the sum of the weights of the edges linking proteins in $K$ to proteins in the rest of the network, $G \setminus K$. Thus, we define $w_{in}(K)$ and $w_{out}(K)$ by
$$w_{in}(K) = \sum_{\substack{u, u' \in K \\ w(u, u') \in W}} w(u, u'), \tag{15}$$
and
$$w_{out}(K) = \sum_{\substack{u \in K,\ u' \notin K \\ w(u, u') \in W}} w(u, u'), \tag{16}$$
where $w$ represents the weight of the edges in sub-network $K$. To determine the local modularity structure in sub-network $K$, we define a modular uncertainty correction threshold value $\eta$ in the interval $[0, 1)$. Using Equations (15) and (16), we can define the local modularity of sub-network $K$ by
$$Q(\eta, K) = \frac{w_{in}(K)}{w_{in}(K) + w_{out}(K) + \eta \cdot |V_K|^{\alpha}}, \tag{17}$$
where $Q(\eta, K)$ takes a value in $(0, 1)$; $|V_K|$ is the total number of proteins in $K$; $\eta$ is the predefined modular uncertainty correction parameter in the range $(0, 1]$; and $\alpha$ is the ratio of the internal interaction to the total interaction in the community. We set $\alpha = 1.0$ in order to favour a high $w_{in}(K)$ and a low $w_{out}(K)$, which makes the detection of local modularity structures efficient. A neighboring node is added to $K$ if extending $K$ by that node increases the value of the local modularity function.
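The cluster-growing procedure described above can be sketched as a greedy loop. This is a simplified illustration, not the WECALM implementation: each internal edge is counted once in $w_{in}(K)$, candidates are tried in order of decreasing support with ties broken by name, and the default value `eta=0.5` is a placeholder rather than a value taken from the paper. The function names and the toy network are hypothetical.

```python
def support(adj, u, cluster):
    """Eq. (14): share of u's total edge weight that points into the cluster."""
    total = sum(adj[u].values())
    inside = sum(w for x, w in adj[u].items() if x in cluster)
    return inside / total if total else 0.0


def local_modularity(adj, cluster, eta=0.5, alpha=1.0):
    """Eq. (17); eta=0.5 is a placeholder, not a value from the paper."""
    # Internal edges are visited from both endpoints, hence the division by 2.
    w_in = sum(w for u in cluster for x, w in adj[u].items() if x in cluster) / 2.0
    w_out = sum(w for u in cluster for x, w in adj[u].items() if x not in cluster)
    denom = w_in + w_out + eta * len(cluster) ** alpha
    return w_in / denom if denom else 0.0


def grow_cluster(adj, seed, eta=0.5, alpha=1.0):
    """Greedily add the best-supported neighbour while Q keeps increasing."""
    cluster = {seed}
    improved = True
    while improved:
        improved = False
        frontier = {x for u in cluster for x in adj[u]} - cluster
        ranked = sorted(frontier, key=lambda u: (-support(adj, u, cluster), u))
        current = local_modularity(adj, cluster, eta, alpha)
        for u in ranked:
            if local_modularity(adj, cluster | {u}, eta, alpha) > current:
                cluster.add(u)
                improved = True
                break
    return cluster


# Hypothetical toy network: a strong triangle a-b-c with a weak link to d-e.
toy = {"a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0, "c": 1.0},
       "c": {"a": 1.0, "b": 1.0, "d": 0.1}, "d": {"c": 0.1, "e": 1.0},
       "e": {"d": 1.0}}
```

Starting from the seed a, the loop absorbs b and c and then stops, because adding d would lower the local modularity: d is tied more strongly to e than to the triangle.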

3.5. Identifying Complex Core Structure

To detect the complex core, let $v \in V$, let $N(v)$ be the set of all immediate neighbor proteins of $v$, and let the structural neighborhood of protein $v$ be given by $N_s(v) = \{v\} \cup N(v)$, so that $N_s(v)$ contains protein $v$ and its direct neighbors. Now, we can calculate the structural similarity between two neighboring proteins $v$ and $w$ by
$$SS(v, w) = \frac{|N_s(v) \cap N_s(w)|}{\sqrt{|N_s(v)|\,|N_s(w)|}}, \tag{18}$$
where the structural similarity $SS(v, w)$ is in the range $(0, 1]$. Here, a high $SS(v, w)$ between two proteins indicates that the two proteins share a similar neighborhood structure. Moreover, the structural similarity is symmetric, as $SS(v, w) = SS(w, v)$. Based on $SS(v, w)$, we mine a sub-network in the neighborhood network $GN_v$, which we refer to as the preliminary complex core. We introduce $\omega$ as the default threshold value for the structural similarity score between the seed protein $v$ and each neighbor $w \in N(v)$ from the identified preliminary protein complex $C_p(v)$. Hence, using Equation (18), given the preliminary complex $C_p(v)$ and the structural similarity threshold $\omega$, we can calculate the preliminary complex core of protein $v$ by
$$Core(\omega, C_p(v)) = \{w \in C_p(v) : SS(v, w) \geq \omega\}, \tag{19}$$
where $Core(\omega, C_p(v))$ is the preliminary complex core; $\omega$ is a default threshold value in the range $(0, 1]$; and $C_p(v)$ denotes the preliminary complex of protein $v$. Note that protein $v$ is included in $Core(\omega, C_p(v))$.
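The core-mining step can be illustrated with the sketch below. It rests on two assumptions that are not confirmed by the text: the structural similarity is normalised by the geometric mean of the two structural neighbourhood sizes, and the threshold test is "greater than or equal". The default `omega=0.5`, the function names, and the toy graph are hypothetical.

```python
import math


def structural_neighbourhood(adj, v):
    """N_s(v): the protein v together with its direct neighbours."""
    return {v} | adj[v]


def structural_similarity(adj, v, w):
    """Eq. (18), assuming normalisation by the geometric mean of the sizes."""
    nv = structural_neighbourhood(adj, v)
    nw = structural_neighbourhood(adj, w)
    return len(nv & nw) / math.sqrt(len(nv) * len(nw))


def preliminary_core(adj, v, prelim_complex, omega=0.5):
    """Eq. (19): neighbours of v in the preliminary complex that pass omega."""
    core = {w for w in prelim_complex
            if w in adj[v] and structural_similarity(adj, v, w) >= omega}
    core.add(v)  # the seed protein always belongs to its own core
    return core


# Hypothetical toy graph: triangle a-b-c, with a chain a-d-e hanging off a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"},
       "d": {"a", "e"}, "e": {"d"}}
```

For the seed a with a threshold of 0.7, the triangle partners b and c stay in the core while d is dropped, since d shares only part of its neighbourhood with a.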

3.6. Detection of Attachment Proteins to Complex Core

Generally, attachment proteins exist in two forms, namely overlapping and peripheral protein attachments [65]. Therefore, to identify protein attachments to the complex core, consider the identified preliminary protein complex $C_p(v)$, the preliminary complex core as a sub-network $Core(\omega, C_p(v)) = (V_c, E_c)$, and the set $CAP(C_p(v))$ of candidate attachment proteins as a subset of the neighbors of $Core(\omega, C_p(v))$. Here, our two main objectives are: first, to find a subset $CAP(C_p(v)) \subseteq V$ of the PPI network in which each protein $p \in CAP(C_p(v))$ is a candidate attachment protein of the identified preliminary protein complex $C_p(v)$; and second, to predict the category of each protein in $CAP(C_p(v))$.
To achieve these two objectives, we set two basic conditions: (1) the attached proteins must interact with the complex core directly; (2) the attached proteins must be connected to at least two core proteins, since protein complexes are made of two or more proteins [66]. Therefore, if a protein $p$ belongs to the neighborhood of $Core(\omega, C_p(v))$ and satisfies $|N(p) \cap V_c| \geq 2$, then it is selected for $CAP(C_p(v))$. Below we provide a detailed description of the calculation of the overlapping and peripheral protein attachments to the complex core.

3.6.1. Overlapping Attachment Proteins

To identify overlapping protein attachments, let $CAP(C_p(v))$ be the set of candidate attachment proteins of the preliminary complex $C_p(v)$, and let $\mathrm{OVP}(C_p(v))$ be the set of candidate overlapping proteins attached to the preliminary complex $C_p(v)$. We define the weighted connectivity of a candidate overlapping attachment protein $p \in \mathrm{OVP}(C_p(v))$ interacting with the proteins in the complex core $Core(\omega, C_p(v))$ by
$$d_w(p, Core(\omega, C_p(v))) = \sum_{t \in V_c} w(p, t). \tag{20}$$
Next, we calculate the average weight of interaction with the complex core $Core(\omega, C_p(v))$ over all candidate overlapping proteins $p$ by the formula
$$\mathrm{Avg}(d_w(\mathrm{OVP}(C_p(v)))) = \frac{\sum_{p \in \mathrm{OVP}(C_p(v))} d_w(p, Core(\omega, C_p(v)))}{|\mathrm{OVP}(C_p(v))|}. \tag{21}$$
Using Equations (20) and (21), we define the score of a candidate overlapping protein attachment to the complex core by
$$\mathrm{OVP}(p, Core(C_p(v))) = \begin{cases} 1 & \text{if } d_w(p, Core(\omega, C_p(v))) \geq \mathrm{Avg}(d_w(\mathrm{OVP}(C_p(v)))), \\ 0 & \text{otherwise.} \end{cases} \tag{22}$$
Then, $\mathrm{OVP}(Core(C_p(v)))$ denotes the set of local overlapping attachment proteins $p$ for which $\mathrm{OVP}(p, Core(C_p(v))) = 1$.

3.6.2. Peripheral Attachment Protein

Here, we consider the set of candidate peripheral proteins $PP(C_p(v))$, obtained as the difference $CAP(C_p(v)) \setminus \mathrm{OVP}(Core(C_p(v)))$. Given the weighted connectivity $d_w(p, Core(\omega, C_p(v)))$ of each protein $p \in PP(C_p(v))$ with respect to the complex core, we define the average weight of interaction of all candidate peripheral proteins with $Core(\omega, C_p(v))$ by
$$\mathrm{Avg}(d_w(PP(C_p(v)))) = \frac{\sum_{p \in PP(C_p(v))} d_w(p, Core(\omega, C_p(v)))}{|PP(C_p(v))|}. \tag{23}$$
Hence, using Equation (23), we define the score of a peripheral attachment protein by
$$PP(p, Core(C_p(v))) = \begin{cases} 1 & \text{if } d_w(p, Core(\omega, C_p(v))) \geq \mathrm{Avg}(d_w(PP(C_p(v)))), \\ 0 & \text{otherwise.} \end{cases} \tag{24}$$
Then, $PP(Core(C_p(v)))$ denotes the set of local peripheral attachment proteins $p$ for which $PP(p, Core(C_p(v))) = 1$.
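The attachment step of Section 3.6 can be sketched as follows. The sketch assumes that the candidate overlapping proteins are those candidate attachment proteins already flagged as overlapping by Equation (8), and that the comparisons with the averages are "greater than or equal"; neither point is confirmed by the text. All names and the toy network are hypothetical.

```python
def build(edges):
    """Undirected weighted adjacency {protein: {neighbour: weight}}."""
    adj = {}
    for v, u, w in edges:
        adj.setdefault(v, {})[u] = w
        adj.setdefault(u, {})[v] = w
    return adj


def candidate_attachments(adj, core):
    """CAP: proteins outside the core linked to at least two core proteins."""
    outside = {x for c in core for x in adj[c]} - core
    return {p for p in outside if len(set(adj[p]) & core) >= 2}


def core_weight(adj, p, core):
    """Eq. (20): total weight of the edges between p and the core."""
    return sum(w for t, w in adj[p].items() if t in core)


def select_above_average(adj, proteins, core):
    """Eqs. (21)-(24): keep proteins whose core weight reaches the average."""
    if not proteins:
        return set()
    scores = {p: core_weight(adj, p, core) for p in proteins}
    avg = sum(scores.values()) / len(scores)
    return {p for p, s in scores.items() if s >= avg}


def attachments(adj, core, flagged_overlapping):
    """Return (overlapping, peripheral) attachment proteins of the core."""
    cap = candidate_attachments(adj, core)
    ovp = select_above_average(adj, cap & flagged_overlapping, core)
    pp = select_above_average(adj, cap - ovp, core)
    return ovp, pp


# Hypothetical toy network around the core {a, b, c}.
toy = build([("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
             ("x", "a", 0.8), ("x", "b", 0.6),
             ("y", "a", 0.2), ("y", "b", 0.2), ("y", "c", 0.2),
             ("z", "a", 0.9), ("q", "b", 0.3), ("q", "c", 0.1)])
toy_core = {"a", "b", "c"}
```

In the toy network z is discarded because it touches only one core protein, x is kept as an overlapping attachment, and of the remaining candidates only y reaches the average connectivity to the core.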

3.7. Protein Core Attachment and Protein Complex Formation

To detect protein complex formation, we first compute the core-attachment proteins by aggregating the overlapping and peripheral attachment proteins to generate the overall set of attachment proteins of the complex core, defined by the formula
$$A(Core(C_p(v))) = \mathrm{OVP}(Core(C_p(v))) \cup PP(Core(C_p(v))), \tag{25}$$
where $A(Core(C_p(v)))$ is the overall set of local attachment proteins of the complex core $Core(C_p(v))$. Next, the protein complex is formed by merging the preliminary complex core (see Equation (19)) with the set of detected candidate attachment proteins (see Equation (25)). Hence, using Equations (22) and (25), we define the score of final protein complex formation by
$$C_P(v) = \begin{cases} 1 & \text{if } |\mathrm{OVP}(Core(C_p(v)))| \geq 2 \,\wedge\, |Core(\omega, C_p(v))| > 3\,|A(Core(C_p(v)))|, \\ 0 & \text{otherwise.} \end{cases} \tag{26}$$
Therefore, we define the set of distinct protein complexes using the formula
$$\mathrm{CP}(v) = Core(\omega, C_p(v)) \cup A(Core(C_p(v))), \tag{27}$$
where the protein complexes above are defined only if $C_P(v) = 1$.

4. Datasets and Evaluation Criteria

In this section, we will provide a general description of the experimental PPI datasets and evaluation criteria used to validate and compare the performance of our WECALM approach.

4.1. Experimental PPI Datasets

In our study, three freely accessible PPI networks extracted from S. cerevisiae were used for simulation: the DIP [67] database, which documents experimentally determined protein–protein interactions; the BioGRID [68] database of physical and genetic interactions; and the Yeast database [17,69]. A brief description of the datasets used for the simulation is given in Table 1. The data from Human [69] was used to build Human PPI networks.
For complex simulation studies, we used the yeast reference datasets CYC2008 [70] and NewMIPS [71]. For human complexes, we used data from the CORUM [69], PINdb [72], and KEGG modules [73] databases. In addition, for functional enrichment analysis, we utilised Aloy [50] and SGD [74] for Gene Ontology. Table 2 lists the details of the benchmark protein complexes employed in this study.

4.2. Evaluation Criteria

We compared the identified protein complexes with the reference complexes to determine how well the algorithms identify protein complexes. To make comprehensive and detailed comparisons, we utilized a wide range of evaluation metrics such as recall, precision, F-measure, coverage rate, and others, as suggested by related studies [20,22,23,75]. In the subsection below, we provide a detailed description of these metrics.

4.2.1. Computation of Recall, Precision and F-Measure

To calculate the evaluation metrics, we must first compute the similarity between detected and reference complexes based on neighborhood affinity in order to measure their closeness [53,76,77]. Hence, let $P = \{p_1, p_2, \ldots, p_k\}$ be the set of detected protein complexes $C_P(v)$ and $R = \{r_1, r_2, \ldots, r_l\}$ be the set of reference protein complexes. Here, we denote the detected and reference protein complexes by $p_i$ and $r_j$, respectively. Thus, the neighborhood affinity between a detected and a reference protein complex is calculated as follows:
$$NA(p_i, r_j) = \frac{|N(p_i) \cap N(r_j)|^2}{|N(p_i)| \times |N(r_j)|}, \tag{28}$$
where $NA(p_i, r_j)$ is the neighborhood affinity in the range $[0, 1)$, $|N(p_i)|$ represents the size of the detected complex, $|N(r_j)|$ represents the size of the reference complex, and $|N(p_i) \cap N(r_j)|$ denotes the number of proteins common to the detected and reference complexes. The larger $NA(p_i, r_j)$ is, the closer the two complexes are. Given a threshold $\kappa$, if $NA(p_i, r_j) \geq \kappa$, then $p_i$ is considered to match $r_j$; we set $\kappa = 0.2$ according to [22,53,78]. From Equation (28), we can calculate recall, precision and F-measure. Let $N_P = |\{p \mid p \in P, \exists r \in R, NA(p, r) \geq \kappa\}|$ be the number of detected complexes that match at least one reference complex, and let $N_R = |\{r \mid r \in R, \exists p \in P, NA(r, p) \geq \kappa\}|$ be the number of reference complexes that match at least one detected complex. Now, we define recall and precision using the formulas
$$Recall = \frac{|\{r \mid r \in R, \exists p \in P, NA(r, p) \geq \kappa\}|}{|R|} = \frac{N_R}{|R|}, \tag{29}$$
and
$$Precision = \frac{|\{p \mid p \in P, \exists r \in R, NA(p, r) \geq \kappa\}|}{|P|} = \frac{N_P}{|P|}. \tag{30}$$
In general, smaller protein complexes yield higher precision and larger protein complexes yield higher recall; hence, the two metrics often have an inverse relationship. Since the F-measure is the harmonic mean of recall and precision, using Equations (29) and (30), we can define the F1-measure by
$$F1\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}. \tag{31}$$
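The matching-based metrics of Equations (28) to (31) can be computed with a few lines of code. The sketch below represents each complex as a set of protein identifiers and assumes the comparison with $\kappa$ is "greater than or equal"; the function names and the toy complexes are hypothetical.

```python
def neighbourhood_affinity(p, r):
    """Eq. (28): squared overlap divided by the product of the two sizes."""
    if not p or not r:
        return 0.0
    inter = len(p & r)
    return inter * inter / (len(p) * len(r))


def precision_recall_f1(detected, reference, kappa=0.2):
    """Eqs. (29)-(31) for lists of complexes given as sets of proteins."""
    n_p = sum(1 for p in detected
              if any(neighbourhood_affinity(p, r) >= kappa for r in reference))
    n_r = sum(1 for r in reference
              if any(neighbourhood_affinity(p, r) >= kappa for p in detected))
    precision = n_p / len(detected) if detected else 0.0
    recall = n_r / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy complexes, with integers standing in for proteins.
toy_detected = [{1, 2, 3}, {7, 8}]
toy_reference = [{1, 2, 3, 4}, {5, 6}]
```

In the toy data one of the two detected complexes matches a reference complex (affinity 0.75) and one of the two reference complexes is recovered, so precision, recall and F1 all equal 0.5.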

4.2.2. Coverage Rate

To evaluate the performance of our proposed WECALM algorithm and its peer methods, it is necessary to determine how many proteins in the reference complexes are covered by the detected complexes, by computing the coverage rate (CR) [75,77,79]. To calculate the coverage rate, let $P$ and $R$ be the sets of detected and reference protein complexes, respectively. We represent the relationship between the reference complexes and the detected complexes by an $|R| \times |P|$ matrix $M$, where, for the $i$th reference complex and the $j$th detected complex, $\max\{M_{ij}\}$ is the maximum number of proteins sharing a similar functional relationship. We define the coverage rate by
$$CR = \frac{\sum_{i=1}^{|R|} \max\{M_{ij}\}}{\sum_{i=1}^{|R|} N_i}, \tag{32}$$
where $CR$ is the coverage rate and $N_i$ denotes the number of proteins in the $i$th reference complex.
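A compact way to compute the coverage rate of Equation (32) is sketched below. It assumes that $M_{ij}$ is the number of proteins shared by the $i$th reference complex and the $j$th detected complex, and that the maximum is taken over the detected complexes; the function name and the toy complexes are hypothetical.

```python
def coverage_rate(detected, reference):
    """Eq. (32): best per-reference overlap divided by total reference size."""
    covered = sum(max((len(r & p) for p in detected), default=0)
                  for r in reference)
    total = sum(len(r) for r in reference)
    return covered / total if total else 0.0


# Hypothetical toy complexes, with integers standing in for proteins.
toy_reference = [{1, 2, 3, 4}, {5, 6}]
toy_detected = [{1, 2, 3}, {5, 9}]
```

Here the first reference complex is covered by three of its four proteins and the second by one of its two, which gives a coverage rate of 4/6.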

4.2.3. Maximum Matching Ratio

The maximum match ratio, or MMR, is a metric based on the maximum one-to-one mapping between the detected and reference complexes. MMR directly penalizes a reference complex that has been split into two or more parts in the detected set because only one of these parts is permitted to match the correct reference complex. MMR offers a natural, simple method for comparing detected complexes to reference complexes [20,80]. We compute the MMR using a weighted edge between the detected and the reference complexes calculated based on the neighborhood affinity score defined in Equation (28). That is, the maximum match ratio is
$$MMR = \frac{\sum_{i=1}^{|R|} \max_{j=1}^{n} NA\{p_i, r_j\}}{\sum_{i=1}^{|R|} N_i},$$
where $NA\{p_i, r_j\}$ is the neighborhood affinity score; $|R|$ is the number of reference complexes; $n$ is the number of detected complexes; $j$ is a member of the detected complexes; $N_i$ is the number of proteins in the i-th reference complex; $r_j$ is the j-th reference complex; and $p_i$ is the i-th detected complex.

4.2.4. Separation and ACC

To avoid the case where the proteins of a reference complex are matched with several detected protein complexes, we used Separation (Sep) to measure the one-to-one correspondence between detected protein complexes and reference protein complexes [20]. We define Separation by
$$Sep_{p_i} = \frac{\sum_{i=1}^{|R|} \sum_{j=1}^{m} Sep_{ij}}{|R|}, \qquad Sep_{r_j} = \frac{\sum_{j=1}^{m} \sum_{i=1}^{|R|} Sep_{ij}}{m}, \qquad Sep = \sqrt{Sep_{p_i} \times Sep_{r_j}},$$
where
$$Sep_{ij} = \frac{(t_{ij})^2}{\sum_{i=1}^{|R|} t_{ij} \, \sum_{j=1}^{m} t_{ij}},$$
$|R|$ is the number of reference complexes, $m$ is the number of detected complexes, $t_{ij}$ denotes the degree of intersection between the i-th reference complex and the j-th detected complex, and $N_i$ is the number of proteins within the i-th reference complex. To quantify the quality of the detected protein complexes, we compute the geometric mean of the sensitivity (Sn) and the positive predictive value (PPV) to obtain the accuracy (ACC) [20]. To measure ACC, we used the following formulas:
$$Sn = \frac{\sum_{i=1}^{|R|} \max_{j=1}^{m}\{t_{ij}\}}{\sum_{i=1}^{|R|} N_i}, \qquad PPV = \frac{\sum_{j=1}^{m} \max_{i=1}^{|R|}\{t_{ij}\}}{\sum_{j=1}^{m} \sum_{i=1}^{|R|} t_{ij}}, \qquad ACC = \sqrt{Sn \times PPV}.$$
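A minimal sketch of how Separation and ACC could be computed from the overlap counts is given below. It assumes that the degree of intersection between a reference complex and a detected complex is simply the number of proteins they share, that both the overall Separation and ACC are geometric means (square roots of the products), and that both input lists are non-empty. All function names are hypothetical.

```python
from math import sqrt


def overlap_matrix(reference, detected):
    """t[i][j] = number of proteins shared by reference i and detected j (assumed meaning of t_ij)."""
    return [[len(set(r) & set(p)) for p in detected] for r in reference]


def separation(reference, detected):
    """Geometric mean of the reference-wise and detected-wise separation."""
    t = overlap_matrix(reference, detected)
    row_sums = [sum(row) for row in t]
    col_sums = [sum(col) for col in zip(*t)]
    total = 0.0
    for i, row in enumerate(t):
        for j, value in enumerate(row):
            if value:  # a non-zero entry implies non-zero row and column sums
                total += value * value / (row_sums[i] * col_sums[j])
    return sqrt((total / len(reference)) * (total / len(detected)))


def accuracy(reference, detected):
    """ACC = sqrt(Sn * PPV)."""
    t = overlap_matrix(reference, detected)
    # Sn: best overlap per reference complex over the total reference size.
    sn = sum(max(row) for row in t) / sum(len(r) for r in reference)
    # PPV: best overlap per detected complex over the sum of all overlaps.
    all_overlaps = sum(sum(row) for row in t)
    ppv = sum(max(col) for col in zip(*t)) / all_overlaps if all_overlaps else 0.0
    return sqrt(sn * ppv)


if __name__ == "__main__":
    reference = [{"A", "B", "C", "D"}, {"E", "F"}]
    detected = [{"A", "B", "C"}, {"E", "F", "G"}]
    print(separation(reference, detected), accuracy(reference, detected))
```

In this toy case each reference complex overlaps exactly one detected complex, so the Separation is 1, while Sn = 5/6 and PPV = 1 give ACC = sqrt(5/6).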

4.2.5. Functional Enrichment Analysis

Because the known protein complexes obtained from laboratory-based experiments are often insufficient or incomplete, it is necessary to annotate the biological function of the detected complexes by computing the p-value and performing a Gene Ontology functional enrichment analysis as a confirmatory test of their biological significance [9,11,81]. To calculate the significance of the biological function, we define the p-value by
$$p\text{-value} = 1 - \sum_{i=0}^{m-1} \frac{\binom{F}{i} \binom{N-F}{C-i}}{\binom{N}{C}},$$
where $m$ is the number of observed proteins in the functional group of the detected complex, $N$ is the total number of proteins in the PPI network, $C$ is the size of the detected protein complex, and $F$ is the size of the functional group. Note that in our analysis the p-value is calculated based on the biological process term descriptions (or ontologies), and the smaller the p-value, the greater the biological significance of the protein complex. Hence, a protein complex with a p-value < 0.01 is deemed to be biologically significant in the PPI network.
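The p-value above is the upper tail of a hypergeometric distribution, which can be sketched with the standard library alone. The function below sums the tail from m upward instead of subtracting the lower tail from 1; the two are algebraically identical, but the direct sum avoids floating-point cancellation when the p-value is very small. The function name is hypothetical, and the result is an uncorrected p-value, so it need not coincide with values reported by tools that apply a multiple-testing correction.

```python
from math import comb


def enrichment_p_value(m, N, C, F):
    """P(X >= m) for X ~ Hypergeometric(N, F, C).

    m: proteins of the complex found in the functional group
    N: total number of proteins in the PPI network
    C: size of the detected complex
    F: size of the functional group
    """
    if m <= 0:
        return 1.0
    upper = min(C, F)
    # Exact integer arithmetic for the tail, converted to float only at the end.
    tail = sum(comb(F, i) * comb(N - F, C - i) for i in range(m, upper + 1))
    return tail / comb(N, C)


if __name__ == "__main__":
    # Toy example: 10 proteins, functional group of 4, complex of 3, 2 hits.
    print(enrichment_p_value(m=2, N=10, C=3, F=4))
```

For the toy example the tail is (6·6 + 4·1)/120 = 1/3, which equals 1 − (20 + 60)/120 from the formula as printed.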

5. Results and Discussion

In this section, we present and discuss the performance of WECALM compared with other algorithms, followed by parametric selection, computational complexity analysis, and validation with function enrichment analysis.

5.1. Performance Comparison of WECALM with Other Algorithms

We compared the performance of our proposed WECALM algorithm with existing protein complex detection algorithms based on the evaluation criteria stated in Section 4.2. Specifically, we compared WECALM with ten recently developed complex detection algorithms, namely CFinder, MCL, COACH, EWCA, Core, CALM, ClusterONE, GMFTP, ProRank+, and CMC. To evaluate the ten algorithms fairly, we set the parameters of each algorithm to the optimal values recommended by its authors [10,20,82].

5.1.1. Performance on NewMIPS Complexes

We compared the robustness of our proposed WECALM method with that of other existing methods for detecting protein complexes, using the evaluation metrics described in Section 4.2. On the NewMIPS complexes using the BioGRID dataset (see Figure 3A), WECALM performed best in terms of recall (0.7701), F-measure (0.7252), coverage rate (0.6743), and maximum matching ratio (0.3975). In terms of precision (0.7131), ProRank+ outperformed all other methods. On the NewMIPS complexes using the DIP dataset (see Figure 3B), WECALM again performed best in terms of recall (0.7141), F-measure (0.5889), and maximum matching ratio (0.3531). ProRank+ (0.6657) and CMC (0.5736) performed best in terms of precision and coverage rate, respectively. Although ProRank+ and CMC led in precision and coverage rate, WECALM performed best in the overall composite score. Table A1 in Appendix A provides supplementary results for the performance comparison of WECALM and the other algorithms on the NewMIPS complexes using the BioGRID and DIP datasets.

5.1.2. Performance on CYC2008 Complexes

We also compared the performance of WECALM with other algorithms on the CYC2008 complexes using both the BioGRID and DIP datasets. On the CYC2008 complexes using the BioGRID dataset (see Figure 4A), WECALM performed best in terms of recall (0.8291), F-measure (0.6956), CR (0.8831), and MMR (0.4825). In terms of precision (0.6622), ProRank+ performed better than the other methods. On the DIP dataset (see Figure 4B), WECALM again performed best in terms of recall (0.7315), F-measure (0.6315), and maximum matching ratio (0.3866). ProRank+ (0.6924) and GMFTP (0.6085) performed best in terms of precision and coverage rate, respectively. Although ProRank+ and GMFTP led in precision and coverage rate, WECALM performed best in the overall composite score. Table A2 in Appendix A provides supplementary results for the performance of WECALM and the other algorithms on the CYC2008 complexes using the BioGRID and DIP datasets.
WECALM’s performance was also evaluated in terms of separation and ACC. According to the results in Table A1 and Table A2 in Appendix A, WECALM achieved the highest separation on both the NewMIPS and CYC2008 complexes using both the BioGRID and DIP datasets, and the highest ACC on the NewMIPS complexes; on the CYC2008 complexes, the ACC values listed for CALM in Table A2 (0.7153 on BioGRID and 0.6721 on DIP) are slightly higher than those of WECALM (0.6983 and 0.6569). A high separation measure indicates that the detected complexes are well separated from one another, which reflects good algorithm performance, whereas a low separation measure indicates that the complexes overlap or clump together, which may point to false-positive complexes or inaccurate detection of true complexes. An ACC score close to 1 indicates near-perfect performance, meaning that the algorithm detected all the true complexes, whereas an ACC score of less than 1 indicates that some of the detected complexes were false positives.

5.2. Parametric Selection

Here, we evaluate how adjusting the thresholds π, η, and ω, which apply to the overlapping score, the local modularity score, and the core structural similarity score, respectively, affects the performance of WECALM.

5.2.1. Effect of Varying π on the Performance of WECALM

The overlapping score measures the similarity between two protein complexes; in our simulation, we measure the degree of overlap between sets of candidate overlapping protein complexes using Equation (9). To assess the effect of π on the performance of WECALM, we varied the threshold π from 0.1 to 1.0 in increments of 0.1 and calculated the composite score based on the evaluation metrics Recall, Precision, F-measure, CR, and MMR. We used the BioGRID and DIP yeast PPI datasets in Table 1 and the NewMIPS and CYC2008 reference protein complexes in Table 2. Figure 5 shows the composite score of WECALM at different π values on the BioGRID and DIP datasets. On the NewMIPS reference complexes, for both the BioGRID (Figure 5A) and DIP (Figure 5B) datasets, the recall, MMR, and CR scores decrease as π increases, while the precision and F-measure scores reach their maxima at π = 0.8 and π = 0.4, respectively. We also investigated the effect of π on the performance of WECALM using the CYC2008 reference protein complexes. Again, on both the BioGRID (Figure 5C) and DIP (Figure 5D) datasets, the recall and CR scores decrease as π increases, whereas the precision and F-measure scores reach their maxima at π = 0.8 and π = 0.4, respectively; the MMR is at its maximum when π = 0.2. We also notice that WECALM has a higher CR score on the BioGRID dataset than on the DIP dataset. The overall performance of WECALM is best at π = 0.4 for both the BioGRID and DIP datasets, which provides insight into the overlapping structural similarities and differences between protein complexes. Therefore, by tuning π to its optimal value, our WECALM approach can achieve good accuracy and reliable prediction results.

5.2.2. Effect of Varying η on the Performance of WECALM

The local modularity structural similarity score threshold η describes the similarity between the local modular structures of protein complexes. To evaluate the effect of η on the performance of WECALM, we measured the Recall, Precision, F-measure, CR, and MMR using the BioGRID and DIP yeast PPI datasets (see Table 1) and the NewMIPS and CYC2008 reference protein complexes (see Table 2). We varied the η threshold over the range [0.1, 1.0) in increments of 0.1 and set η > 0 and η 0. On the NewMIPS reference complexes (Figure 6A,B), for both the BioGRID and DIP datasets, the recall, MMR, and CR scores decrease as η increases, whereas the precision and F-measure scores reach their maxima at η = 0.8 and η = 0.4, respectively. On the CYC2008 reference protein complexes, for both BioGRID (Figure 6C) and DIP (Figure 6D), the recall, MMR, and CR scores likewise decrease as η increases, whereas the precision and F-measure scores reach their maxima at η = 0.8 and η = 0.4, respectively. WECALM has a higher CR score on the BioGRID dataset than on the DIP dataset. The overall performance of WECALM is best at η = 0.4 for both the BioGRID and DIP datasets and on both the NewMIPS and CYC2008 reference complexes, which provides insight into the local modularity structural similarities and differences between modular proteins in the protein complexes. An algorithm with an optimal η is able to identify common features or properties of protein complexes, which can give insights into the mechanisms of cellular processes.

5.2.3. Effect of Varying ω on the Performance of WECALM

The core structural similarity score threshold ω describes the similarity between the core structures of protein complexes. Evaluating this parameter was essential to ascertain whether WECALM can correctly detect the cores of protein complexes, which play a key role in the study of the functional relationships between different protein complexes: comparing the core structures of complexes shows whether they have similar functions, which can provide insights into the functional roles of protein complexes in cellular processes. Therefore, to evaluate the performance of WECALM, we measured the Recall, Precision, F-measure, CR, and MMR using the BioGRID and DIP yeast PPI datasets (see Table 1) and the NewMIPS and CYC2008 reference protein complexes (see Table 2). We varied the ω threshold over the range [0.1, 1.0) in increments of 0.1 and set ω 0. On the NewMIPS reference complexes, for both the BioGRID (Figure 7A) and DIP (Figure 7B) datasets, the recall, MMR, and CR scores decrease as ω increases, whereas the precision and F-measure scores reach their maxima at ω = 0.8 and ω = 0.4, respectively. We also evaluated the effect of ω on the performance of WECALM using the CYC2008 reference protein complexes. Again, for both the BioGRID (Figure 7C) and DIP (Figure 7D) datasets, the recall, MMR, and CR scores decrease as ω increases, while the precision and F-measure scores reach their maxima at ω = 0.8 and ω = 0.4, respectively. WECALM has a higher CR score on the BioGRID dataset than on the DIP dataset. The overall performance of WECALM is best at ω = 0.4 for both the BioGRID and DIP datasets, which provides insight into the core structural similarities and differences between the cores of different protein complexes.

5.3. Computational Complexity Analysis

We also performed a computational complexity analysis to assess the efficiency of WECALM relative to other algorithms in terms of the time required to detect the protein complexes in the standard complexes. For simplicity, we ran each program with its default settings. We then compared the time taken, the total number of detected protein complexes, and the metrics F-measure, CR, MMR, Sep, and ACC. In this analysis, we used the Human and Yeast reference complexes (see Table 2).
We set the parameters of the other eight algorithms based on their authors’ recommendations, while for our proposed WECALM we set ω, π, and η to the default values obtained from the experimental results given in Section 5.2. Table 3 shows that WECALM and EWCA had low computational complexity, indicating good efficiency in detecting protein complexes in the Human standard complexes. Furthermore, compared with the other algorithms, WECALM had the highest MMR, Sep, and ACC scores, demonstrating a balance between accuracy and efficiency. On the standard Yeast complexes, a similar trend was seen, with WECALM and EWCA having the lowest computational time. Moreover, WECALM detected more protein complexes with better efficiency than the other algorithms, making it the overall best-performing algorithm for the detection of protein complexes on both the Human and Yeast standard complexes.

5.4. Function Enrichment Analysis

Because the reference complexes are incomplete, we investigated the biological significance of the detected protein complexes to confirm the effectiveness of our WECALM approach. For the enrichment analysis, the p-value of each identified complex is calculated by Equation (36). A complex detected by any of the methods is considered biologically significant if its p-value is below $10^{-2}$, and a lower p-value indicates greater biological significance. Using SGD’s GO Term Finder web service [83], we validated the functional relationships and cellular mechanisms of the detected complexes based on biological process terms. In this case, the smallest p-value across all Gene Ontology terms represents the functional homogeneity of each identified complex. In addition to the protein complexes identified by WECALM, we calculated the p-values of the protein complexes of size at least 3 identified by MCL, COACH, Core, CALM, EWCA, CFinder, GMFTP, ClusterONE, CMC, and ProRank+. Table 4 shows the p-value test results for MCL, COACH, Core, CALM, EWCA, CFinder, GMFTP, ClusterONE, CMC, ProRank+, and WECALM. To compare the biological significance of the protein complexes identified by different algorithms, we computed the number of detected complexes, the total number of detected complexes, and the percentage of detected complexes in different p-value ranges. On the one hand, we found that the majority of algorithms consider only the percentage of detected complexes. On the other hand, the p-values of identified protein complexes are proportional to their size [22,23,84,85].
When analyzing the function enrichment of identified protein complexes, it is essential to consider both the number and the proportion of the identified complexes. On the BioGRID dataset, as shown in Table 4, 97.45% of the protein complexes detected by WECALM were significant, slightly less than for ProRank+, which recorded the highest proportion (97.59%). This is most likely because the protein complexes identified by WECALM are typically larger than those identified by other algorithms such as ProRank+. WECALM detects far more protein complexes than ProRank+. The numbers of protein complexes listed for MCL, COACH, Core, CALM, EWCA, CFinder, GMFTP, ClusterONE, and CMC in the BioGRID dataset are 107, 161, 1035, 463, 1341, 269, 449, 210, 832, and 728, respectively. WECALM detected the largest number of protein complexes (1376), considerably more than ProRank+. On the DIP dataset, 96.17% of the protein complexes detected by WECALM were significant, compared with 93.79% for ProRank+, an increase of about 3%. At the same time, WECALM also identified the most protein complexes. In the DIP dataset, 113, 144, 315, 603, 869, 272, 350, 235, 146, and 319 protein complexes were identified by MCL, COACH, Core, CALM, EWCA, CFinder, GMFTP, and CMC, respectively. In general, as the number of detected protein complexes decreases, the proportion of significant protein complexes increases. COACH, CFinder, and GMFTP discovered far fewer protein complexes than WECALM; nevertheless, they had a lower percentage of significant protein complexes than WECALM. Considering both the total number of detected protein complexes and the percentage of significant complexes, WECALM outperformed the other methods in terms of functionality and biological significance. According to their p-values, the protein complexes detected by WECALM have a higher probability of being actual protein complexes.
To further validate the biological significance of the identified complexes, Appendix B (see Table A3 and Table A4) lists five protein complexes with extremely low p-values that WECALM detected on the BioGRID and DIP datasets. Our analysis evaluated the cluster frequency, genome frequency, biological process p-values, False Discovery Rate (FDR), false positive score, and Gene Ontology term descriptions. Cluster frequency is a metric employed in the evaluation of algorithms designed to detect protein complexes. It represents the number of times the algorithm detects a specific complex across several replicates or runs of a similar test.
Table A3 in Appendix B shows that WECALM detected protein complexes with a high cluster frequency on the BioGRID dataset. A high cluster frequency implies that WECALM detects a protein complex consistently across multiple runs, which indicates good performance. It also means that most of the detected protein complexes in the BioGRID dataset closely match the Gene Ontology terms and have a functional relationship with high statistical significance. According to the results in Table A4, WECALM also detected protein complexes with a high cluster frequency on the DIP dataset, a clear indication of good performance. In addition, Table A5 shows that WECALM detected a large number of complexes with a 100% cluster frequency. In the detection of protein complexes, a cluster frequency of 100% means that a particular complex is detected in all runs of the test. This is regarded as a very good indication of the robustness and repeatability of the findings, as it implies that the complex is consistently detected by the algorithm across numerous replicates of the same experiment. The complexes detected by WECALM had very low p-values ($p < 10^{-10}$), indicating that they were biologically significant and meaningful and were most likely true protein complexes. These results can serve as a valuable benchmark in future research.

6. Conclusions

Advancements in biological mechanism research have led to the discovery of more disease-associated genes. Analyzing the Protein–Protein interaction (PPI) networks of these genes can help identify new disease-associated genes and clarify their role in specific diseases. This study proposes a new approach called WECALM, which uses a structural-based weighted network analysis of protein complexes using experimentally determined PPI datasets.
WECALM combines different graph mining algorithms based on protein complex structures and local attachment proteins to predict the inherent structure of the protein complexes in the PPI network. The approach introduces a new edge weight calculation method based on the Jaccard similarity measure between interacting proteins in the PPI networks, which improves the reliability of PPI networks for the accurate detection of protein complexes. It also integrates different network structure-based algorithms to detect overlapping structures, local modularity structures, and core-attachment structures in PPI networks, making it more robust than existing methods in detecting protein complexes with different structures and densities [14,15,16,22,23].
The study demonstrates that WECALM outperforms existing methods in terms of accuracy, computational speed, and the ability to detect biologically significant new protein complexes. Its greatest biological relevance may be that it helps to reduce the false-positive detection of protein complexes predicted by purely topology-based methods. Nevertheless, the accuracy and efficiency of protein complex detection depend on the tuning of predefined parameters and on the size of the PPI network. Additionally, WECALM is an in silico method applied to PPI networks; future research should confirm its efficiency on other biological networks and conduct laboratory tests to validate its findings.

Author Contributions

Conceptualization, P.J.O., J.D., M.K. and T.K.; methodology, P.J.O., J.D. and M.K.; software, P.J.O.; visualization, P.J.O.; formal analysis, P.J.O., J.D., T.K. and M.K.; investigation, P.J.O., J.D., M.K. and T.K.; supervision, J.D. and M.K.; project administration, M.K.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

M.K. gratefully acknowledges the European Commission for funding the InnoRenew CoE project (Grant Agreement no. 739574) under the Horizon2020 Widespread-Teaming program and the Republic of Slovenia (Investment funding of the Republic of Slovenia and the European Union of the European Regional Development Fund). He is also grateful for the support of the Slovenian Research Agency (ARRS) through grants J2-2504, N1-0223 and N2-0171. T.K. was supported by the National Laboratory of Biotechnology through the Hungarian National Research, Development and Innovation Office—NKFIH grant No. 2022-2.1.1-NL-2022-00008.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and datasets related to this study are available at https://github.com/peter26jumaochieng (accessed on 20 March 2023).

Acknowledgments

The research was supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program (RRF-2.3.1-21-2022-00004). The research was also funded by the National Research, Development, and Innovation Fund of the Ministry of Innovation and Technology of Hungary under the TKP2021-NVA (Project no. TKP2021-NVA-09) funding scheme.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Performance Comparison of WECALM with the Other Algorithms

Performance Comparison

Table A1 and Table A2 compare the performance of WECALM with the other algorithms based on NewMIPS and CYC2008, respectively. We list the performance score of each method in terms of Recall, Precision, F-Measure, CR, MMR, Sep, and ACC scores in both tables.
Table A1. Performance comparison of WECALM with other algorithms on the NewMIPS Complexes.
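| Dataset | Algorithm | Recall | Precision | F-Measure | CR | MMR | Sep | ACC |
|---|---|---|---|---|---|---|---|---|
| BioGRID | MCL | 0.2896 | 0.2011 | 0.2374 | 0.2995 | 0.1672 | 0.2679 | 0.2167 |
| BioGRID | COACH | 0.7256 | 0.2581 | 0.3808 | 0.6322 | 0.2525 | 0.5532 | 0.2777 |
| BioGRID | EWCA | 0.7561 | 0.5923 | 0.6643 | 0.6497 | 0.3764 | 0.6149 | 0.5221 |
| BioGRID | CFinder | 0.5914 | 0.1965 | 0.2950 | 0.4402 | 0.2801 | 0.3912 | 0.2131 |
| BioGRID | GMFTP | 0.7532 | 0.2831 | 0.4115 | 0.5187 | 0.2552 | 0.5174 | 0.4522 |
| BioGRID | Core | 0.5619 | 0.1488 | 0.2352 | 0.5882 | 0.1437 | 0.4575 | 0.3456 |
| BioGRID | CALM | 0.7352 | 0.6681 | 0.7001 | 0.6158 | 0.3145 | 0.6478 | 0.5576 |
| BioGRID | ClusterONE | 0.5914 | 0.3133 | 0.4096 | 0.5311 | 0.1931 | 0.4851 | 0.2951 |
| BioGRID | CMC | 0.5131 | 0.2731 | 0.3565 | 0.4954 | 0.3175 | 0.3976 | 0.5313 |
| BioGRID | ProRank+ | 0.4817 | 0.7131 | 0.5750 | 0.4763 | 0.2411 | 0.6276 | 0.5119 |
| BioGRID | WECALM | 0.7701 | 0.6853 | 0.7252 | 0.6743 | 0.3975 | 0.7765 | 0.6015 |
| DIP | MCL | 0.5148 | 0.1783 | 0.2649 | 0.3271 | 0.1655 | 0.2927 | 0.1655 |
| DIP | COACH | 0.5731 | 0.5106 | 0.5400 | 0.3353 | 0.2006 | 0.3833 | 0.2917 |
| DIP | EWCA | 0.7012 | 0.4892 | 0.5763 | 0.3982 | 0.3094 | 0.6198 | 0.5994 |
| DIP | CFinder | 0.5762 | 0.2408 | 0.3397 | 0.2403 | 0.2128 | 0.3543 | 0.3613 |
| DIP | GMFTP | 0.6982 | 0.2756 | 0.3952 | 0.4044 | 0.2228 | 0.5548 | 0.4229 |
| DIP | Core | 0.4421 | 0.1746 | 0.2503 | 0.3902 | 0.1249 | 0.5439 | 0.4374 |
| DIP | CALM | 0.5895 | 0.5213 | 0.5533 | 0.4364 | 0.2962 | 0.6625 | 0.5456 |
| DIP | ClusterONE | 0.4054 | 0.3021 | 0.3462 | 0.2417 | 0.2178 | 0.3041 | 0.3185 |
| DIP | CMC | 0.5932 | 0.4152 | 0.4885 | 0.5736 | 0.2499 | 0.3793 | 0.3151 |
| DIP | ProRank+ | 0.4086 | 0.6657 | 0.5064 | 0.2445 | 0.1669 | 0.6451 | 0.5567 |
| DIP | WECALM | 0.7166 | 0.4998 | 0.5889 | 0.4195 | 0.3531 | 0.8195 | 0.6317 |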
CR: Coverage Rate; MMR: Maximum Match Ratio; Sep: Separation; ACC: Geometrical Mean Accuracy; bold value indicates the best score.
Table A2. Performance comparison of WECALM with other algorithms on the CYC2008 complexes.
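| Dataset | Algorithm | Recall | Precision | F-Measure | CR | MMR | Sep | ACC |
|---|---|---|---|---|---|---|---|---|
| BioGRID | MCL | 0.3516 | 0.2268 | 0.2757 | 0.5313 | 0.1645 | 0.3831 | 0.2549 |
| BioGRID | COACH | 0.7669 | 0.2488 | 0.3757 | 0.8752 | 0.3042 | 0.5375 | 0.4117 |
| BioGRID | EWCA | 0.8191 | 0.5793 | 0.6786 | 0.8718 | 0.4351 | 0.6578 | 0.6035 |
| BioGRID | CFinder | 0.5724 | 0.1637 | 0.2546 | 0.6135 | 0.3115 | 0.5634 | 0.4215 |
| BioGRID | GMFTP | 0.7839 | 0.2914 | 0.4249 | 0.7956 | 0.3914 | 0.6192 | 0.4591 |
| BioGRID | Core | 0.5847 | 0.1527 | 0.2422 | 0.8058 | 0.2081 | 0.4126 | 0.2742 |
| BioGRID | CALM | 0.6984 | 0.6211 | 0.6575 | 0.8583 | 0.3968 | 0.7293 | 0.7153 |
| BioGRID | ClusterONE | 0.6612 | 0.3487 | 0.4566 | 0.7569 | 0.2734 | 0.5162 | 0.3574 |
| BioGRID | CMC | 0.4644 | 0.2677 | 0.3396 | 0.7639 | 0.3375 | 0.4611 | 0.4375 |
| BioGRID | ProRank+ | 0.4153 | 0.6622 | 0.5105 | 0.5851 | 0.2462 | 0.6105 | 0.5979 |
| BioGRID | WECALM | 0.8291 | 0.5991 | 0.6956 | 0.8831 | 0.4825 | 0.7825 | 0.6983 |
| DIP | MCL | 0.5169 | 0.1847 | 0.2722 | 0.4892 | 0.2299 | 0.3519 | 0.3125 |
| DIP | COACH | 0.5423 | 0.5167 | 0.5292 | 0.4879 | 0.2764 | 0.4272 | 0.3967 |
| DIP | EWCA | 0.7076 | 0.5239 | 0.6020 | 0.5806 | 0.3766 | 0.6436 | 0.5527 |
| DIP | CFinder | 0.5508 | 0.2398 | 0.3341 | 0.2788 | 0.3807 | 0.3758 | 0.4187 |
| DIP | GMFTP | 0.6652 | 0.2664 | 0.3804 | 0.6085 | 0.3316 | 0.6235 | 0.4136 |
| DIP | Core | 0.4618 | 0.1818 | 0.2609 | 0.5317 | 0.2433 | 0.3351 | 0.3519 |
| DIP | CALM | 0.6465 | 0.4915 | 0.5584 | 0.5345 | 0.3221 | 0.6833 | 0.6721 |
| DIP | ClusterONE | 0.4279 | 0.3343 | 0.3754 | 0.3751 | 0.2191 | 0.3957 | 0.3519 |
| DIP | CMC | 0.4932 | 0.4125 | 0.4493 | 0.5755 | 0.2501 | 0.4576 | 0.3251 |
| DIP | ProRank+ | 0.3772 | 0.6924 | 0.4884 | 0.3294 | 0.2029 | 0.5929 | 0.5817 |
| DIP | WECALM | 0.7315 | 0.5556 | 0.6315 | 0.5916 | 0.3866 | 0.7596 | 0.6569 |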
CR: Coverage Rate; MMR: Maximum Match Ratio; Sep: Separation; ACC: Geometrical Mean Accuracy; bold value indicates the best score.

Appendix B. A Function Enrichment Analysis

Here, we provide supplementary results of the functional enrichment analysis to confirm the biological significance of the complexes detected by WECALM on the BioGRID and DIP datasets, listed in Table A3 and Table A4, respectively. Table A3 and Table A4 give the complex ID, cluster frequency, genome frequency, p-value (biological process), False Discovery Rate (FDR), false positive score, and Gene Ontology term description. In our evaluation, the selection of significant Gene Ontology terms is based purely on the cluster frequency, p-values, and FDR values.

Appendix B.1. A Function Enrichment Analysis on BioGRID Complex

Table A3 lists the significant Gene Ontology (GO) terms shared by proteins in the complexes detected on the BioGRID dataset. The results show that the majority of the detected protein complexes match the Gene Ontology terms well. In addition, the p-values of the detected complexes are very low, which implies that the detected protein complexes have a high statistical significance.
Table A4 lists the significant Gene Ontology (GO) terms shared by proteins in the complexes detected on the DIP dataset. As with the BioGRID dataset, most of the protein complexes detected on the DIP dataset match the Gene Ontology terms well. The p-values of the detected complexes are also very low, which implies that the detected protein complexes have a high statistical significance.
Table A3. Top five protein complexes with significant low p-value detected by WECALM on BioGRID complex.
| Complex ID | Cluster Frequency | Genome Frequency | p-Value (BP) | FDR | False Positive | Gene Ontology Term |
|---|---|---|---|---|---|---|
| 1 | 9 of 12 genes, 75.0% | 44 of 7166 genes, 0.6% | $1.93 \times 10^{-16}$ | 0.0000 | 0.0000 | positive regulation of transcription elongation by RNA polymerase II |
| 1 | 9 of 12 genes, 75.0% | 47 of 7166 genes, 0.7% | $3.72 \times 10^{-16}$ | 0.0000 | 0.0000 | regulation of transcription elongation by RNA polymerase II |
| 1 | 9 of 12 genes, 75.0% | 52 of 7166 genes, 0.7% | $1.00 \times 10^{-15}$ | 0.0000 | 0.0000 | positive regulation of DNA-templated transcription, elongation |
| 1 | 9 of 12 genes, 75.0% | 55 of 7166 genes, 0.8% | $1.73 \times 10^{-15}$ | 0.0000 | 0.0000 | regulation of DNA-templated transcription elongation |
| 1 | 9 of 12 genes, 75.0% | 96 of 7166 genes, 1.3% | $3.47 \times 10^{-13}$ | 0.0000 | 0.0000 | transcription elongation by RNA polymerase II |
| 2 | 12 of 13 genes, 92.3% | 936 of 7166 genes, 13.1% | $4.40 \times 10^{-4}$ | 0.0000 | 0.0000 | amide metabolic process |
| 2 | 12 of 13 genes, 92.3% | 1348 of 7166 genes, 18.8% | $8.30 \times 10^{-4}$ | 0.0000 | 0.0000 | organonitrogen compound biosynthetic process |
| 2 | 12 of 13 genes, 92.3% | 1816 of 7166 genes, 25.3% | $1.18 \times 10^{-3}$ | 0.0000 | 0.0000 | cellular nitrogen compound biosynthetic process |
| 2 | 12 of 13 genes, 92.3% | 2109 of 7166 genes, 29.4% | $5.58 \times 10^{-3}$ | 0.0000 | 0.0000 | cellular biosynthetic process |
| 2 | 12 of 13 genes, 92.3% | 2725 of 7166 genes, 38.0% | $7.22 \times 10^{-3}$ | 0.0000 | 0.0000 | cellular nitrogen compound metabolic process |
| 3 | 11 of 12 genes, 91.7% | 367 of 7166 genes, 5.1% | $4.78 \times 10^{-10}$ | 0.0000 | 0.0000 | rRNA processing |
| 3 | 11 of 12 genes, 91.7% | 423 of 7166 genes, 5.9% | $1.98 \times 10^{-9}$ | 0.0000 | 0.0000 | rRNA metabolic process |
| 3 | 11 of 12 genes, 91.7% | 482 of 7166 genes, 6.7% | $7.29 \times 10^{-9}$ | 0.0000 | 0.0000 | ribosome biogenesis |
| 3 | 11 of 12 genes, 91.7% | 492 of 7166 genes, 6.9% | $8.94 \times 10^{-9}$ | 0.0000 | 0.0000 | ncRNA processing |
| 3 | 11 of 12 genes, 91.7% | 2159 of 7166 genes, 30.1% | $1.14 \times 10^{-3}$ | 0.0000 | 0.0000 | gene expression |
| 4 | 13 of 14 genes, 92.9% | 204 of 7166 genes, 2.8% | $3.84 \times 10^{-18}$ | 0.0000 | 0.0000 | cytoplasmic translation |
| 4 | 13 of 14 genes, 92.9% | 820 of 7166 genes, 11.4% | $3.38 \times 10^{-10}$ | 0.0000 | 0.0000 | translation |
| 4 | 13 of 14 genes, 92.9% | 824 of 7166 genes, 11.5% | $3.60 \times 10^{-10}$ | 0.0000 | 0.0000 | peptide biosynthetic process |
| 4 | 13 of 14 genes, 92.9% | 841 of 7166 genes, 11.7% | $4.70 \times 10^{-10}$ | 0.0000 | 0.0000 | peptide metabolic process |
| 4 | 13 of 14 genes, 92.9% | 879 of 7166 genes, 12.3% | $8.34 \times 10^{-10}$ | 0.0000 | 0.0000 | amide biosynthetic process |
| 5 | 12 of 13 genes, 92.3% | 204 of 7166 genes, 2.8% | $1.32 \times 10^{-16}$ | 0.0000 | 0.0000 | ribosomal large subunit biogenesis |
| 5 | 12 of 13 genes, 92.3% | 820 of 7166 genes, 11.4% | $2.78 \times 10^{-9}$ | 0.0000 | 0.0000 | biosynthetic process |
| 5 | 12 of 13 genes, 92.3% | 824 of 7166 genes, 11.5% | $2.95 \times 10^{-9}$ | 0.0000 | 0.0000 | peptide biosynthetic process |
| 5 | 12 of 13 genes, 92.3% | 841 of 7166 genes, 11.7% | $3.77 \times 10^{-9}$ | 0.0000 | 0.0000 | ribonucleoprotein complex biogenesis |
| 5 | 12 of 13 genes, 92.3% | 879 of 7166 genes, 12.3% | $6.39 \times 10^{-9}$ | 0.0000 | 0.0000 | cellular biosynthetic process |

FDR: False Discovery Rate; BP: Biological Process; BP is significant at p-value < $10^{-2}$.

Appendix B.2. A Function Enrichment Analysis on DIP Complex

Table A4. Top five protein complexes with significant low p-value detected by WECALM on DIP complex.
Table A4. Top five protein complexes with significant low p-value detected by WECALM on DIP complex.
| Complex ID | Cluster Frequency | Genome Frequency | p-Value | FDR | False Positive | Gene Ontology Term |
|---|---|---|---|---|---|---|
| 1 | 11 of 12 genes, 91.7% | 125 of 7166 genes, 1.7% | $9.76 \times 10^{-15}$ | 0.0000 | 0.0000 | ribosomal large subunit biogenesis |
| | 11 of 12 genes, 91.7% | 482 of 7166 genes, 6.7% | $1.08 \times 10^{-10}$ | 0.0000 | 0.0000 | ribosome biogenesis |
| | 11 of 12 genes, 91.7% | 576 of 7166 genes, 8.0% | $7.74 \times 10^{-10}$ | 0.0000 | 0.0000 | ribonucleoprotein complex biogenesis |
| | 11 of 12 genes, 91.7% | 1272 of 7166 genes, 17.8% | $4.49 \times 10^{-6}$ | 0.0000 | 0.0000 | cellular component biogenesis |
| | 11 of 12 genes, 91.7% | 2424 of 7166 genes, 33.8% | $4.55 \times 10^{-3}$ | 0.0008 | 0.0000 | cellular component organization or biogenesis |
| 2 | 3 of 4 genes, 75.0% | 56 of 7166 genes, 0.8% | $1.10 \times 10^{-4}$ | 0.0000 | 0.0000 | purine ribonucleoside triphosphate metabolic process |
| | 3 of 4 genes, 75.0% | 58 of 7166 genes, 0.8% | $1.30 \times 10^{-4}$ | 0.0000 | 0.0000 | purine nucleoside triphosphate metabolic process |
| | 3 of 4 genes, 75.0% | 119 of 7166 genes, 1.7% | $1.14 \times 10^{-3}$ | 0.0000 | 0.0000 | nucleotide biosynthetic process |
| | 3 of 4 genes, 75.0% | 121 of 7166 genes, 1.7% | $1.20 \times 10^{-3}$ | 0.0000 | 0.0000 | nucleoside phosphate biosynthetic process |
| | 3 of 4 genes, 75.0% | 125 of 7166 genes, 1.7% | $1.33 \times 10^{-3}$ | 0.0000 | 0.0000 | ribonucleotide metabolic process |
| 3 | 9 of 10 genes, 90.0% | 20 of 7166 genes, 0.3% | $9.69 \times 10^{-22}$ | 0.0000 | 0.0000 | ATP biosynthetic process |
| | 9 of 10 genes, 90.0% | 20 of 7166 genes, 0.3% | $9.69 \times 10^{-22}$ | 0.0000 | 0.0000 | proton motive force-driven ATP synthesis |
| | 9 of 10 genes, 90.0% | 24 of 7166 genes, 0.3% | $7.54 \times 10^{-21}$ | 0.0000 | 0.0000 | purine nucleoside triphosphate biosynthetic process |
| | 9 of 10 genes, 90.0% | 24 of 7166 genes, 0.3% | $7.54 \times 10^{-21}$ | 0.0000 | 0.0000 | purine ribonucleoside triphosphate biosynthetic process |
| | 9 of 10 genes, 90.0% | 30 of 7166 genes, 0.4% | $8.25 \times 10^{-20}$ | 0.0000 | 0.0000 | ribonucleoside triphosphate biosynthetic process |
| 4 | 10 of 11 genes, 90.9% | 20 of 7166 genes, 0.3% | $1.59 \times 10^{-24}$ | 0.0000 | 0.0000 | proton transmembrane transport |
| | 10 of 11 genes, 90.9% | 20 of 7166 genes, 0.3% | $1.59 \times 10^{-24}$ | 0.0000 | 0.0000 | purine ribonucleotide metabolic process |
| | 10 of 11 genes, 90.9% | 24 of 7166 genes, 0.3% | $1.69 \times 10^{-23}$ | 0.0000 | 0.0000 | nucleotide biosynthetic process |
| | 10 of 11 genes, 90.9% | 24 of 7166 genes, 0.3% | $1.69 \times 10^{-23}$ | 0.0000 | 0.0000 | nucleoside phosphate biosynthetic process |
| | 10 of 11 genes, 90.9% | 30 of 7166 genes, 0.4% | $2.59 \times 10^{-22}$ | 0.0000 | 0.0000 | ribonucleotide metabolic process |
| 5 | 9 of 10 genes, 90.0% | 444 of 7166 genes, 6.2% | $1.08 \times 10^{-8}$ | 0.0000 | 0.0000 | intracellular protein transport |
| | 9 of 10 genes, 90.0% | 449 of 7166 genes, 6.3% | $1.19 \times 10^{-8}$ | 0.0000 | 0.0000 | vesicle-mediated transport |
| | 9 of 10 genes, 90.0% | 630 of 7166 genes, 8.8% | $2.52 \times 10^{-7}$ | 0.0000 | 0.0000 | protein transport |
| | 9 of 10 genes, 90.0% | 651 of 7166 genes, 9.1% | $3.38 \times 10^{-7}$ | 0.0000 | 0.0000 | establishment of protein localization |
| | 9 of 10 genes, 90.0% | 742 of 7166 genes, 10.4% | $1.09 \times 10^{-6}$ | 0.0000 | 0.0000 | intracellular transport |
FDR: False Discovery Rate; BP: Biological Process; BP is significant at p-value < $10^{-2}$.
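To make the relationship between the cluster frequency, the genome frequency, and the p-value more concrete, the following is a minimal sketch of an enrichment test. It assumes the p-value is an upper-tail hypergeometric probability, which is a common choice for GO term enrichment; this section does not state which test or which multiple-testing correction produced the tabulated values, so the raw probability computed here need not equal them. The function name `enrichment_p_value` and its parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    k: genes in the complex annotated with the GO term
    n: genes in the complex (cluster frequency is k of n)
    K: genes in the genome annotated with the GO term
    N: genes in the genome (genome frequency is K of N)

    Assumption: no multiple-testing correction is applied here.
    """
    if not (0 <= k <= n <= N and 0 <= K <= N):
        raise ValueError("inconsistent counts")
    total = comb(N, n)            # all ways to draw n genes from the genome
    upper = min(n, K)             # largest possible number of annotated genes
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Counts taken from one row of the table above: 9 of 10 genes in the
# complex, 20 of 7166 genes in the genome.
p_raw = enrichment_p_value(k=9, n=10, K=20, N=7166)
print(f"uncorrected p-value: {p_raw:.3e}")
```

Exact integer binomials are used so that very small probabilities are not lost to floating-point underflow before the final division.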

Appendix B.3. Detected Protein Complexes with 100% Cluster Frequency

Table A5 lists the top 10 complexes with 100% cluster frequency detected by WECALM on the BioGRID and DIP complexes.
Table A5. Top 10 protein complexes with 100% cluster frequency detected by WECALM on the BioGRID and DIP complex datasets.
| Dataset | Complex ID | Cluster Frequency | Genome Frequency | p-Value (BP) | FDR | False Positive | Gene Ontology Term |
|---|---|---|---|---|---|---|---|
| BioGRID | 1 | 12 of 12 genes, 100.0% | 122 of 7166 genes, 1.7% | $9.16 \times 10^{-15}$ | 0.0000 | 0.0000 | mRNA splicing, via spliceosome |
| | 2 | 42 of 42 genes, 100.0% | 123 of 7166 genes, 1.7% | $1.27 \times 10^{-10}$ | 0.0000 | 0.0000 | RNA splicing, |
| | 3 | 10 of 10 genes, 100.0% | 10 of 7166 genes, 0.1% | $3.64 \times 10^{-10}$ | 0.0000 | 0.0000 | spliceosomal tri-snRNP complex assembly |
| | 4 | 19 of 19 genes, 100.0% | 132 of 7166 genes, 1.8% | $1.49 \times 10^{-7}$ | 0.0000 | 0.0000 | RNA splicing, via transesterification reactions |
| | 5 | 36 of 36 genes, 100.0% | 157 of 7166 genes, 2.2% | $4.55 \times 10^{-3}$ | 0.0000 | 0.0000 | RNA splicing |
| | 6 | 11 of 11 genes, 100.0% | 20 of 7166 genes, 0.3% | $9.29 \times 10^{-21}$ | 0.0000 | 0.0000 | spliceosomal snRNP assembly |
| | 7 | 10 of 10 genes, 100.0% | 229 of 7166 genes, 3.2% | $9.08 \times 10^{-16}$ | 0.0000 | 0.0000 | mRNA processing |
| | 8 | 17 of 17 genes, 100.0% | 350 of 7166 genes, 4.9% | $1.08 \times 10^{-15}$ | 0.0000 | 0.0200 | mRNA metabolic process |
| | 9 | 42 of 42 genes, 100.0% | 347 of 7166 genes, 4.8% | $1.51 \times 10^{-15}$ | 0.0000 | 0.0000 | DNA-directed 5'-3' RNA polymerase activity |
| | 10 | 23 of 23 genes, 100.0% | 34 of 7166 genes, 0.5% | $6.25 \times 10^{-20}$ | 0.0000 | 0.0000 | 5'-3' RNA polymerase activity |
| DIP | 1 | 19 of 19 genes, 100.0% | 62 of 7166 genes, 0.9% | $4.49 \times 10^{-19}$ | 0.0000 | 0.0000 | nucleotide-excision repair |
| | 2 | 39 of 39 genes, 100.0% | 234 of 7166 genes, 3.3% | $7.76 \times 10^{-19}$ | 0.0000 | 0.0000 | ubiquitin-dependent protein catabolic process |
| | 3 | 36 of 36 genes, 100.0% | 240 of 7166 genes, 3.3% | $9.25 \times 10^{-19}$ | 0.0000 | 0.0000 | modification-dependent protein catabolic process |
| | 4 | 10 of 10 genes, 100.0% | 262 of 7166 genes, 3.7% | $3.40 \times 10^{-18}$ | 0.0000 | 0.0000 | modification-dependent macromolecule catabolic process |
| | 5 | 14 of 14 genes, 100.0% | 264 of 7166 genes, 3.7% | $8.19 \times 10^{-18}$ | 0.0000 | 0.0000 | proteolysis involved in protein catabolic process |
| | 6 | 13 of 13 genes, 100.0% | 293 of 7166 genes, 4.1% | $1.83 \times 10^{-17}$ | 0.0000 | 0.0000 | protein catabolic process |
| | 7 | 15 of 15 genes, 100.0% | 309 of 7166 genes, 4.3% | $3.03 \times 10^{-17}$ | 0.0000 | 0.0000 | DNA repair |
| | 8 | 12 of 12 genes, 100.0% | 350 of 7166 genes, 4.9% | $5.50 \times 10^{-17}$ | 0.0000 | 0.0000 | cellular response to DNA damage stimulus |
| | 9 | 31 of 31 genes, 100.0% | 407 of 7166 genes, 5.7% | $1.07 \times 10^{-16}$ | 0.0000 | 0.0000 | organonitrogen compound catabolic process |
| | 10 | 17 of 17 genes, 100.0% | 416 of 7166 genes, 5.8% | $3.64 \times 10^{-16}$ | 0.0000 | 0.0000 | proteolysis |
FDR: False Discovery Rate; BP: Biological Process; BP is significant at p-value < $10^{-2}$.

References

  1. Almeida, R.M.; Dell’Acqua, S.; Krippahl, L.; Moura, J.J.; Pauleta, S.R. Predicting Protein–Protein interactions using bigger: Case studies. Molecules 2016, 21, 1037. [Google Scholar] [CrossRef] [PubMed]
  2. Bustamam, A.; Siswantining, T.; Kaloka, T.P.; Swasti, O. Application of bimax, pols, and lcm-mbc to find bicluster on interactions protein between hiv-1 and human. Austrian J. Stat. 2020, 49, 1–18. [Google Scholar] [CrossRef]
  3. Tripathi, S.; Moutari, S.; Dehmer, M.; Emmert-Streib, F. Comparison of module detection algorithms in protein networks and investigation of the biological meaning of predicted modules. BMC Bioinform. 2016, 17, 129. [Google Scholar] [CrossRef] [PubMed]
  4. Li, X.L.; Ng, S.K. Biological Data Mining in Protein Interaction Networks; IGI Global: Hershey, PA, USA, 2009. [Google Scholar]
  5. Wu, D.; Hu, X. Topological analysis and sub-network mining of Protein–Protein interactions. In Research and Trends in Data Mining Technologies and Applications; IGI Global: Hershey, PA, USA, 2007; pp. 209–240. [Google Scholar]
  6. Larsen, P.E.; Collart, F.; Dai, Y. Incorporating network topology improves prediction of protein interaction networks from transcriptomic data. Int. J. Knowl. Discov. Bioinform. (IJKDB) 2010, 1, 1–19. [Google Scholar] [CrossRef]
  7. Ahnert, S.E.; Marsh, J.A.; Hernández, H.; Robinson, C.V.; Teichmann, S.A. Principles of assembly reveal a periodic table of protein complexes. Science 2015, 350, aaa2245. [Google Scholar] [CrossRef]
  8. Tong, A.H.Y.; Drees, B.; Nardelli, G.; Bader, G.D.; Brannetti, B.; Castagnoli, L.; Evangelista, M.; Ferracuti, S.; Nelson, B.; Paoluzi, S.; et al. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002, 295, 321–324. [Google Scholar] [CrossRef]
  9. Shen, X.; Yi, L.; Jiang, X.; Zhao, Y.; Hu, X.; He, T.; Yang, J. Neighbor affinity based algorithm for discovering temporal protein complex from dynamic PPI network. Methods 2016, 110, 90–96. [Google Scholar] [CrossRef]
  10. Zhang, X.F.; Dai, D.Q.; Ou-Yang, L.; Yan, H. Detecting overlapping protein complexes based on a generative model with functional and topological properties. BMC Bioinform. 2014, 15, 186. [Google Scholar] [CrossRef]
  11. Shen, X.; Zhou, J.; Yi, L.; Hu, X.; He, T.; Yang, J. Identifying protein complexes based on brainstorming strategy. Methods 2016, 110, 44–53. [Google Scholar] [CrossRef]
  12. Liu, G.; Wong, L.; Chua, H.N. Complex discovery from weighted PPI networks. Bioinformatics 2009, 25, 1891–1897. [Google Scholar] [CrossRef]
  13. Adamcsek, B.; Palla, G.; Farkas, I.J.; Derényi, I.; Vicsek, T. CFinder: Locating cliques and overlapping modules in biological networks. Bioinformatics 2006, 22, 1021–1023. [Google Scholar] [CrossRef]
  14. Van Dongen, S.M. Graph clustering by Flow Simulation. Ph.D. Thesis, University of Utrecht, Utrecht, The Netherlands, 2000. [Google Scholar]
  15. Vlasblom, J.; Wodak, S.J. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinform. 2009, 10, 99. [Google Scholar] [CrossRef]
  16. Ochieng, P.J.; Kusuma, W.; Haryanto, T. Detection of protein complex from Protein–Protein interaction network using Markov clustering. In Proceedings of the Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2017; Volume 835, p. 012001. [Google Scholar]
  17. Wang, R.; Liu, G.; Wang, C. Identifying protein complexes based on an edge weight algorithm and core-attachment structure. BMC Bioinform. 2019, 20, 471. [Google Scholar] [CrossRef]
  18. Xie, D.; Yi, Y.; Zhou, J.; Li, X.; Wu, H. A novel temporal protein complexes identification framework based on density–Distance and heuristic algorithm. Neural Comput. Appl. 2019, 31, 4693–4701. [Google Scholar] [CrossRef]
  19. Jiang, P.; Singh, M. SPICi: A fast clustering algorithm for large biological networks. Bioinformatics 2010, 26, 1105–1111. [Google Scholar] [CrossRef]
  20. Nepusz, T.; Yu, H.; Paccanaro, A. Detecting overlapping protein complexes in Protein–Protein interaction networks. Nat. Methods 2012, 9, 471–472. [Google Scholar] [CrossRef]
  21. Wang, R.; Liu, G.; Wang, C.; Su, L.; Sun, L. Predicting overlapping protein complexes based on core-attachment and a local modularity structure. BMC Bioinform. 2018, 19, 305. [Google Scholar] [CrossRef]
  22. Wu, M.; Li, X.; Kwoh, C.K.; Ng, S.K. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinform. 2009, 10, 169. [Google Scholar] [CrossRef]
  23. Leung, H.C.; Xiang, Q.; Yiu, S.M.; Chin, F.Y. Predicting protein complexes from PPI data: A core-attachment approach. J. Comput. Biol. 2009, 16, 133–144. [Google Scholar] [CrossRef]
  24. Hanna, E.M.; Zaki, N. Detecting protein complexes in protein interaction networks using a ranking algorithm with a refined merging procedure. BMC Bioinform. 2014, 15, 204. [Google Scholar] [CrossRef]
  25. Palla, G.; Derényi, I.; Farkas, I.; Vicsek, T. Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435, 814–818. [Google Scholar] [CrossRef] [PubMed]
  26. Karp, R.M. Reducibility among combinatorial problems. In Proceedings of the Complexity of Computer Computations: Proceedings of a symposium on the Complexity of Computer Computations, New York, NY, USA, 20–22 March 1972; Springer: Berlin/Heidelberg, Germany, 1972; pp. 85–103. [Google Scholar]
  27. Gens, G.V.; Levner, E.V. Computational complexity of approximation algorithms for combinatorial problems. In Proceedings of the Mathematical Foundations of Computer Science 1979: Proceedings, 8th Symposium, Olomouc, Czechoslovakia, 3–7 September 1979; Springer: Berlin/Heidelberg, Germany, 1979; pp. 292–300. [Google Scholar]
  28. Spirin, V.; Mirny, L.A. Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA 2003, 100, 12123–12128. [Google Scholar] [CrossRef] [PubMed]
  29. Bader, S.; Kühner, S.; Gavin, A.C. Interaction networks for systems biology. FEBS Lett. 2008, 582, 1220–1224. [Google Scholar] [CrossRef] [PubMed]
  30. Zaki, N.; Efimov, D.; Berengueres, J. Protein complex detection using interaction reliability assessment and weighted clustering coefficient. BMC Bioinform. 2013, 14, 163. [Google Scholar] [CrossRef]
  31. Cao, B.; Luo, J.; Liang, C.; Wang, S.; Ding, P. Pce-fr: A novel method for identifying overlapping protein complexes in weighted Protein–Protein interaction networks using pseudo-clique extension based on fuzzy relation. IEEE Trans. Nanobiosci. 2016, 15, 728–738. [Google Scholar] [CrossRef]
  32. Wang, J.; Chen, G.; Liu, B.; Li, M.; Pan, Y. Identifying protein complexes from interactome based on essential proteins and local fitness method. IEEE Trans. Nanobiosci. 2012, 11, 324–335. [Google Scholar] [CrossRef]
  33. Kreimer, A.; Borenstein, E.; Gophna, U.; Ruppin, E. The evolution of modularity in bacterial metabolic networks. Proc. Natl. Acad. Sci. USA 2008, 105, 6976–6981. [Google Scholar] [CrossRef]
  34. Luo, F.; Yang, Y.; Chen, C.F.; Chang, R.; Zhou, J.; Scheuermann, R.H. Modular organization of protein interaction networks. Bioinformatics 2007, 23, 207–214. [Google Scholar] [CrossRef]
  35. Poyatos, J.F.; Hurst, L.D. How biologically relevant are interaction-based modules in protein networks? Genome Biol. 2004, 5, R93. [Google Scholar] [CrossRef]
  36. Ren, J.; Wang, J.; Li, M.; Wang, L. Identifying protein complexes based on density and modularity in Protein–Protein interaction network. BMC Syst. Biol. 2013, 7, S12. [Google Scholar] [CrossRef]
  37. Bóta, A.; Csizmadia, L.; Pluhár, A. Community detection and its use in Real Graphs. In Proceedings of the 2010 Mini-Conference on Applied Theoretical Computer Science , Koper, Slovenia, 13–14 October 2010. [Google Scholar]
  38. Gera, I.; London, A.; Pluhár, A. Greedy algorithm for edge-based nested community detection. In Proceedings of the 2022 IEEE 2nd Conference on Information Technology and Data Science (CITDS), Debrecen, Hungary, 16–18 May 2022; pp. 86–91. [Google Scholar]
  39. Dezso, Z.; Oltvai, Z.N.; Barabási, A.L. Bioinformatics analysis of experimentally determined protein complexes in the yeast Saccharomyces cerevisiae. Genome Res. 2003, 13, 2450–2454. [Google Scholar] [CrossRef]
  40. Pu, S.; Vlasblom, J.; Emili, A.; Greenblatt, J.; Wodak, S.J. Identifying functional modules in the physical interactome of Saccharomyces cerevisiae. Proteomics 2007, 7, 944–960. [Google Scholar] [CrossRef]
  41. Gavin, A.C.; Aloy, P.; Grandi, P.; Krause, R.; Boesche, M.; Marzioch, M.; Rau, C.; Jensen, L.J.; Bastuck, S.; Dümpelfeld, B.; et al. Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440, 631–636. [Google Scholar] [CrossRef]
  42. Bruckner, S.; Hüffner, F.; Komusiewicz, C. A graph modification approach for finding core–periphery structures in protein interaction networks. Algorithms Mol. Biol. 2015, 10, 16. [Google Scholar] [CrossRef]
  43. Meng, X.; Li, W.; Peng, X.; Li, Y.; Li, M. Protein interaction networks: Centrality, modularity, dynamics, and applications. Front. Comput. Sci. 2021, 15, 156902. [Google Scholar] [CrossRef]
  44. Ma, X.; Gao, L. Predicting protein complexes in protein interaction networks using a core-attachment algorithm based on graph communicability. Inf. Sci. 2012, 189, 233–254. [Google Scholar] [CrossRef]
  45. Mete, M.; Tang, F.; Xu, X.; Yuruk, N. A structural approach for finding functional modules from large biological networks. BMC Bioinform. 2008, 9, S19. [Google Scholar] [CrossRef]
  46. Yang, J.; Leskovec, J. Overlapping communities explain core–Periphery organization of networks. Proc. IEEE 2014, 102, 1892–1902. [Google Scholar] [CrossRef]
  47. Vieira, V.d.F.; Xavier, C.R.; Evsukoff, A.G. A comparative study of overlapping community detection methods from the perspective of the structural properties. Appl. Netw. Sci. 2020, 5, 51. [Google Scholar] [CrossRef]
  48. Gu, L.; Han, Y.; Wang, C.; Chen, W.; Jiao, J.; Yuan, X. Module overlapping structure detection in PPI using an improved link similarity-based Markov clustering algorithm. Neural Comput. Appl. 2019, 31, 1481–1490. [Google Scholar] [CrossRef]
  49. Wang, Y.; Qian, X. Functional module identification in protein interaction networks by interaction patterns. Bioinformatics 2014, 30, 81–93. [Google Scholar] [CrossRef] [PubMed]
  50. Aloy, P.; Bottcher, B.; Ceulemans, H.; Leutwein, C.; Mellwig, C.; Fischer, S.; Gavin, A.C.; Bork, P.; Superti-Furga, G.; Serrano, L.; et al. Structure-based assembly of protein complexes in yeast. Science 2004, 303, 2026–2029. [Google Scholar] [CrossRef] [PubMed]
  51. Luo, F.; Li, B.; Wan, X.F.; Scheuermann, R.H. Core and periphery structures in protein interaction networks. BMC Bioinform. 2009, 10, S8. [Google Scholar] [CrossRef] [PubMed]
  52. Bader, G.D.; Hogue, C.W. Analyzing yeast protein–protein interaction data obtained from different sources. Nat. Biotechnol. 2002, 20, 991–997. [Google Scholar] [CrossRef]
  53. Bader, G.D.; Hogue, C.W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform. 2003, 4, 2. [Google Scholar] [CrossRef]
  54. Kourtellis, N.; Alahakoon, T.; Simha, R.; Iamnitchi, A.; Tripathi, R. Identifying high betweenness centrality nodes in large social networks. Soc. Netw. Anal. Min. 2013, 3, 899–914. [Google Scholar] [CrossRef]
  55. Barabasi, A.L.; Oltvai, Z.N. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. 2004, 5, 101–113. [Google Scholar] [CrossRef]
  56. Gosak, M.; Markovič, R.; Dolenšek, J.; Rupnik, M.S.; Marhl, M.; Stožer, A.; Perc, M. Network science of biological systems at different scales: A review. Phys. Life Rev. 2018, 24, 118–135. [Google Scholar] [CrossRef]
  57. Han, J.D.J. Understanding biological functions through molecular networks. Cell Res. 2008, 18, 224–237. [Google Scholar] [CrossRef]
  58. Del Sol, A.; O’Meara, P. Small-world network approach to identify key residues in protein–protein interaction. Proteins Struct. Funct. Bioinform. 2005, 58, 672–682. [Google Scholar] [CrossRef]
  59. Del Sol, A.; Fujihashi, H.; O’Meara, P. Topology of small-world networks of protein–protein complex structures. Bioinformatics 2005, 21, 1311–1315. [Google Scholar] [CrossRef]
  60. Wang, X.; Li, L.; Cheng, Y. An overlapping module identification method in Protein–Protein interaction networks. BMC Bioinform. 2012, 13, S4. [Google Scholar] [CrossRef]
  61. Liu, C.; Li, J.; Zhao, Y. Exploring hierarchical and overlapping modular structure in the yeast protein interaction network. BMC Genom. 2010, 11, S17. [Google Scholar] [CrossRef]
  62. Jaccard, P. The distribution of the flora in the alpine zone. 1. New Phytol. 1912, 11, 37–50. [Google Scholar] [CrossRef]
  63. Goodrich, M.T.; Ozel, E. Modeling the small-world phenomenon with road networks. In Proceedings of the 30th International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 1–4 November 2022; pp. 1–10. [Google Scholar]
  64. Menezes, M.B.; Kim, S.; Huang, R. Constructing a Watts-Strogatz network from a small-world network with symmetric degree distribution. PLoS ONE 2017, 12, e0179120. [Google Scholar] [CrossRef]
  65. Zahiri, J.; Emamjomeh, A.; Bagheri, S.; Ivazeh, A.; Mahdevar, G.; Tehrani, H.S.; Mirzaie, M.; Fakheri, B.A.; Mohammad-Noori, M. Protein complex prediction: A survey. Genomics 2020, 112, 174–183. [Google Scholar] [CrossRef]
  66. Lensink, M.F.; Velankar, S.; Wodak, S.J. Modeling protein–protein and protein–peptide complexes: CAPRI 6th edition. Proteins Struct. Funct. Bioinform. 2017, 85, 359–377. [Google Scholar] [CrossRef]
  67. Xenarios, I.; Salwinski, L.; Duan, X.J.; Higney, P.; Kim, S.M.; Eisenberg, D. DIP, the Database of Interacting Proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30, 303–305. [Google Scholar] [CrossRef] [PubMed]
  68. Stark, C.; Breitkreutz, B.J.; Reguly, T.; Boucher, L.; Breitkreutz, A.; Tyers, M. BioGRID: A general repository for interaction datasets. Nucleic Acids Res. 2006, 34, D535–D539. [Google Scholar] [CrossRef]
  69. Ma, C.Y.; Chen, Y.P.P.; Berger, B.; Liao, C.S. Identification of protein complexes by integrating multiple alignment of protein interaction networks. Bioinformatics 2017, 33, 1681–1688. [Google Scholar] [CrossRef]
  70. Pu, S.; Wong, J.; Turner, B.; Cho, E.; Wodak, S.J. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009, 37, 825–831. [Google Scholar] [CrossRef]
  71. Mewes, H.W.; Frishman, D.; Mayer, K.F.; Münsterkötter, M.; Noubibou, O.; Pagel, P.; Rattei, T.; Oesterheld, M.; Ruepp, A.; Stümpflen, V. MIPS: Analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res. 2006, 34, D169–D172. [Google Scholar] [CrossRef] [PubMed]
  72. Luc, P.V.; Tempst, P. PINdb: A database of nuclear protein complexes from human and yeast. Bioinformatics 2004, 20, 1413–1415. [Google Scholar] [CrossRef] [PubMed]
  73. Kanehisa, M.; Goto, S.; Sato, Y.; Furumichi, M.; Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012, 40, D109–D114. [Google Scholar] [CrossRef]
  74. Dwight, S.S.; Harris, M.A.; Dolinski, K.; Ball, C.A.; Binkley, G.; Christie, K.R.; Fisk, D.G.; Issel-Tarver, L.; Schroeder, M.; Sherlock, G.; et al. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res. 2002, 30, 69–72. [Google Scholar] [CrossRef]
  75. Li, X.; Wu, M.; Kwoh, C.K.; Ng, S.K. Computational approaches for detecting protein complexes from protein interaction networks: A survey. BMC Genom. 2010, 11, S3. [Google Scholar] [CrossRef]
  76. Li, M.; Chen, J.e.; Wang, J.x.; Hu, B.; Chen, G. Modifying the DPClus algorithm for identifying protein complexes based on new topological structures. BMC Bioinform. 2008, 9, 398. [Google Scholar] [CrossRef]
  77. Brohee, S.; Van Helden, J. Evaluation of clustering algorithms for Protein–Protein interaction networks. BMC Bioinform. 2006, 7, 488. [Google Scholar] [CrossRef]
  78. Li, X.L.; Foo, C.S.; Ng, S.K. Discovering protein complexes in dense reliable neighborhoods of protein interaction networks. In Computational Systems Bioinformatics: (Volume 6); World Scientific: Singapore, 2007; pp. 157–168. [Google Scholar]
  79. Friedel, C.C.; Krumsiek, J.; Zimmer, R. Bootstrapping the interactome: Unsupervised identification of protein complexes in yeast. J. Comput. Biol. 2009, 16, 971–987. [Google Scholar] [CrossRef]
  80. Maulik, U.; Mukhopadhyay, A.; Bhattacharyya, M.; Kaderali, L.; Brors, B.; Bandyopadhyay, S.; Eils, R. Mining quasi-bicliques from HIV-1-human protein interaction network: A multiobjective biclustering approach. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012, 10, 423–435. [Google Scholar] [CrossRef]
  81. Cao, B.; Luo, J.; Liang, C.; Wang, S. Identifying protein complexes by combining network topology and biological characteristics. J. Comput. Theor. Nanosci. 2016, 13, 7666–7675. [Google Scholar] [CrossRef]
  82. Wu, Z.; Liao, Q.; Liu, B. idenPC-MIIP: Identify protein complexes from weighted PPI networks using mutual important interacting partner relation. Briefings Bioinform. 2021, 22, 1972–1983. [Google Scholar] [CrossRef]
  83. Cherry, J.M.; Adler, C.; Ball, C.; Chervitz, S.A.; Dwight, S.S.; Hester, E.T.; Jia, Y.; Juvik, G.; Roe, T.; Schroeder, M.; et al. SGD: Saccharomyces genome database. Nucleic Acids Res. 1998, 26, 73–79. [Google Scholar] [CrossRef]
  84. Li, B.; Liao, B. Protein complexes prediction method based on core—Attachment structure and functional annotations. Int. J. Mol. Sci. 2017, 18, 1910. [Google Scholar] [CrossRef]
  85. Xiao, Q.; Luo, P.; Li, M.; Wang, J.; Wu, F.X. A Novel Core-Attachment–Based Method to Identify Dynamic Protein Complexes Based on Gene Expression Profiles and PPI Networks. Proteomics 2019, 19, 1800129. [Google Scholar] [CrossRef]
Figure 1. A general structure of a PPI network comprised of two complexes, overlapping proteins, and peripheral and interspersed proteins.
Figure 2. A simple graph representation of the PPI network structure shown in Figure 1. In the network, $v$ denotes the nodes (proteins) and $w$ denotes the edge weights (confidence scores).
Figure 3. A comparison of the performance of WECALM and other existing algorithms on the NewMIPS complexes. (A) The BioGRID dataset; (B) the DIP dataset. The evaluation metrics are Recall, Precision, F-measure, coverage rate (CR), and Maximum Matching Ratio (MMR). The length of each bar gives the overall composite score: the longer the bar, the better the algorithm's overall performance.
Figure 4. A comparison of the performance of WECALM and other existing algorithms on the CYC2008 complexes. (A) The BioGRID dataset; (B) the DIP dataset. The evaluation metrics are Recall, Precision, F-measure, coverage rate (CR), and Maximum Matching Ratio (MMR). The length of each bar gives the overall composite score: the longer the bar, the better the algorithm's overall performance.
Figure 5. The effect of $\pi$ on the performance of WECALM in terms of the Recall, Precision, F-measure, CR, and MMR metrics, where $\pi$ is the predefined overlapping threshold. (A) Performance on BioGRID based on NewMIPS. (B) Performance on DIP based on NewMIPS. (C) Performance on BioGRID based on CYC2008. (D) Performance on DIP based on CYC2008. The MMR and F-measure are maximal at $\pi = 0.2$ and $\pi = 0.4$, respectively, for BioGRID on both NewMIPS and CYC2008.
Figure 6. The effect of $\eta$ on the performance of WECALM in terms of the Recall, Precision, F-measure, CR, and MMR metrics, where $\eta$ is the predefined local modularity threshold. (A) Performance on BioGRID based on NewMIPS. (B) Performance on DIP based on NewMIPS. (C) Performance on BioGRID based on CYC2008. (D) Performance on DIP based on CYC2008. The Precision and F-measure are maximal at $\eta = 0.8$ and $\eta = 0.4$, respectively, on both NewMIPS and CYC2008.
Figure 7. The effect of $\omega$ on the performance of WECALM in terms of the Recall, Precision, F-measure, CR, and MMR metrics, where $\omega$ is the predefined core structural similarity threshold. (A) Performance on BioGRID based on NewMIPS. (B) Performance on DIP based on NewMIPS. (C) Performance on BioGRID based on CYC2008. (D) Performance on DIP based on CYC2008. The Precision and F-measure are maximal at $\omega = 0.8$ and $\omega = 0.4$, respectively, on both NewMIPS and CYC2008.
Table 1. The general details of PPI networks used for the simulation.
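| Dataset | Number of Proteins | Number of Edges | Network Density |
|---|---|---|---|
| BioGRID | 5640 | 59,748 | $3.16 \times 10^{-6}$ |
| DIP | 4930 | 17,202 | $1.42 \times 10^{-3}$ |
| Human | 15,459 | 144,687 | $1.21 \times 10^{-3}$ |
| Yeast | 6194 | 74,826 | $3.90 \times 10^{-3}$ |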
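As an illustration of the Network Density column, the sketch below computes the density of an undirected simple graph, $2E / (N(N-1))$, from the protein and edge counts. That this is the definition used for the table is an assumption, since the formula is not given in this section; applied to the DIP, Human, and Yeast rows, it reproduces the tabulated values to the precision shown. The function name `network_density` is illustrative.

```python
def network_density(num_proteins: int, num_edges: int) -> float:
    """Density of an undirected simple graph: 2E / (N * (N - 1)).

    Assumption: no self-loops and no duplicate edges.
    """
    if num_proteins < 2:
        raise ValueError("density needs at least two nodes")
    return 2 * num_edges / (num_proteins * (num_proteins - 1))


# (dataset, number of proteins, number of edges) taken from Table 1
rows = [
    ("DIP", 4930, 17202),
    ("Human", 15459, 144687),
    ("Yeast", 6194, 74826),
]
for name, n, e in rows:
    print(f"{name}: {network_density(n, e):.2e}")
# DIP: 1.42e-03
# Human: 1.21e-03
# Yeast: 3.90e-03
```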
Table 2. The details of the benchmark protein complexes.
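| Complex Datasets | Number of Protein Complexes | Overlapping Complexes | Non-Overlapping Complexes | Protein Coverage | Average Size |
|---|---|---|---|---|---|
| NewMIPS | 328 | 283 | 45 | 1171 | 14.93 |
| CYC2008 | 236 | 108 | 128 | 1628 | 4.71 |
| Human complexes | 2289 | - | - | 6206 | 8.57 |
| Yeast complexes | 1045 | - | - | 2773 | 8.92 |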
Table 3. An evaluation of computational complexity and accuracy of WECALM and other algorithms.
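| Dataset | Algorithm | $CP(v)$ | F-Measure | CR | MMR | Sep | ACC | CPU Run Time (s) |
|---|---|---|---|---|---|---|---|---|
| Human | MCL | 315 | 0.1001 | 0.1759 | 0.0105 | 0.1753 | 0.2167 | 5906.34 |
| | COACH | 4484 | 0.2455 | 0.5408 | 0.0677 | 0.5216 | 0.2777 | 2851.05 |
| | EWCA | 1979 | 0.4048 | 0.5221 | 0.0964 | 0.6081 | 0.5221 | 29.37 |
| | CFinder | 449 | 0.1256 | 0.2834 | 0.0116 | 0.3912 | 0.2511 | 3896.35 |
| | GMFTP | 773 | 0.2651 | 0.4193 | 0.0419 | 0.4917 | 0.3852 | 254.67 |
| | Core | 576 | 0.1621 | 0.3267 | 0.1267 | 0.3573 | 0.2778 | 2853.14 |
| | CALM | 1108 | 0.5127 | 0.5182 | 0.1394 | 0.6894 | 0.5289 | 198.39 |
| | ClusterONE | 375 | 0.1026 | 0.3071 | 0.0207 | 0.3773 | 0.2975 | 4895.78 |
| | CMC | 672 | 0.1251 | 0.2503 | 0.0183 | 0.2975 | 0.3313 | 3904.83 |
| | ProRank+ | 838 | 0.3651 | 0.2856 | 0.0687 | 0.5526 | 0.5613 | 282.66 |
| | WECALM | 2367 | 0.4255 | 0.5155 | 0.0981 | 0.6155 | 0.6219 | 28.45 |
| Yeast | MCL | 298 | 0.1104 | 0.2761 | 0.0117 | 0.1625 | 0.1395 | 4967.47 |
| | COACH | 1551 | 0.2083 | 0.5521 | 0.0466 | 0.3583 | 0.3117 | 3603.31 |
| | EWCA | 936 | 0.4199 | 0.6182 | 0.0982 | 0.5904 | 0.5879 | 18.54 |
| | CFinder | 351 | 0.1429 | 0.2749 | 0.0281 | 0.3453 | 0.4163 | 3432.07 |
| | GMFTP | 675 | 0.2763 | 0.3129 | 0.0309 | 0.5145 | 0.4092 | 229.89 |
| | Core | 402 | 0.2124 | 0.2968 | 0.3285 | 0.1517 | 0.3218 | 2543.34 |
| | CALM | 732 | 0.4015 | 0.6787 | 0.1433 | 0.6261 | 0.6532 | 154.89 |
| | ClusterONE | 317 | 0.2012 | 0.2767 | 0.0285 | 0.3371 | 0.3255 | 3989.92 |
| | CMC | 589 | 0.2115 | 0.1975 | 0.0198 | 0.2934 | 0.3553 | 2987.63 |
| | ProRank+ | 516 | 0.2712 | 0.2816 | 0.0487 | 0.5471 | 0.5602 | 251.54 |
| | WECALM | 1891 | 0.4216 | 0.6394 | 0.0487 | 0.64131 | 0.6534 | 17.65 |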
$CP(v)$: number of detected protein complexes; CR: Coverage Rate; MMR: Maximum Matching Ratio; Sep: Separation; ACC: Geometric Mean Accuracy.
Table 4. A functional enrichment analysis of protein complexes detected on the BioGRID and DIP complexes.
| Dataset | Algorithm | $CP(v)$ | $p < 10^{-15}$ | $p < 10^{-10}$ | $p < 10^{-5}$ | $p < 10^{-2}$ | Significant Detected $CP(v)$ |
|---|---|---|---|---|---|---|---|
| BioGRID | MCL | 121 | 41 (33.88%) | 28 (23.14%) | 26 (21.49%) | 12 (9.92%) | 107 (88.43%) |
| | COACH | 166 | 76 (45.78%) | 32 (19.28%) | 37 (22.29%) | 16 (9.64%) | 161 (96.98%) |
| | EWCA | 1388 | 658 (47.41%) | 211 (15.20%) | 299 (21.54%) | 173 (12.46%) | 1341 (96.61%) |
| | CFinder | 352 | 103 (29.26%) | 53 (15.10%) | 78 (22.16%) | 35 (9.94%) | 269 (76.42%) |
| | GMFTP | 597 | 73 (12.23%) | 59 (9.88%) | 156 (26.13%) | 161 (26.97%) | 449 (75.21%) |
| | Core | 576 | 255 (44.27%) | 105 (18.23%) | 68 (11.81%) | 35 (6.08%) | 463 (80.38%) |
| | CALM | 1108 | 587 (52.98%) | 236 (21.29%) | 116 (10.47%) | 96 (8.66%) | 1035 (93.41%) |
| | ClusterONE | 294 | 107 (36.40%) | 35 (11.91%) | 43 (14.62%) | 25 (8.50%) | 210 (71.43%) |
| | CMC | 1113 | 125 (11.23%) | 89 (7.99%) | 258 (23.18%) | 360 (32.34%) | 832 (74.75%) |
| | ProRank+ | 746 | 479 (64.21%) | 105 (14.08%) | 97 (13.00%) | 47 (6.30%) | 728 (97.59%) |
| | WECALM | 1412 | 687 (48.65%) | 217 (15.37%) | 312 (22.09%) | 172 (12.18%) | 1388 (98.30%) |
| DIP | MCL | 142 | 41 (28.87%) | 29 (20.42%) | 17 (11.97%) | 26 (18.31%) | 113 (79.58%) |
| | COACH | 329 | 21 (6.38%) | 25 (7.59%) | 66 (20.06%) | 32 (9.73%) | 144 (43.77%) |
| | EWCA | 964 | 188 (19.50%) | 126 (13.07%) | 319 (33.09%) | 236 (24.48%) | 869 (90.15%) |
| | CFinder | 352 | 157 (44.60%) | 39 (11.08%) | 31 (8.81%) | 45 (12.78%) | 272 (77.27%) |
| | GMFTP | 548 | 43 (7.85%) | 36 (6.57%) | 105 (19.16%) | 166 (30.29%) | 350 (63.87%) |
| | Core | 412 | 131 (31.79%) | 87 (21.12%) | 52 (12.62%) | 45 (10.92%) | 315 (76.46%) |
| | CALM | 755 | 256 (33.91%) | 127 (16.82%) | 112 (14.83%) | 108 (14.31%) | 603 (80.53%) |
| | ClusterONE | 315 | 119 (37.78%) | 49 (15.56%) | 38 (12.06%) | 29 (9.21%) | 235 (74.60%) |
| | CMC | 303 | 3 (0.99%) | 8 (2.64%) | 58 (19.14%) | 77 (25.41%) | 146 (48.18%) |
| | ProRank+ | 338 | 74 (21.89%) | 77 (22.78%) | 125 (36.98%) | 41 (12.13%) | 319 (93.79%) |
| | WECALM | 1018 | 269 (26.42%) | 187 (18.37%) | 358 (35.17%) | 165 (16.21%) | 979 (96.17%) |
$CP(v)$: Protein Complex; $p$: p-value.
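The arithmetic behind a row of Table 4 can be sketched as follows, assuming the four p-value columns are disjoint bins whose counts add up to the number of significant complexes, and that every percentage is taken relative to the total number of detected complexes. This reading is inferred from the numbers rather than stated in the table; the WECALM row for DIP is consistent with it. The function name `enrichment_summary` is illustrative.

```python
def enrichment_summary(total_detected: int, bin_counts: list[int]):
    """Return (significant count, significant %, per-bin %) for one table row.

    Assumption: bin_counts are disjoint p-value bins, so their sum is the
    number of significant complexes; percentages are relative to
    total_detected and rounded to two decimals.
    """
    if total_detected <= 0:
        raise ValueError("total_detected must be positive")
    significant = sum(bin_counts)
    shares = [round(100 * c / total_detected, 2) for c in bin_counts]
    return significant, round(100 * significant / total_detected, 2), shares


# WECALM on DIP: 1018 detected complexes, bin counts from the four p-value columns
significant, pct, shares = enrichment_summary(1018, [269, 187, 358, 165])
print(significant, pct, shares)
# 979 96.17 [26.42, 18.37, 35.17, 16.21]
```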
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Ochieng, P.J.; Dombi, J.; Kalmár, T.; Krész, M. A Special Structural Based Weighted Network Approach for the Analysis of Protein Complexes. Appl. Sci. 2023, 13, 6388. https://doi.org/10.3390/app13116388