Simplified Swarm Optimization-based Function Module Detection in Protein–protein Interaction Networks

Proteomics research has become one of the most important topics in the life sciences and natural sciences. At present, research on protein–protein interaction networks (PPIN) mainly focuses on detecting protein complexes or function modules. However, existing approaches are either ineffective or incomplete. In this paper, we investigate detection mechanisms of functional modules in PPIN, including open databases, existing detection algorithms, and recent solutions. We then describe the proposed approach, which is based on the simplified swarm optimization (SSO) algorithm and knowledge from the Gene Ontology (GO). The proposed solution uses the SSO algorithm to cluster proteins with similar functions and imports biological gene ontology knowledge to further identify function complexes and improve detection accuracy. Furthermore, we use datasets from four species for the experiments: fruitfly, mouse, scere, and human. The test and analysis results show that the proposed solution is feasible and efficient, and could achieve higher prediction accuracy than existing approaches.


Introduction
Proteomics is one of the most important topics in the fields of life science and natural science [1][2][3][4][5]. Considering that proteins alone rarely exhibit their biological functions in individuals, the understanding of protein-protein interactions (PPI) [6] is the basis for revealing protein activity and promotes the study of various diseases and the development of new drugs.
In the past 10 years, substantial work has been conducted to promote research in the field of PPI, such as publications in Nature [2], Science [3], Proceedings of the National Academy of Sciences [3], and Nucleic Acids Research [7]. Available data on PPI have been greatly enriched by the fast development of high-throughput screening [7] and data mining technologies [8]. Several widely used and comprehensive open datasets have also been released, for instance, the Biomolecular Interaction Network Database (BIND) [9], Database of Interacting Proteins (DIP) [10], IntAct [11], Human Protein Reference Database (HPRD) [12], and Molecular Interaction Database (MINT) [13,14].
However, existing solutions are incomplete or inaccurate because of the following technical challenges. On the one hand, high-throughput screening technology generates a large amount of noisy data with high false-positive rates, while the experimental method misses many real interactions (false negatives) [15]. On the other hand, the existing computational approaches (described in Section 2) are inefficient, computationally complex, or lack convincing results.
Appl. Sci. 2017, 7, 412
In this paper, we investigate protein function module detection and propose a lightweight and efficient simplified swarm optimization-based protein function module detection approach, with the following contributions:
1. We investigate PPI datasets and existing function module detection methods and select protein-protein interaction data of four typical species from the DIP database for the experiment. A specific data crawler is developed to extract data features from these datasets.
2. The proposed PPIN function module detection is described from several aspects: system model, feature selection, mathematical description, model optimization, etc. The proposed solution implements an SSO algorithm for clustering proteins with similar function and imports biological gene ontology knowledge for further identification.
3. Experiments are conducted to validate the feasibility and efficiency of the proposed approach. The evaluation of "degree of polymerization" and "similarity between classes" further supports the precision improvement and correctness of our proposed solution.
The paper is organized as follows. Section 2 introduces the existing research, including the graph theory-, machine learning-, and intelligent algorithm-based approaches. Section 3 describes our solution, including the system model, feature extraction, and SSO-based approach. The dataset and result evaluation are explained in Section 4. Finally, experiment conclusions are presented in Section 5.

Protein-Protein Interaction Datasets
Several protein-protein interaction datasets are described in the following.
BIND (Biomolecular Interaction Network Database) contains the known interactions among biological molecules, not only among proteins but also between proteins and DNA, RNA, small molecules, lipids, and carbohydrate substances. BIND is updated daily and has extensive coverage, including human, fruit flies, yeast, nematodes, and other species.
DIP (Database of Interacting Proteins) was created to establish a simple, easy-to-use, and highly credible public PPI database. It specializes in storing binary PPIs from the literature confirmed by experiments, as well as the protein complexes from the Protein Data Bank (PDB).
IntAct (Molecular Interaction Database) mainly records binary interactions together with their experimental methods, experimental conditions, and interaction domain structures in humans, yeast, fruit flies, Escherichia coli, and other species. IntAct queries are divided into basic queries and advanced (more accurate) queries.
HPRD (Human Protein Reference Database) contains protein annotations, PPIs, posttranslational modifications, subcellular localizations, and other comprehensive information.
MINT (Molecular Interaction Database) mainly stores physical interactions of proteins, particularly PPIs of mammals. It also contains the PPIs of yeast, fruit flies, and viruses.
Considering the differences in definitions and the inconsistency among databases, Gene Ontology (GO), which is developed and maintained by the GO Consortium, should be introduced for sharing and interoperability among bioinformatics data. In this way, the retrieval results from different databases can be unified.

Existing Works
Existing works on the detection of protein function modules can be divided into three categories.
Graph theory-based approaches. Similar to a computer network, graph theory is introduced to improve the detection of protein functional modules, mainly based on three kinds of algorithm: hierarchical, partitioning, and density algorithms [16]. The hierarchical algorithm (such as the modularity division-based method [17]) is based on the similarity of the connections between nodes. For the partitioning algorithm, the most representative method is based on restricted neighborhood search clustering (RNSC) [18]. Although both hierarchical and partitioning algorithms are easy to understand and implement, the number of clusters must be determined beforehand and the function modules cannot overlap.
Machine learning-based approaches. Considering the disadvantages (poor scalability and low clustering quality) of the original Markov Clustering (MCL) algorithm, Lei et al. (2015) proposed an improved MCL clustering algorithm for PPIN [19] by introducing two parameters: punishment and mutation factors. This approach improves convergence speed but leads to substantial computational complexity. In [20], Deddy proposed a rapid and lightweight hidden-layer neural network prediction algorithm based on the Extreme Learning Machine (ELM) algorithm. It exploits the speed advantage of the ELM algorithm and achieves better protein function module prediction results.
Intelligence algorithm-based approaches. Swarm intelligence algorithms have also been applied to PPIN function module detection. Examples include Ant Colony Optimization (ACO) [21], Particle Swarm Optimization (PSO) [22], and Artificial Bee Colony (ABC) [23]. Sallim applied the ACO algorithm to the PPIN complex clustering problem and further proposed optimization strategies in protein interaction networks [24]. In 2012, Ji introduced a novel ACO-based functional module detection (NACO-FDM) [25] algorithm to improve the efficiency of searching for an optimal path. However, this algorithm easily falls into a local optimum. In the literature, another mechanism, ACO-MAE, which combines ACO with the idea of multi-agent evolution (MAE), was developed to achieve better prediction accuracy.
In addition, some other works have demonstrated higher performance with a mixture of graph-based and machine learning-based approaches [26,27]. These algorithms first calculate parameters on the protein residue networks and then apply machine learning. However, this approach still needs many labeled samples for training.

Clustering Evaluation
In this paper, the clustering evaluation indices are the degree of polymerization inside a protein function module (cohesion) and the degree of deviation between modules (separation). Cohesion refers to the degree of similarity among the data objects in the same category: the higher the degree of polymerization, the higher the similarity. Separation refers to the dissimilarity between two different protein function modules: the higher the degree of separation between two categories, the greater the distance between the two cluster centers. Mathematically, the cohesion and separation functions are defined in terms of S and D, which represent the values of protein nodes in the functional similarity matrix and the distance matrix, respectively.
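To make the two indices concrete, the following minimal Python sketch computes them from a similarity matrix S and a distance matrix D. The exact formulas are an assumption made for illustration: cohesion is taken as the mean pairwise similarity inside a module, and separation as the mean pairwise distance between the proteins of two modules (a stand-in for the distance between cluster centers). The function names and the toy matrices are hypothetical.

```python
from itertools import combinations

def cohesion(S, module):
    """Mean pairwise similarity inside one module (assumed definition)."""
    pairs = list(combinations(module, 2))
    if not pairs:
        return 0.0  # a single protein has no internal pairs
    return sum(S[i][j] for i, j in pairs) / len(pairs)

def separation(D, module_a, module_b):
    """Mean distance between proteins of two modules (assumed definition)."""
    total = sum(D[i][j] for i in module_a for j in module_b)
    return total / (len(module_a) * len(module_b))

# Toy functional similarity matrix for four proteins (illustrative values only).
S = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
# Assumed relation between distance and similarity for this toy example.
D = [[1.0 - s for s in row] for row in S]

modules = [[0, 1], [2, 3]]
coh = [cohesion(S, m) for m in modules]
sep = separation(D, modules[0], modules[1])
```

A good clustering under these assumed definitions has high cohesion for every module and high separation for every pair of modules.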

Discussion
However, there are still a few disadvantages in existing research. Graph-based approaches are far from precise (with a highest precision rate of 46% [16]) because some clusters may be too thin due to the considerable weights between loosely connected nodes. Existing machine/deep learning-based approaches need a huge number of labeled samples for training, which is difficult to obtain in the PPIN field. In contrast, although the intelligence algorithms applied so far (mainly ACO-related approaches) have shown better precision rates and efficiency than graph-based solutions, more intelligent algorithms should be considered and implemented.
The detection of protein function modules is an NP-hard problem [28]. Since PSO-related solutions are efficient and have been applied to different kinds of NP-hard problems, we propose applying Simplified Swarm Optimization (SSO), an enhancement of PSO, to the detection of protein function modules. Theoretically, an SSO-based solution is capable of achieving better precision than the PSO algorithm while reducing computing complexity.

Simplified Swarm Optimization-Based Detection
In this section, we describe the proposed SSO-based solution in four steps: system model, feature extraction, mathematical description, and model optimization.

Interaction Model
Figure 1a-e illustrates the evolution process of protein function module detection. First, the PPI network is abstracted into the format of a protein distance matrix. The structure model is built from the measure of distance between each pair of proteins. Afterwards, the SSO algorithm is applied to search for the shortest path between nodes. Finally, the cutting and filtering strategies are defined and implemented to generate a clustering result.

Feature Extraction
After acquiring a dataset from the DIP and GO databases, features are extracted in the following steps:
1. Noise Filter. Noise data refers to errors, redundant data, or abnormal data in the crawled data. For example, in crawled data based on interaction.xml, the tag field "DIP-nnE" may be empty or not found. Therefore, eliminating noise and redundant data is the first step before the experiment.
2. Feature Selection. Feature selection is performed through manual inspection of the protein XML data. For example, in the main part of the XML file, the tag names "interactorList" and "interactionList" indicate the interaction relationships among protein nodes.
3. Feature Extraction and Reformat. After the feature selection, related data (e.g., protein id and interactor id) are extracted, reformatted, and stored in the structured database.
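The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the XML layout is a simplified, hypothetical stand-in for the crawled files (only the tag names "interactorList" and "interactionList" and the "DIP-nnE" style of identifier come from the text; the `participant` element, the `ref` attribute, and the function name are invented for the example), and the structured database is replaced by a plain list of protein pairs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real crawled files are richer than this.
SAMPLE_XML = """
<entry>
  <interactorList>
    <interactor id="DIP-1N"/>
    <interactor id="DIP-2N"/>
    <interactor id="DIP-3N"/>
  </interactorList>
  <interactionList>
    <interaction id="DIP-10E"><participant ref="DIP-1N"/><participant ref="DIP-2N"/></interaction>
    <interaction id="DIP-11E"><participant ref="DIP-2N"/><participant ref="DIP-1N"/></interaction>
    <interaction id=""><participant ref="DIP-1N"/><participant ref="DIP-3N"/></interaction>
    <interaction id="DIP-12E"><participant ref="DIP-3N"/><participant ref="DIP-9N"/></interaction>
    <interaction id="DIP-13E"><participant ref="DIP-2N"/><participant ref="DIP-3N"/></interaction>
  </interactionList>
</entry>
"""

def extract_interactions(xml_text):
    """Return a sorted list of unique (protein_a, protein_b) pairs."""
    root = ET.fromstring(xml_text.strip())
    # Feature selection: only interactor ids and interaction participants are used.
    known = {n.get("id") for n in root.iter("interactor") if n.get("id")}
    edges = set()
    for inter in root.iter("interaction"):
        if not inter.get("id"):
            continue  # noise: interaction id empty or missing
        refs = [p.get("ref") for p in inter.iter("participant")]
        if len(refs) != 2 or any(r not in known for r in refs):
            continue  # noise: malformed record or unknown protein
        a, b = sorted(refs)
        if a != b:
            edges.add((a, b))  # the set removes redundant duplicates
    return sorted(edges)

pairs = extract_interactions(SAMPLE_XML)
```

In this sample, the duplicate record, the record with an empty id, and the record that references an unknown protein are all dropped, leaving two protein pairs.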

Model Establishment
Assume an initial particle swarm of size $n$ and a problem space of dimension $m$. The location of particle $i$ at time $t$ is $X_i^t = (x_{i1}^t, x_{i2}^t, \ldots, x_{im}^t)$, where $i = 1, 2, 3, \ldots, n$ and $x_{im}^t$ is the value of the $i$-th particle with respect to the $m$-th dimension of the feature space at time $t$. The best location that particle $i$ has reached during the search is marked as $p_i^t$, and the optimal location for the group is $g^t$. Therefore, the location of particle $i$ in dimension $j$ at time $t$ is described by the following formula:

$$
x_{ij}^{t} =
\begin{cases}
x_{ij}^{t-1}, & \rho \in [0, C_w) \\
p_{ij}^{t-1}, & \rho \in [C_w, C_p) \\
g_{j}^{t-1}, & \rho \in [C_p, C_g) \\
X, & \rho \in [C_g, 1)
\end{cases}
\qquad (3)
$$

In Equation (3), $X$ represents a new, randomly generated value of the particle in the given dimension; $\rho$ is a random number in (0, 1); and $C_w$, $C_p$, and $C_g$ are three predetermined positive constants with $C_w < C_p < C_g$.
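As an illustration, the position update can be sketched in Python, assuming the standard four-way SSO rule: for each dimension, a random number decides whether the particle keeps its current value, copies its personal best, copies the group best, or takes a fresh random value. The function name, the argument names, and the search bounds `lower` and `upper` are hypothetical and chosen only for this example.

```python
import random

def sso_update(x, pbest, gbest, cw, cp, cg, lower, upper, rng=random):
    """One SSO position update for a single particle (assumed standard rule)."""
    new_x = []
    for j in range(len(x)):
        rho = rng.random()  # random number in [0, 1)
        if rho < cw:
            new_x.append(x[j])        # keep the current value
        elif rho < cp:
            new_x.append(pbest[j])    # copy from the particle's best location
        elif rho < cg:
            new_x.append(gbest[j])    # copy from the group's best location
        else:
            new_x.append(rng.uniform(lower, upper))  # new random value
    return new_x

# Example call with three dimensions and constants satisfying cw < cp < cg.
rng = random.Random(0)
x = [1.0, 2.0, 3.0]
pbest = [4.0, 5.0, 6.0]
gbest = [7.0, 8.0, 9.0]
updated = sso_update(x, pbest, gbest, 0.1, 0.55, 0.8, 0.0, 10.0, rng)
```

Unlike PSO, this update needs no velocity vector; each dimension is resolved with a single random draw and a comparison.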
In this study, we use the topological structure of a PPIN [29] as the basis, with the individual proteins as nodes and the interactions between proteins as edges, to construct a PPIN model (the interaction model is shown in Figure 2 and the adjacency matrix in Table 1). An interaction between two proteins is denoted as 1, whereas no relationship between two proteins is denoted as 0.
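A minimal Python sketch of this model builds the 0/1 adjacency matrix from a list of interacting protein pairs. The protein names and the function name are hypothetical; the interactions are treated as undirected, so the matrix is symmetric.

```python
def build_adjacency(proteins, interactions):
    """Build a symmetric 0/1 adjacency matrix for a PPIN."""
    index = {p: k for k, p in enumerate(proteins)}
    n = len(proteins)
    A = [[0] * n for _ in range(n)]
    for a, b in interactions:
        i, j = index[a], index[b]
        A[i][j] = 1  # interaction between the two proteins
        A[j][i] = 1  # undirected: mirror the entry
    return A

# Illustrative network with four hypothetical proteins.
proteins = ["P1", "P2", "P3", "P4"]
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
A = build_adjacency(proteins, interactions)
```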

Therefore, the distance $d_{ij}$ between proteins $i$ and $j$ (the difference between the two proteins) can be calculated according to Equation (4). Normally, the value of $d_{ij}$ is greater than 0; in the special case where proteins $i$ and $j$ are in completely different function modules, $d_{ij}$ reaches its highest value, 1.

Parameter Setting
Parameter setting plays a key role in detecting function modules. In the SSO algorithm, the initial location of the particle swarm is set randomly. In addition, $C_w$, $C_p$, and $C_g$ are set as 0.25, 0.5, and 0.75, respectively. The values of $C_w$, $C_p$, and $C_g$ in this paper are set as 0.1, 0.55, and 0.8 to expand the search area to a global search at the beginning of the iteration. Table 2 shows the parameter settings of the SSO and PSO algorithms used in the experiments (t denotes the t-th iteration, and Max_GEN denotes the maximum number of iterations).

Model Optimization
Model optimization is divided into two main parts: module planning based on function information and module planning based on topology.

Module Planning Based on Function Information
The objective of this step is to merge similar protein function modules (PFMs). The basic idea is to measure the similarity of two modules; when the similarity is greater than a certain threshold, the two modules are merged. The module similarity is given by Equation (5), where $M_S$ and $M_T$ represent the sizes of the two protein function modules (the number of proteins in each), and $s(i, j)$ is given by Equation (6). Among these parameters, $f_{ij}$ is the similarity function based on gene topology and is characterized by Equation (7) [30]. In Equation (7), $g_i$ and $g_j$ represent the annotation values of proteins $i$ and $j$ in the Gene Ontology, respectively [31]. A greater value of $f_{ij}$ indicates higher similarity between the two proteins.
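The merging procedure can be sketched in Python as follows. The module similarity used here, the sum of pairwise protein similarities divided by the product of the two module sizes, is an assumption for illustration and is not necessarily the normalization of Equation (5); the pairwise similarity matrix `s` stands in for the values of s(i, j). All function names and numbers are hypothetical.

```python
def module_similarity(mod_s, mod_t, s):
    """Average pairwise similarity between two modules (assumed form)."""
    total = sum(s[i][j] for i in mod_s for j in mod_t)
    return total / (len(mod_s) * len(mod_t))

def merge_similar_modules(modules, s, threshold):
    """Greedily merge the most similar pair until none exceeds the threshold."""
    modules = [list(m) for m in modules]
    while True:
        best = None
        for a in range(len(modules)):
            for b in range(a + 1, len(modules)):
                sim = module_similarity(modules[a], modules[b], s)
                if sim > threshold and (best is None or sim > best[0]):
                    best = (sim, a, b)
        if best is None:
            return modules  # no remaining pair is similar enough
        _, a, b = best
        modules[a] = modules[a] + modules[b]
        del modules[b]

# Toy pairwise similarity values for four proteins (illustrative only).
s = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
merged = merge_similar_modules([[0, 1], [2], [3]], s, threshold=0.5)
```

In this toy case, module [2] is absorbed into [0, 1] because their average similarity is 0.75, while protein 3 stays in its own module.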

Module Planning Based on Topology
This step measures the density of each initial protein function module and removes sparse protein modules through a filter setting. The density is calculated according to Equation (8), where n denotes the number of proteins in the current protein function module and e represents the number of interactions in the module.
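A possible Python sketch of this filter is given below. It assumes the usual density of an undirected graph, 2e / (n(n - 1)), which may differ from the exact form of Equation (8); the function names, the adjacency matrix, and the cutoff `min_density` are hypothetical.

```python
def module_density(module, A):
    """Density of a module, assuming the form 2e / (n * (n - 1))."""
    n = len(module)
    if n < 2:
        return 0.0  # density is undefined for fewer than two proteins
    # Count each interaction inside the module once.
    e = sum(A[i][j] for k, i in enumerate(module) for j in module[k + 1:])
    return 2.0 * e / (n * (n - 1))

def filter_sparse_modules(modules, A, min_density):
    """Keep only modules whose density reaches the filter setting."""
    return [m for m in modules if module_density(m, A) >= min_density]

# Toy network: proteins 0-2 form a triangle, proteins 3-4 share one interaction.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
kept = filter_sparse_modules([[0, 1, 2], [2, 3, 4]], A, min_density=0.5)
```

Here the triangle has density 1 and is kept, whereas the second module has density 1/3 and is filtered out.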

Dataset Description
We select datasets from four different species for the experiments: fruitfly, mouse, scere, and human. Additionally, we use the GO (Gene Ontology) to unify the format of the data from the four species. After extraction and matching, the final data statistics are shown in Table 3.

Complexity and Running Times
First, we compare the complexity of SSO and PSO (a typical intelligent algorithm). Assume that the iteration number of particle i is $N_i$, $i = 1, 2, \ldots, m$, where m is the maximum number of iterations, and that each particle in each iteration requires the computational time $T_t$. The total execution time of the PSO algorithm for optimal operation is then $N \cdot m \cdot T_t$. For SSO, assuming that each particle in each iteration requires the computational time $D_t$, the total execution time in optimal operation is $N \cdot m \cdot D_t$, so the two totals differ only in the per-iteration cost of a particle. Table 4 further illustrates the experimental comparison and shows that the SSO algorithm is more efficient than the PSO algorithm.
In addition, for the data of the four species, eight different threshold values were selected in the experiment: 0.05, 0.055, 0.06, 0.065, 0.07, 0.075, 0.08, and 0.085, with the results illustrated in Figures 3-5. The results show that when the threshold value increases to a certain extent, the unnecessary protein filtration is greatly reduced (Figure 3b) and the size of the protein function module (PFM) increases (Figure 3a). However, the other four aspects of the effect, including the degree of polymerization in the module, the deviation degree between modules, and so on (corresponding to Figures 4 and 5), remained nearly at the same level.
In order to evaluate the efficiency and feasibility of the SSO algorithm for protein module detection, we take the human dataset and conduct experiments for evaluation (based on the cluster indices described in Section 2.3). The human species is selected because it has more nodes than the other experimental data and better data integrity, which can better reflect the characteristics of the algorithm. Figure 6 shows the results under the detection strategies of PSO and SSO.
In Figure 6, we find that the curves of the PSO and SSO strategies are more or less intertwined. The difference between PSO and SSO is not obvious in the index of "cohesion"; however, the curve of SSO is more stable than that of the PSO algorithm. This indicates that the SSO-based solution outperforms the PSO-based approach in "separation".
Meanwhile, Table 5 indicates that the number of function modules generated by the SSO algorithm is slightly lower than that of PSO for the fruitfly species. This may be because of the small number of protein nodes in the fruitfly species. Table 6 shows the filtered protein number for the PSO and SSO algorithms, indicating that the results of SSO are significantly better than those of PSO, especially as the number of proteins increases. Table 7 shows the degree of polymerization in the module for both the PSO and SSO algorithms. A higher value indicates higher similarity in the module. The result also reveals that the two algorithms are relatively close; however, the SSO algorithm has better stability.

Conclusions
In this study, we introduce relevant research on protein interaction networks conducted in recent years, including the commonly used protein databases and existing detection methods. We then describe our proposed SSO algorithm for the detection of protein function modules (PFMs) in PPIN. In addition, biological gene ontology knowledge is combined to improve the prediction accuracy. The performance of SSO is compared with existing work (typically the PSO algorithm) through the analysis of the experimental results. The results show the feasibility and efficiency of our proposed SSO algorithm. All the datasets and code related to this paper are available from https://github.com/wulingting/PPIN-SSO-and-PSO-algorithms.

Figure 5. (a) Actual combining degree in each threshold; (b) actual filter density during the experiment.

Figure 6. (a) Degree of polymerization in the module; (b) degree of deviation between modules.

Table 2. Parameter setting for PSO and SSO algorithms.

Table 4. The average running time comparison.

Table 5. Change of module number for PSO and SSO algorithms.

Table 6. Filtered protein number for PSO and SSO algorithms.

Table 7. Degree of polymerization in the module for PSO and SSO algorithms.