A Two-Layer SVM Ensemble-Classifier to Predict Interface Residue Pairs of Protein Trimers

Study of interface residue pairs is important for understanding the interactions between monomers inside a trimer protein–protein complex. We developed a two-layer support vector machine (SVM) ensemble-classifier that considers physicochemical and geometric properties of amino acids and the influence of surrounding amino acids. Different descriptors and different combinations may give different prediction results. We propose feature combination engineering based on correlation coefficients and F-values. The accuracy of our method is 65.38% in independent test set, indicating biological significance. Our predictions are consistent with the experimental results. It shows the effectiveness and reliability of our method to predict interface residue pairs of protein trimers.


Introduction
Many protein complexes are formed by the interactions of multiple protein monomers. These complexes can carry out many biological functions, such as gene expression and regulation, signal transduction, or enzyme catalytic mechanisms [1]. Understanding the mechanisms of protein-polymer interactions can provide useful information for the design of protein polymer structures, protein functional annotation, and drug design [2]. The accurate prediction of interface residue pairs in polymeric proteins is an important part in the study of protein polymer interactions. Various experimental methods have been used in the research of protein polymers, such as X-ray crystallography and nucleic magnetic resonance. It is impossible and unrealistic to find the interface residue pairs for all protein polymers by an experimental method. Therefore, the prediction of protein polymer interface residue pairs has become an important question in bioinformatics.
An increasing number of computational methods have been developed to predict protein monomer binding sites and protein-protein interface residue pairs. In relation to this, Segura et al. [3] used a two-step Random Forest classifier to predict protein binding sites. Hwang et al. [4] proposed an index of Residue Contact Frequency to predict the protein monomer binding sites. Lyu et al. [5] defined an index of contact frequency from the protein-protein docking result to accurately predict protein-protein interface residue pairs. Ovchinnikov predicted residue-residue interactions across protein interfaces using evolutionary information [6]. There are many other methods that are not described here [7][8][9][10][11][12][13][14][15].
A higher number of protein monomers implies more complex interaction mechanisms of the protein polymer. At present, there are a few methods to predict interface residue pairs of protein trimer. Zhao et al. [16] took the sequence feature as input in multilayered Long Short-Term Memory networks to predict interface residue pairs of protein trimer. In this paper, we want to develop a new and effective method for prediction of interface residue pairs in protein trimers.
Properties of protein sequences depend on the composition and distribution of amino acids. The position and type of each amino acid in a protein sequence are unique, containing important structural and functional information. Therefore, for a specific position amino acid in the protein sequence, ideal descriptors should not only reflect the amino acid type but also structural information or the role played in the performance of the protein function. Based on previous studies on protein monomer binding sites and protein-protein interface residue pairs, we found some common properties that can allow to distinguish interface residues from the rest of the protein, using information of hydrophobicity, polarizability, solvent accessibility, and so on. We summarized these properties and divided them into two categories. The first category corresponded to amino acid physicochemical properties, including hydrophobicity [17][18][19], polarizability [20], and polarity [21]. The second category corresponded to residue geometric properties, such as accessible surface area (ASA) [22] and relative accessible surface area (RASA) [23].
In this paper, we defined an amino acid k-interval product factor to describe the influence of surrounding amino acids based on their physicochemical properties. Hence, we described a residue pair with three types of characteristics: amino acid physicochemical features, residue geometric features, and amino acid k-interval product factor. Different descriptors and different combinations may give different prediction results. We performed feature combination engineering based on correlation coefficients and F-values for all characteristics. In general, when the number of positive and negative samples in the dataset is seriously unbalanced, the accuracy and robustness of ensemble classifiers is higher than those of a single classifier (negative samples: noninterface residue pairs and positive samples: interface residue pairs). We trained a two-layer support vector machine (SVM) ensemble-classifier method to predict the interface residue pairs of protein trimers and tested it using an independent testing set. We also use different indicators to evaluate testing set results, which proves that our method is feasible.
In summary, our method was divided into four parts: feature extraction, feature vector engineering, generation of a two-layer SVM ensemble-classifier, and performance evaluation, as shown in Figure 1.

Dataset
In this paper, the dataset was collected from the Protein Data Bank based on following four requirements: the number of chains is 3, the length of each chain is between 20 and 500, it is obtained by X-ray experiment, and there are physical bindings between each two chains in one protein trimer. Two chains are defined as interactors if there are interface residue pairs between the two chains. (If the contact area between any two atoms from two residues is bigger than zero, we called these two residues in contact and these two residues are called an interface residue pair. Here, we used the Qcontacts software to calculate contact area between two atoms.) By this way, we collect 78 protein trimers (The data can be downloaded from Supplementary Materials). We randomly divided 78

Dataset
In this paper, the dataset was collected from the Protein Data Bank based on following four requirements: the number of chains is 3, the length of each chain is between 20 and 500, it is obtained by X-ray experiment, and there are physical bindings between each two chains in one protein trimer. Two chains are defined as interactors if there are interface residue pairs between the two chains. (If the contact area between any two atoms from two residues is bigger than zero, we called these two residues in contact and these two residues are called an interface residue pair. Here, we used the Qcontacts software to calculate contact area between two atoms.) By this way, we collect 78 protein trimers (The data can be downloaded from Supplementary Materials). We randomly divided 78 protein trimers into training set and testing set, of which the number of training set is 52, accounting for 2/3 of the total, and the remaining 26 protein trimers are used as testing set (see Table 1).
Different amino acids in Formula (1) have different physicochemical properties. These physicochemical properties, including hydrophobicity [17][18][19] and polarizability [20], play important roles in multiple protein monomer interactions. In this paper, we considered five physicochemical properties of amino acids: hydrophobicity [17][18][19], polarizability [20], polarity [21], secondary structure, and codon diversity [24]. Three versions of hydrophobicity were proposed by Tanford Charles, Jack Kyte, and David Eisenberg in 1962, and 1984, respectively. Other previous works, [24,25] have used the hydrophobicity value of Tanford [17] as a feature to identify protein-protein interactions. Some other authors [15,26] used the last two versions of the hydrophobicity value [18,19] as features to predict residue pairs in a protein-protein interface. Their prediction results were good, so we took three versions of the hydrophobicity index in our method. Appendix A Table A1 shows the numerical values of the five physicochemical properties for the 20 amino acids.
According to the corresponding properties of each amino acid, the protein sequence P can be converted into seven different number sequences (see Formula (2)). We used Φ 1 , Φ 2 , Φ 3 , Φ 4 , Φ 5 , Φ 6 , and Φ 7 to represent the seven numerical sequences. These seven numerical sequences are the hydrophobicity number sequence 1, polarizability number sequence, polarity number sequence, secondary structure number sequence, codon diversity number sequence, hydrophobicity number sequence 2, and hydrophobicity number sequence 3.
where Φ 1 1 is the hydrophobicity value of P 1 in formula 1, Φ 1 2 is the hydrophobicity value of P 2 in formula 1, and so on. Φ 2 1 is the polarizability value of P 1 in formula 1, Φ 1 2 is the polarizability value of P 2 in formula 1, and so on. In multiple protein monomers interactions, the individual behavior of the amino acid at each position is affected by the neighboring amino acids in the protein sequence. We define the amino acid k-interval product factor to describe the influence of neighboring amino acids on a given residue.
The AAIPF(k) is defined as follows: the numbers at two positions with interval k are multiplied and divided by k on the amino acid number sequence. The AAIPF(k) can be divided into: amino acid forward k-interval product factor (AAFIPF(k)) and amino acid backwards k-interval product factor (AABIPF(k)) (see Formulas (3)-(5)).
When exploring the individual behavior of each amino acid in a protein sequence P, as previously reported [27], we regard the protein sequence P as a cycle alphabet sequence with head-to-tail connections, and thus number sequences can also be regarded as cycle number sequences.
Considering the dimensionality of descriptors, and using the experience of previous works [28,29], we only used AAIPF(1), AAIPF(2), AAIPF(3), AAIPF(4), and AAIPF(5) to characterize each amino acid in the protein sequence P. In this way, each amino acid in the protein sequence P could be represented by the 10-dimensional characteristics of each numerical sequence. To reduce the redundancy between features, we only use the first five numerical sequences in formula 2 to calculate AAIPF(k). Thus, each amino acid in protein sequence P could be characterized by 50 features.
We also used as features the values of five physicochemical properties of amino acids, which we called basic first-order sequence features. Three versions of the hydrophobic values were used. Therefore, seven basic first-order sequence features were applied to describe an amino acid.
Considering the electrostatic interaction is also one of the important factors to stabilize the protein structure. Therefore, we used the electric property values (pK 1 and pK 2 ) as features to describe the residue. The values of pK 1 and pK 2 were calculated by the propka3.1 software [30].

Residue Geometric Features
In several previous research studies [31][32][33], it has been found that accessible surface area (ASA) and relative solvent accessible surface area (RASA) play important roles in distinguishing between interface residues and noninterface residues. In addition to the above two geometric features, we also use three residue geometric features, exterior contact area (ECA), interior contact area (ICA), and exterior void area (EVA) extracted by our laboratory to describe each residue. These five geometric features were considered the basic structural geometric features. These five geometric features and their calculation tools are shown in Table 2. ASA is the surface area of molecules that is accessible to solvents. Here, we used the Naccess V2.1.1 software [34] to calculate ASA. The RASA was used to describe the exposed or buried state of Molecules 2020, 25, 4353 5 of 16 the residue and was calculated by formula 6. The specific definition of ECA, ICA, and EVA is given in detail elsewhere [26]. ECA and ICA were calculated by the Qcontacts software [35].
where ASA bound represents the solvent accessible surface area of the residue in the protein complex. ASA unbound is the solvent-accessible surface area of this residue in unbound state. Through the above research and analysis, we extracted a total of 64 dimensional characteristics to describe each residue in the protein monomer. Therefore, we can use 128 dimensional characteristics to describe a residue pair formed by residues from two protein monomers. We used Formula (7) to standardize these 128 dimensional characteristics.

Feature Vector Engineering
Different characteristics and their combinations may play different roles in the prediction of interactions interface residues pairs in multiple protein monomers. Therefore, we performed feature vector engineering using two sets of feature vectors to describe a residue pair. We used the 128 dimensional characteristics extracted as the first set of feature vectors. Considering that some of the 128 dimensional characteristics have a strong correlation, we deleted 1/4 of the characteristics, and used the remaining characteristics as the second set of feature vectors. We filtered the characteristics according to the Pearson correlation coefficients r and F-values. The specific process was as follows: In the first step, 128 dimensional characteristics were clustered according to the Pearson correlation coefficient r. The formula of the Pearson correlation r is as follows: where X k and X l represent the k-th and l-th characteristics of the sample, respectively. σ k and σ l represent the mean square deviation of the k-th and l-th characteristics of the sample (sample: residue pairs), respectively. In the second step, we used Formula (9) to calculate the F-value of each characteristic between the positive sample and the negative sample (negative sample: noninterface residue pairs and positive sample: interface residue pairs). The larger the F-value of the characteristic, the greater the difference between positive and negative samples and the greater the contribution of the characteristic to distinguish positive and negative samples.
where µ + m and µ − m are the mean values of the m-th characteristic of the positive sample and the negative sample, respectively. σ + m and σ − m are the mean square deviation of the m-th characteristic of the positive sample and the negative sample, respectively.
We preserved basic first-order sequence characteristics and basic structural geometric characteristics. For the class with Pearson correlation coefficient |r| > 0.5, we preserved the characteristics of a relatively large F-value, with 96 characteristics in total. We used the 96 characteristics as the second set of feature vectors (see Appendix A Table A2).

Our Algorithms (A Two-Layer SVM Ensemble-Classifier)
Each protein trimer consists of three chains, and any two chains interact to form a protein-protein interaction interface. We split each protein trimer into three protein-protein interactions. Through Section 2.2 Feature Extraction and Section 2.3 Feature Vector Engineering, two sets of feature vectors were used to represent a residue pair. Therefore, we generated two sets of train data for the training trimer protein complexes. The train data composed of the first (second) set of feature vector was called Train data 1 (2).
The proportion of protein-protein interface residue pairs in all residue pairs was very low. Therefore, the positive and negative classes in train data 1 (2) were extremely imbalanced (negative classes: noninterface residue pairs and positive class: interface residue pairs). We used under-sampling to deal with the class imbalance problem. To improve the performance of the method, we used the classifier ensemble to alleviate the lack of information caused by the under-sampling. The process was as follows: Generation of balanced subset samples (for training set j (j = 1,2)): First, we randomly generated 100 subset samples of negative class from all the negative class samples in train data j. The total number of positive class was 12,687 in the entire training set. Therefore, we set the number of negative class in each subset sample to 12,687. Second, we combined each negative class subset sample with all positive class samples to generate a balanced subset sample. Finally, obtained 100 balanced subset samples for train data j (j = 1,2).
Generation of an ensemble classifier: SVM is a supervised machine learning method, which is widely used in the field of protein-protein interactions. Here, we also used SVM to predict trimer protein complexes interface residue pairs.
There are 100 balance subset samples in train data j (j = 1,2), each of which can be used to train an SVM model. We obtained 100 individual SVM predictors for train data j. We then developed an ensemble SVM classifier P j by fusing the 100 individual SVM predictors in train data j through a probability system, as shown in formula 10. Finally, a two-layer SVM ensemble-classifier P was formed by fusing two ensemble SVM classifiers P 1 and P 2 through a weight ω (here, we set ω to 1/2). We provide a flowchart in Figure 2 to illustrate how to generate a two-layer SVM ensemble-classifier P (An implementation of our model is available at the website ftp://202.112.126.135/pub/Trimer/code).
where SVM i indicates the SVM predictor trained with an i-balance subset sample. x indicates a residue pair. SVM i (x) indicates the probability that the residue pair x is the interface residue pair in the i-th individual SVM prediction of train data j.

Evaluation Criteria
The output of our model is a value between 0 and 1, showing the possibility of the residue pair to be an interface residue pair. The values were sorted from large to small. The number t predicted interface residue pairs with highest probability were used as the top t predicted interface residue pairs.
We used the following three measures to evaluate the performance of our method. First, we defined a three-dimensional vector NPRPT(t) = (n 1 , n 2 , n 3 ) t , where n z (z = 1,2,3) represents the number of positive interface residue pairs in the top t predictions for each possible protein-protein interface of a protein trimer. Here, NPRPT(t) is the abbreviation of the number of positive interface residue pairs in the top t predictions for each protein trimer. The first index is NPRPT(t) 0 , which represents the L0 norm of NPRPT(t). It is consistent with the meaning of the L0 norm of vector in mathematics, which represents the number of nonzero elements in a vector. So NPRPT(t) 0 represents the number of interfaces that we can correctly predict in each protein trimer. In the top t predictions provided, if there is at least one positive interface residue pair, we assumed that the protein-protein interaction interface could be predicted correctly.
where indicates the SVM predictor trained with an i-balance subset sample. indicates a residue pair.
( ) indicates the probability that the residue pair is the interface residue pair in the i-th individual SVM prediction of train data j.

Evaluation Criteria
The output of our model is a value between 0 and 1, showing the possibility of the residue pair to be an interface residue pair. The values were sorted from large to small. The number t predicted interface residue pairs with highest probability were used as the top t predicted interface residue pairs. The second index is NPRPT(t) 1 , which represents the L1 norm of NPRPT(t), see Formula (12). It is consistent with the meaning of the L1 norm of vector in mathematics. Therefore, NPRPT(t) 1 represents the number of positive interface residue pairs in the top t predictions at a protein trimer The third index is index accuracy rate (see Formula (13)).
SCT represents the sum of all correctly predicted trimer protein complexes. In the top t predictions, if there were z protein-protein interaction interfaces satisfying at least one positive interface residue pair, we assumed that the protein trimer was predicted correctly. TNT represents the total number of trimer protein complexes in the dataset.

Application of our Algorithms on the Testing Set
There were 26 trimer protein complexes in the testing set. Each protein trimer consists of three chains, and any two chains interact to form a protein-protein interaction interface. Therefore, we obtained 78 protein-protein interaction interfaces. Through Section 2.2 Feature Extraction and Section 2.3 Feature Vector Engineering, two sets of feature vectors can be used to represent a residue pair. Therefore, we can generate test data 1 and test data 2 for the testing trimer protein complexes. Then by inputting test data 1 and test data 2 to our proposed Algorithms, we can get the prediction results of the 78 protein-protein interaction interfaces. Table 3 shows the top t (t = 10, 15, 20, and 30) predictions and the two evaluation indexes corresponding to the testing set results. It can be seen from Table 3 that when 3 protein-protein interaction interfaces of each protein trimer were correctly predicted, a total of 9 trimer protein complexes were correctly predicted in the top 10 predictions. The prediction result of 2IY0 protein trimer was the best, with up to 10 positive interface residue pairs. When 3 protein-protein interaction interfaces of each protein trimer are correctly predicted, a total of 17 trimer protein complexes were correctly predicted in the top 30 prediction results. Among them, there were 10 trimer protein complexes for which at least 10 positive interface residue pairs were predicted. Table 3. Two evaluation indexes of the testing set prediction results.  2  2  2  2  2  3  2  5  1w9z  3  3  3  3  3  3  3  7  1wdj  2  2  2  4  2  6  2  8  1ynb  2  2  2  3  2  4  2  4  1za7  1  4  2  6  3  9  3  12  2ig8  1  1  1  2  2  3  3  5  2ium  1  2  3  4  3  6  3  9  2iy0  3  10  3  14  3  16  3  19  2izw  1  2  2  3  2  4  3  9  2ms2  3  5  3  8  3  8  3  10  2r3u  0  0  0  0  1  1  2  2  2wr5  0  0  0  0  0  0  0  0  3dli  3  3  3  3  3  3  3  6  3ffd  3  4  3  4  3  8  3  10  3m6n  0  0  0  0  0  0  1  1  3owt  1  3  2  4  2  5  2  8  3p5j  0  0  2  2  2  3  3  4  3qks  0  0  1  1  2  2  3  6 To further analyze the prediction results, we obtained the index accuracy rate (t) z from all NPRPT 0 columns in Table 3 (see Table 4). When 3 protein-protein interaction interfaces of each protein trimer were correctly predicted in the top 15 predictions, the accuracy rate was 42.31%, i.e., more than 2/5 of trimer protein complexes in the testing set were correctly predicted. When 3 protein-protein interaction interfaces of each protein trimer were correctly predicted in the top 30 predictions, the accuracy rate was as high as 65.38%. When at least 2 protein-protein interaction interfaces of each protein trimer are correctly predicted in the top 10 predictions, the accuracy rate was 53.85%, i.e., more than half trimer protein complexes in the testing set were correctly predicted. When at least 1 protein-protein interaction interface of each protein trimer was correctly predicted, the accuracy rate was 76.92% in the top 10 predictions and up to 92.31% in the top 30 predictions. There are 6479 pairs of interface residue pairs in the test set, of which 968 pairs are formed by residues at N-and C-terminal regions, accounting for 15% (here, residues at the N-and C-terminal regions is the residues that we have specially treated in the manuscript). We can accurately predict 190

Analysis of the Testing Set Results
Molecules 2020, 25, 4353 9 of 16 interface residue pairs for all testing set protein trimers, of which 48 pairs are formed by residues at N-terminal and C-terminal, accounting for 25%.
We compared the performance of our method with the previous method [16]. When at least 1 protein-protein interaction interface of each protein trimer was correctly predicted, the accuracy of our method is 76.92% and of the previous method [16] is 31.1% in the top 10 predictions. The accuracy of our method is higher than them.
The analysis of the above results showed that our proposed method was able to accurately predict the interface residue pairs of trimer protein complexes. Additionally, our predicted results are consistent with the experimental results. In the experimental article of the 3ffd protein trimer [36], it is mentioned that residues 20 and 24 are strictly conserved, which allows for extensive interactions with the antibody. Residues 16,20,27,52,59,97,102 The training set contains a lot of antibody fragments, which make up two of the three chains: 1BGX, 3O2D, 3R1G. 3GI9, 1JPS, 1JRH, 1FNS. Similarly, 1F6F, 1EER, 1HWG, and 3VA2 are all cytokine receptor complexes with probable similarity between the receptor CRH domains. We deleted 1BGX, 3O2D, 3R1G. 3GI9, 1JPS, 1JRH, 1FNS 1F6F, 1EER, 1HWG, 3VA2 in the training set. The test set also contains 3 complexes with antibody chains: 3FFD, 1OSP, and 1SY6. We generated testing set 2, which deleted 3FFD, 1OSP, and 1SY6 relative to testing set. Appendix Table A3 shows the top t (t = 15, 20, and 30) predictions and the two evaluation indexes corresponding to the testing set 2 results. We compared the prediction results of testing set with that of testing set 2 (Table 5). When at least 2 protein-protein interaction interfaces of each protein trimer are correctly predicted, the accuracy of testing set 2 is about 7% lower than that of testing set. When at least 3 protein-protein interaction interfaces of each protein trimer are correctly predicted in the top 30 predictions, the accuracy of testing set 2 is about 8.5% lower than that of testing set. When at least 3 protein-protein interaction interfaces of each protein trimer are correctly predicted in the top 20 predictions, the accuracy of testing set 2 was 6% higher than that of test set. The rest of the prediction results of the two test sets are almost the same. Table 5. Accuracy rate ( ) of the testing set and testing set 2 prediction results. The training set contains a lot of antibody fragments, which make up two of the three chains: 1BGX, 3O2D, 3R1G. 3GI9, 1JPS, 1JRH, 1FNS. Similarly, 1F6F, 1EER, 1HWG, and 3VA2 are all cytokine receptor complexes with probable similarity between the receptor CRH domains. We deleted 1BGX, 3O2D, 3R1G. 3GI9, 1JPS, 1JRH, 1FNS 1F6F, 1EER, 1HWG, 3VA2 in the training set. The test set also contains 3 complexes with antibody chains: 3FFD, 1OSP, and 1SY6. We generated testing set 2, which deleted 3FFD, 1OSP, and 1SY6 relative to testing set. Appendix A Table A3 shows the top t (t = 15, 20, and 30) predictions and the two evaluation indexes corresponding to the testing set 2 results. We compared the prediction results of testing set with that of testing set 2 (Table 5). When at least 2 protein-protein interaction interfaces of each protein trimer are correctly predicted, the accuracy of testing set 2 is about 7% lower than that of testing set. When at least 3 protein-protein interaction interfaces of each protein trimer are correctly predicted in the top 30 predictions, the accuracy of testing set 2 is about 8.5% lower than that of testing set. When at least 3 protein-protein interaction interfaces of each protein trimer are correctly predicted in the top 20 predictions, the accuracy of testing set 2 was 6% higher than that of test set. The rest of the prediction results of the two test sets are almost the same.

Comparison with Random Results
We assume that the stochastic prediction of interface residue pairs of each protein-protein interaction interface in trimer protein complexes obeys a hypergeometric distribution X ∼ H(N, M, T) [38]; where X is the number of positive interface residue pairs in the top T predictions. N is the number of all the residue pairs of one protein-protein interaction interface in one protein trimer. M is the number of positive interface residue pairs in this protein-protein interaction interface. Next, we can calculate the probability P that there are x positive interface residue pairs in the T predictions of one protein-protein interaction interface by the stochastic model (see Formula (14)): In order to simplify the calculation, we assumed that each protein-protein interaction interface was independently identically distributed, and N is the mean value of all residue pairs in each protein-protein interaction interface, and M is the mean value of positive interface residue pairs in each protein-protein interaction interface. It can be seen that N is about 40,920 and M is about 83 in the Appendix A Table A4 When at least 1 protein-protein interaction interface of each protein trimer has at least one positive interface residue pair in T predictions, the probability P 1 is: Consideration of the complexity of the P 1 calculation, we have made an enlarged calculation of P 1 (see inequality 16 and 17). Obviously, the computational complexity of theP 1 is less than P 1 , and when T is fixed, P 1 is less thanP 1 . When the value of T is 10, 15, 20, and 30, we can calculateP 1 through the Monte Carlo simulation method (see Table 6).
When at least 2 protein-protein interaction interfaces of each protein trimer have at least one positive interface residue pair in T predictions, the probability P 2 is: Combining Formulas (15) and (18), we also enlarge P 2 and obtained inequality 20. Obviously, the computational complexity of theP 2 is much less than P 2 , and when T is fixed, P 2 is less thanP 2 . When the value of T is 10, 15, 20, and 30, we can calculateP 2 through the Monte Carlo simulation method (see Table 6).
As can be seen from Table 5, the accuracy of our method to predict the interface residue pairs of trimer protein complexes is much higher than that of random results. When at least 1 protein-protein interaction interface of each protein trimer was correctly predicted, our method accuracy was over 76.92% and up to 92.31%, while the random accuracy was lower than 0.20298%. When at least 2 protein-protein interaction interfaces of each protein trimer were correctly predicted, our accuracy was over 53.85% and up to 84.62%, whereas the random accuracy was below 0.0015%. When 3 protein-protein interaction interfaces of each protein trimer were correctly predicted, our accuracy achieved 65.38% in the top 30 predictions, and the accuracy was more than 108 times higher than that of random results.

Conclusions
In this paper, we defined an amino acid k-interval product factor to describe the influence of neighboring amino acids on a residue. This method takes advantage of the physicochemical and geometric properties of amino acids, and also considers the influence of neighboring amino acids (amino acid k-interval product factor) as features. Finally, we developed a two-layer SVM ensemble-classifier method, based on feature vector engineering and SVM, to predict the interface residue pairs of trimer protein complexes. In our testing set, the accuracy rate of successfully predicting one interface was 84.62%, and for two interfaces, the accuracy rate was 73.08%, in the top 15 predictions, which indicates significance for biological experimentation and biomedical-related research. Moreover, our predicted results are consistent with the experimental results. This shows that our method is effective and reliable to predict interface residue pairs of trimer protein complexes. However, our accuracy rate was not high when three interfaces of one trimer are required to predict correctly. We also did not consider protein conformational changes. These are the areas where we will improve in the future.