CoRNeA: A pipeline to decrypt the inter protein interfaces from amino acid sequence information

Motivation: Decrypting the interface residues of protein complexes provides insight into the functions of the proteins and hence the overall cellular machinery. Computational methods have been devised in the past to predict interface residues from amino acid sequence information, but these methods have been applied mostly to prokaryotic protein complexes. Since the composition and rate of evolution of the primary sequence differ between prokaryotes and eukaryotes, it is important to develop a method specifically for eukaryotic complexes.

Results: Here we report CoRNeA, a new hybrid pipeline for predicting protein-protein interaction interfaces from amino acid sequence information. It is based on the framework of co-evolution, machine learning (random forest) and network analysis, and is trained specifically on eukaryotic protein complexes. We use conservation, structural and contact potential features as the major groups of features to train the random forest classifier. We also incorporate the intra-contact information of the individual proteins to eliminate false positives from the predictions, keeping in mind that the amino acid sequence holds information for its own folding and not only for the interface propensities. Our predictions on example datasets show that CoRNeA not only enhances the prediction of true interface residues but also reduces false positive rates significantly.

The physicochemical properties of a residue can be derived from sequence information, but to derive pairwise values for these properties we employed the 20×20 residue matrices that were described to aid ab initio modelling of single proteins (Biro 2006). These [...]

To calculate the pairwise RSA values, the RSA of the independent proteins was calculated using [...]

The secondary structure of the proteins was predicted using PSIPRED (Jones 1999), and all residues were assigned numbers (i.e. 1 = α-helix, 2 = β-sheet and 3 = loop). A simple multiplication and scaling of these numbers between 0 and 1 would yield a combination in which an α-helix to α-helix instance is ranked lowest. To avoid this mis-scaling, the training dataset was inspected for the nature of residue-residue combinations in terms of [...]

[...] were calculated based on statistical analysis of protein structures. The other two approximations were derived from the MJ matrix, where a 2-body correction was applied to this matrix to generate two separate matrices (Zeng, Liu and Zheng 2012). One of them was specific for capturing the interactions between exposed residues and the other for buried residues. Thus, all three possible combinations were used to derive three contact potential (M×N) matrices for the pair of interacting proteins, namely CP: the original MJ matrix, CPE: the MJ matrix derived for exposed residues, and CPB: the MJ matrix derived for buried residues.

To include residue environment information for training the machine learning algorithm, a kernel matrix of size 5×5 was defined and convolved over the nine feature matrices described above. The convolved features were generated using OpenImageR [...]

The interface residues for the protein complexes were extracted using PISA (Krissinel and [...] (2,000,000 for 42 complexes).
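The environment step described above, convolving a 5×5 kernel over each M×N pairwise feature matrix, can be sketched as follows. This is only an illustration: the source does not state the kernel weights or the border handling, so a uniform averaging kernel and zero padding are assumed here, the function name `convolve_same` is hypothetical, and NumPy stands in for the OpenImageR package named in the text.

```python
import numpy as np

def convolve_same(feature: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """2-D convolution with zero padding, so the output keeps the M x N shape."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Assumption: borders are zero padded (the source does not specify this).
    padded = np.pad(feature, ((ph, ph), (pw, pw)), mode="constant")
    flipped = kernel[::-1, ::-1]  # a true convolution flips the kernel
    out = np.zeros(feature.shape, dtype=float)
    for i in range(feature.shape[0]):
        for j in range(feature.shape[1]):
            # Weighted sum over the 5 x 5 neighbourhood of residue pair (i, j).
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

# Assumption: uniform weights; the actual kernel values are not given in the text.
kernel = np.full((5, 5), 1.0 / 25.0)

# Toy M x N feature matrix (M = 8 residues in protein A, N = 6 in protein B).
rng = np.random.default_rng(0)
feature = rng.random((8, 6))
env_feature = convolve_same(feature, kernel)

# Sanity check on a matrix of ones: interior cells stay at 1, corners shrink.
ones_env = convolve_same(np.ones((8, 6)), kernel)
```

Each convolved cell then summarises the −2 to +2 neighbourhood of a residue pair along both sequences, which is one way to read the "residue environment" the text refers to.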
To increase the search space and take into consideration the environment of the contact-forming residues, a distance cut-off of 10 Å was used to search for possible pairs of residues flanking the −2 to +2 positions of the interface residues extracted from PISA. This yielded ten times more positive labels (5000 pairs for 42 complexes) for training the classifier.

Although increasing the search space as explained above yielded ten times more data points, the complete protein complex database was still highly imbalanced: 5000 pairs were labelled as positive out of a total of 2,000,000 pairs. To address this class imbalance problem, the majority class, i.e. the negative data labels (non-interface residue pairs), [...] iteratively (e.g. 2:1, 5:1, 10:1 and 20:1), and the best evaluation statistics were obtained when the negative sample size was five times that of the positive samples (5:1). This was used as the training set for the supervised classification model.
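The ratio sweep described above can be illustrated with a short sketch. It assumes that the elided step is random undersampling of the negative class, which the surviving text suggests but does not state; the function name `undersample_negatives` and the toy label vector are hypothetical.

```python
import random

def undersample_negatives(labels, ratio, seed=0):
    """Keep every positive and a random subset of negatives at ratio:1."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than are available.
    n_keep = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept_negatives = rng.sample(negatives, n_keep)
    return sorted(positives + kept_negatives)

# Toy labels: 10 positive pairs among 1000 residue pairs.
labels = [1] * 10 + [0] * 990

# Sweep the negative:positive ratios mentioned in the text.
for ratio in (2, 5, 10, 20):
    idx = undersample_negatives(labels, ratio)
    n_pos = sum(labels[i] for i in idx)
    print(f"{ratio}:1 -> {n_pos} positives, {len(idx) - n_pos} negatives")

# The 5:1 subset corresponds to the ratio reported as giving the best statistics.
idx_5to1 = undersample_negatives(labels, 5)
```

In practice each ratio would be used to train and evaluate a separate classifier, and the ratio with the best cross-validated statistics would be retained.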

Random Forest Classifier
The random forest classifier was first trained using a grid search to optimize the hyperparameters, selecting the model that yielded the best evaluation statistics through cross-validation.
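A grid search with cross-validation for a random forest can be sketched as below. This is a minimal illustration and not the authors' configuration: scikit-learn is used as a stand-in toolkit, and the parameter grid, the F1 scoring metric, the three-fold split and the synthetic nine-feature dataset are all assumptions, since the source does not list them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training set: 9 features, roughly 5:1 negatives to positives.
X, y = make_classification(
    n_samples=300,
    n_features=9,
    n_informative=5,
    weights=[5 / 6, 1 / 6],
    random_state=0,
)

# Assumed hyperparameter grid; the values searched in the study are not given.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
    "max_features": ["sqrt", 0.5],
}

# Every grid combination is scored by 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_

# Retrain a classifier on the full training set with the selected hyperparameters.
final_model = RandomForestClassifier(random_state=0, **best_params).fit(X, y)
print(best_params, round(search.best_score_, 3))
```

The retraining step mirrors the two-stage procedure in the text, where the hyperparameters found by the search are then used to train the final classifier.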

The hyperparameters obtained from the grid search were then used to train the classifier with [...]

[...] scores for the interface residues as opposed to other co-evolution methods ([...] to provide more confidence to the co-evolving pairs of residues and decreasing the noise by [...]

[...] proteins maintain the homeostasis of the interaction across species, hence using them as a feature, as opposed to the standard PSSM-based conservation methods (such as [...], provided better predictability.

The nature of the physicochemical properties of the residue interactions in the protein interface [...] protein. It has been reported that the interface environment is closer to that exhibited on the outside, in contact with the solvent, than to that present in the core of the protein (Jones and Thornton 1995). For example, the relative solvent accessibility of a residue defines its possible position in the protein, i.e. whether it is present in the core of the protein (relative solvent accessibility of 0) or is solvent exposed (relative solvent accessibility > 0).

Residues that lie in the PPI interface should have a value of 0 < RSA < 1 when the value is scaled between 0 and 1. Due to the lack of specific standard matrices for inter-protein residue [...]

The knowledge-based statistical potentials have also been used previously to mimic the [...] ideally lie in between those of buried and exposed residues. To assess their applicability in identifying interface residues of the interacting proteins, three approximations of these contact [...]

The contacts between two residues of the interacting proteins also depend on their neighbouring residues, which create a favourable niche for the interaction to take place. Hence the properties governing the interaction (as described above) of the neighbouring residues will also have an impact on the overall predictability of the random forest classifier. To address this, the random forest classifier was trained in two different modes, i.e. with and without environment features, the results of which are explained below.

To validate the effect of the environment features on the random forest classifier, the [...] all the evaluation statistics, the classifier predicts with better precision and recall and hence [...] predicting the contact-forming residue pairs for the interacting proteins.

One of the marked features of the random forest classifier is that it is able to decipher the importance of every feature used for training, which can be used to determine the over-fitting [...]

[Figure caption fragment] [...] Chain C of 1GCQ obtained from top 5% co-evolving intra residue pairs. (C) Inter-protein [...]

The results obtained from the network are shown on the structure of the VAV and GRB2 SH3 domains (PDB ID 1GCQ) (Figure 6A). Interestingly, the data labels provided while testing were only for Chain A and Chain C, but the labels obtained after prediction were for both pairs, i.e. Chain A and Chain C (Figure 6B) as well as Chain B and Chain C (Figure 6C [...]

[Figure caption fragment] [...] predicted by this method between Chain B (pink) and Chain C (green) within 5 Å distance. [...] residues are depicted in red.

To test the applicability of the pipeline on larger protein complexes, the structure of the alpha [...]

To assess the predictability of CoRNeA, the results obtained from it for the two test cases [...] residues of 1GCQ and 5YVT. In the case of 1GCQ, none of the predicted pairs had a prediction [...] to that predicted by CoRNeA (Figure 6B). Moreover, the final predictions from CoRNeA yielded fewer false positives than BIPSPI, hence validating the overall improvement in the accuracy of the prediction of PPI interface residues (Table S6).

CoRNeA can, however, be further optimized to reduce the false positive rates as well as [...] features and yield better and more specific results.

Predicting the pairwise interacting residues for any given pair of proteins from only the amino acid sequence still remains a challenging problem. In this study, the newly designed [...] partners is a tremendously challenging problem, especially for large multimeric complexes.

Csárdi G, Nepusz T. The igraph software package for complex network research. [...]

[...] secondary structure, backbone angles, contact numbers and solvent accessibility.