Protein Structure Validation and Identification from Unassigned Residual Dipolar Coupling Data Using 2D-PDPA

More than 90% of protein structures submitted to the PDB each year are homologous to some previously characterized protein structure. The extensive resources that are required for structural characterization of proteins can be justified for the 10% of the novel structures, but not for the remaining 90%. This report presents the 2D-PDPA method, which utilizes unassigned residual dipolar coupling in order to address the economics of structure determination of routine proteins by reducing the data acquisition and processing time. 2D-PDPA has been demonstrated to successfully identify the correct structure of an array of proteins that range from 46 to 445 residues in size from a library of 619 decoy structures by using unassigned simulated RDC data. When using experimental data, 2D-PDPA successfully identified the correct NMR structures from the same library of decoy structures. In addition, the most homologous X-ray structure was also identified as the second best structural candidate. Finally, success of 2D-PDPA in identifying and evaluating the most appropriate structure from a set of computationally predicted structures in the case of a previously uncharacterized protein Pf2048.1 has been demonstrated. This protein exhibits less than 20% sequence identity to any protein with known structure and therefore presents a compelling and practical application of our proposed work.


Introduction
Considering the evolutionary mechanisms responsible for the generation of new structures in proteins, it has been speculated that there may be a limited number of unique protein folds, allowing clustering of structures into as few as ten thousand families [1]. This relatively small number has enabled a reformulation of the protein folding problem as a classification problem [2,3]. While it is clear that the folding of at least larger proteins from physical forces (force-field based folding) remains intractable and therefore outside of our computational abilities, the alternative approach of homology-based modeling (such as threading [4][5][6][7]) can easily fall within our computational reach. However, protein modeling by threading techniques is highly dependent on the availability of a comprehensive library of protein folds. Therefore, several international efforts are underway to complete a library of folds [8][9][10]. The structures deposited to the Protein Data Bank (PDB, www.rcsb.org) [11] have been very useful to classification methods and also to fragment-based methods that rely on novel structures of shorter sequences within the PDB [7][12][13][14].
Despite the high rate with which new structures are deposited to the PDB, an analysis of these structures reveals a significant attenuation in the rate of discovery of new folds. Currently the PDB consists of over 83,000 structures of proteins representing approximately 1,400 fold families or topologies based on data published by the RCSB databank using either SCOP [15] or CATH [16] classifications. While the total number of protein structures submitted to the PDB exhibits a very productive and healthy growth [shown in Figure 1(a)], the number of newly discovered novel protein folds has been very limited over the past few years [shown in Figure 1(b)]. The high cost and the required resources can be justified for the characterization of novel protein structures; however the question remains whether more efficient ways of studying common and ordinary proteins can be established. The main contributing factor to the slow rate of discovering novel folds is the selection of target proteins solely based on sequence-homology analysis. Although this method will optimize the coverage of current protein sequence space (the space of all unique protein sequences), it may not be the optimal method of covering protein structure space. The PDB database is already populated with examples of nearly identical protein structures with dissimilar sequences. For instance, deployment of structure alignment techniques such as TALI [17] or msTALI [18] has identified groups {1LXA, 1QRE, 1TDT} or {1A17, 1E1W, 1ELR, 1E96, 1FCH, 1IHG} with sequence similarities of less than 17% and structural similarity of 2.3 Å respectively within each group. The noted inefficiency in discovery rate for novel folds by direct structure determination has motivated development of rapid and cost-effective approaches to structure determination including computational modeling of protein structures.
On other fronts, computational modeling approaches have advanced considerably in the last decade. Introduction of ab initio modeling techniques such as ROSETTA [13] and I-TASSER [5,7] during CASP VIII demonstrated the possibility of structure modeling in the absence of extended regions of sequence identity to any existing structure. Although these tools have made many significant advances, the community of structural biologists remains reluctant to rely completely on computationally modeled structures. This is partially due to the fact that modeling techniques often produce an ensemble of structures, with potentially as much as 10 Å of structural diversity observed over the ensemble. This high degree of inconsistency complicates interpretation of the resultant structures. The combination of experimental data and computational modeling tools in programs such as CS-ROSETTA [19] or RDC-ROSETTA [20] has demonstrated significant improvement in eliminating some of this ambiguity. However, these tools utilize assigned NMR data, which significantly increases the data acquisition requirements of NMR spectroscopy and therefore diminishes the incentive to use computational tools. It is important to note that the most time-consuming and expensive portion of NMR data acquisition is related to resonance assignment. Therefore, development of methods that utilize unassigned NMR data will restore the original motivation for using computational modeling tools. Here we present two-Dimensional Probability Density Profile Analysis (2D-PDPA), a significant improvement over a previously published method of validating or identifying the structure of an unknown protein by using unassigned Residual Dipolar Coupling (RDC) data [21,22]. RDC data have been shown to be a very rich source of information about the structure and dynamics of proteins that can be acquired quickly on samples with more limited isotopic labeling.
RDCs have been used in studies of carbohydrates [23][24][25], nucleic acids [26][27][28][29] and proteins [30][31][32][33][34]. The use of RDCs as the main source of structural information has led to a significant reduction in data collection and analysis, while providing the possibility of resonance assignment [35][36][37][38] and identification of dynamical regions [39][40][41]. Assigned RDC data have also been utilized in a number of instances for identification of homologous structures [32,42,43]. Another category of investigations focuses on development of simultaneous assignment and structure determination from RDC data [44,45]. While these methods help in extending the frontiers of science, they do not serve as an appropriate screening tool because they either rely on enormous amounts of RDC data acquired in multiple alignment media, or assist in assignment of RDCs to an a priori known structure. Finally, from the practical standpoint, acquisition of RDC data imposes the additional requirement for successful preparation of alignment media. This issue is continually mitigated through introduction of new alignment media [46]. The large-scale applicability of RDC acquisition has been established by the Structural Genomics centers (such as NESG http://spine.nesg.org/rdc.cgi) [47], where a large fraction of their target NMR proteins (if not all) have been subjected to RDC data acquisition.
Relinquishing the need for assignment of NMR data significantly reduces the financial and temporal cost of data acquisition. Identifying a homologous structure for an unknown protein using 2D-PDPA should be of direct interest to structural biologists and pharmaceutical researchers, since they operate under the same general constraints as the structural genomic centers, which consist of reducing the cost of operation and increasing productivity. Rapid and cost effective methods of identifying protein structures, which are truly novel, could also serve to increase the general efficiency of structure determination. 2D-PDPA can also provide an optimal method of validating computationally obtained structures using a minimal set of empirical data [33,41]. This can be of benefit to pharmaceutical endeavors where researchers are often interested in validating the solution structure of a protein in the presence of a ligand relative to that of a protein, based on a structure obtained by X-ray crystallography. There are reported instances of significant structural differences between X-ray and NMR structures of the same protein (having more than 99% sequence identity). For example, structures 1HNG [48] (X-ray structure) and 1A64 [49] (NMR structure) are identical sequences but exhibit 20.9 Å of structural difference as measured over their backbone atoms. The 2D-PDPA source code has been developed in the C++ Object Oriented Programming paradigm and can be downloaded from http://ifestos.cse.sc.edu/ [50]. The current version of this program is capable of deploying on either a typical desktop environment, or Linux clusters equipped with the qsub scheduling protocol.
The overall approach to validation and evaluation of the presented work consisted of three distinct tiers of experimentation with gradually increasing complexity and practical applicability. The first tier consisted of application of 2D-PDPA to a collection of proteins spanning a spectrum of sizes and structural attributes (α, α/β, or β) for which synthetic RDC data were computed. Results of this tier are used to establish a theoretical basis of the investigated mechanism under controlled conditions. The second tier consisted of application of our method to proteins with experimental RDC data. Results of this tier are used to validate the practical applicability of the proposed method. The final tier of our experiments focused on application of the proposed method to a novel protein (Pf2048.1) for which only modeled structures were available. Fitness of each computed model was determined by 2D-PDPA.

Results and Discussion
Our strategy in establishing the effectiveness of the presented work is to subject it to increasingly challenging test cases. In the following sections we first present the performance of 2D-PDPA on test cases with simulated RDC data. The simulated cases allow for study of a method's performance under carefully controlled conditions. It is important to note that in our studies simulated data are not error-free; they are produced to replicate noisy experimental data as closely as possible. Following the simulated data, we present results for test cases based on experimental data that reflect practical conditions. In this category, our results first focus on instances of proteins for which both NMR and X-ray structures are known. Finally, we present results of 2D-PDPA in ranking computationally modeled structures for a target protein with no known structure.

Structure Identification from Simulated RDC Data
2D-PDPA was validated using synthetic data generated from eleven different protein structures (listed in Table 1) to represent a spectrum of sizes and structure types. Data from each protein structure was used to identify the correct structure from a library of 619 decoy representative structures. In each test case, the decoy structures that were not within ± 20% size of the target structure were eliminated from the pool of potential candidates. This filtering mechanism reduced the list of possible structural candidates to within 100 for proteins with less than 120 residues in length, and around 20 for larger proteins (more than 250 residues). The identification results of 2D-PDPA on the eleven randomly selected test proteins are shown in Table 1. The first column of this table lists the PDB-ID of each protein, followed by the protein size (based on number of N-H vectors), the magnitude of the uniformly added noise, and the ranking of each protein by 2D-PDPA.
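The size-based pre-filter described above can be illustrated with a minimal sketch. The function name, the dictionary layout, and the decoy identifiers below are hypothetical and are not taken from the 2D-PDPA source code; only the ±20% window comes from the text.

```python
def filter_by_size(target_size, library_sizes, tolerance=0.20):
    """Keep decoys whose size lies within +/- tolerance of the target size.

    library_sizes maps a structure identifier to its size (for example the
    number of N-H vectors); the 20% default mirrors the window in the text.
    """
    low = target_size * (1.0 - tolerance)
    high = target_size * (1.0 + tolerance)
    return [pdb_id for pdb_id, size in library_sizes.items()
            if low <= size <= high]


# Hypothetical library entries (identifier -> size), for illustration only.
library = {"decoyA": 50, "decoyB": 61, "decoyC": 90, "decoyD": 76}
candidates = filter_by_size(76, library)
print(candidates)
```

For a 76-residue target the accepted window is 60.8 to 91.2, so only the 50-residue entry is discarded before any RDC analysis is attempted.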

Structure Identification Using Experimental RDC Data
A search through the BMRB [51,52] database resulted in three proteins with backbone RDC data from two or more alignment media. These three proteins consisted of 1P7E [53], 1D3Z [54] and 1RWD [44] with backbone N-H RDC data from two alignment media. Structural homologues (both NMR and X-ray when possible) were added to our existing database of 619 decoy structures to examine 2D-PDPA's ability to identify the actual or any homologous structures. Table 2 shows the results for the protein structure 1P7E. The structure 1P7E was identified as the most plausible structure by 2D-PDPA, as expected. Of even more interest, however, are the 2nd and 3rd place rankings, which consisted of 1IGD and 1P7F respectively; these are the structural homologues that were added to the library. The structures 1P7E, 1IGD and 1P7F exhibit around 1.0 Å of difference measured over the backbone atoms as shown in Figure 2. These results exhibit 2D-PDPA's ability to identify not only the identical structure from a library of decoys, but also other homologous structures. Of even more importance is the fact that this experiment was performed with a relatively small amount (43 RDCs from 55 residues, 78%) of experimental data.
For 1RWD (results shown in Table 3), its X-ray determined homologue 1BRF (bb-rmsd of 1.8 Å with respect to 1RWD as shown in Figure 3) ranked first. The 1RWD structure ranked second behind 1BRF. At first it may seem odd that the X-ray structure outranked the NMR structure. However, although 2D-PDPA ranks 1BRF as the better suited structure, the ranking score of 1BRF is only negligibly better than that of 1RWD. Furthermore, it is generally accepted that X-ray structures fit RDC data better than NMR structures. This experiment once again demonstrates 2D-PDPA's success in finding structural homologues within a large library of possible structures.
Finally, the results of structure identification for the protein 1D3Z are shown in Table 4.
The NMR structure 1D3Z is ranked first, followed by its X-ray structural homologue 1UBQ (bb-rmsd of 0.5-1.5 Å with respect to the 1D3Z NMR ensemble of structures, with a sample shown in Figure 4). Of additional interest is the third ranked structure 1SF0. At first glance this protein exhibits no recognizable sequence (shown in Figure 5) or structural homology with respect to 1D3Z (Figure 6(a,b)) despite the high score that is produced by 2D-PDPA. This high ranking of a seemingly unrelated protein prompted further investigation. The lack of sequence similarity is of lesser concern since there are noted instances of structural similarity in the absence of sequence similarity [17,18]. The structural similarity between 1D3Z, 1UBQ and 1SF0 was ascertained by the program msTALI [18]. Results of the multiple structure alignment conducted by msTALI are shown in Figure 5. msTALI provides structural alignments that are reported in a manner similar to sequence alignments, with the difference that the alignment is based on structural similarity. Results of the structural alignment (shown in Figure 5) clearly indicate structural similarity over a large fraction of the three proteins with only small regions of dissimilarity (indicated as gaps). The three structures exhibit 2.81 Å of similarity measured over the backbone atoms (as shown in Figure 6(c)), which indicates significant structural similarity.

Figure 5. msTALI multiple structure alignment of 1D3Z, 1SF0 and 1UBQ. Final core has 66 residues, a score of 6.13535, and an RMSD of 2.81569.

1D3Z.pdb  mqifvktltgktit-levepsdtienvkakiqdkegippdqqrlifagkqledgrtlsdy
1SF0.pdb  kmikvkvigrniekeiewregmkvrdilravg----fntesaiakvngkvvleddevk--
1UBQ.pdb  mqifvktltgktit-levepsdtienvkakiqdkegippdqqrlifagkqledgrtlsdy
Core      ************** *****************    **********************

1D3Z.pdb  niqkestlhlvlrlrgg
1SF0.pdb  -dg--dfvevipvvsgg
1UBQ.pdb  niqkestlhlvlrlrgg
Core       **  ********* **

Computationally Modeled Structures of PF2048.1
An ensemble consisting of ten modeled structures from ROBETTA [13,22] and five modeled structures from I-TASSER [5] for the unknown protein PF2048.1 was obtained (superimposed structures shown in Figure 7). Table 5 lists the results of an exhaustive pairwise comparison of the ensemble of fifteen structures measured over the backbone atomic positions. In this table, structures R1-R10 and I1-I5 correspond to the ROBETTA and I-TASSER structures respectively. The areas of this table that are shaded in green or yellow correspond to the intra-modeling distances, while the dark-blue areas correspond to the inter-modeling distances. Based on these results, structures modeled by ROBETTA exhibit structural similarity in the range of 2.91 Å-7.83 Å, while structures modeled by I-TASSER exhibit more convergence, with structural similarity in the range of 1.21 Å-3.62 Å. It is clear from this exercise that both methods have been successful in producing a reasonable model of the structure, since all of the models consist of a bundle of four helices. It is also clear that in the absence of a priori knowledge of the protein's structure, selection of the most suitable structure would not have been possible. Due to the general lack of convergence in the modeled structures, arbitrary selection of a model could lead to an erroneous structure.

2D-PDPA Ranking of the Modeled Structures
2D-PDPA was applied to the ensemble of ten modeled structures of T12 by ROBETTA and five models by I-TASSER. Due to experimental conditions only 49 RDC data points were obtained from this protein in two alignment media. Considering the size of the PF2048.1 protein (79 residues), 49 RDC data points constitute only 62% of the complete data set (38% missing data). The relative order tensors describing the alignment of this protein in each of the media were determined using the previously reported 2D-RDC [55] method (λ-map shown in Figure 15) and are listed in Table 12. Results of the 2D-PDPA ranking of ROBETTA and I-TASSER structures are shown in Table 6 and Table 7 respectively. The three columns in these tables list the structural identifiers, 2D-PDPA's raw score for each structure, and the corrected scores respectively. The corrected scores account for the contribution of the percentage of missing data to the raw score (discussed in Section 0) and are computed as shown in Equation (2). By selecting a reasonably stringent raw score of 0.8 (corrected score of 0.42) as the cutoff threshold for structural quality, the list of fifteen structures can be reduced to five: R5 and R1 of the ROBETTA structures, and I5, I4, and I2 of the I-TASSER structures. Figure 8 illustrates the superposition of these five structures with an average bb-rmsd of 2.53 Å. The emergence of structural convergence among the top five selected structures signifies the systematic selection mechanism of 2D-PDPA. It is important to note that 2D-PDPA's selection mechanism is exclusively based on fitness to the experimental data and not simply on clustering of the bb-rmsd data shown in Table 5. This independent and yet consistent selection between 2D-PDPA and bb-rmsd provides strong evidence for the accuracy of the top five structures.

Interpretation of 2D-PDPA Results for Modeled Structures of Pf2048
Results listed in Tables 6 and 7 rank the fitness of the modeled structures. However, these results do not provide any information regarding the accuracy of the modeled structures with respect to the solution state structure of this protein. This information can be retrieved from further analysis of the raw scores that are provided by 2D-PDPA. To interpret the results of 2D-PDPA meaningfully, a simulation exercise has been conducted to relate the PDPA fitness score to backbone RMSD. Here we have utilized protein 1A1Z (83 residues) as a comparable structure to PF2048.1 on the basis of its size and α-helical nature. RDC data have been computed for these two proteins using typically observed order tensors as shown in Table 12. Each dataset has been corrupted through the addition of ±0.5 Hz of uniformly distributed noise. One thousand derivative structures have been generated from the native structure by randomly perturbing the backbone dihedral angles (φ, ψ). The set of derivative structures provided a sampling of the bb-rmsd in the range of 0-8 Å with respect to the starting structure. The 2D-PDPA procedure was then applied to the set of 1000 sample structures. Figure 9 shows the scatter plot of 2D-PDPA scores versus the backbone rmsd values. This figure is valuable in establishing the operational limits of 2D-PDPA as a function of data quality, and helps in interpreting the results shown in Tables 6 and 7. Based on the extrapolated upper and lower boundaries, the scores of 2D-PDPA can be converted to a range of bb-rmsd with respect to the solution state structure of Pf2048.1. Table 8 lists the lower and upper estimates of bb-rmsd for each of the top five modeled structures. Therefore it can be concluded with high certainty that the R5 and the I5 structures are within 3 Å of the solution state structure of PF2048.1.
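One simple way to turn a calibration scatter of this kind into lower and upper bb-rmsd estimates is sketched below. This is an illustrative assumption, not the extrapolation procedure used to produce Table 8: the windowed min/max lookup, the function name, and the calibration values are all hypothetical.

```python
def rmsd_range_for_score(calibration, query_score, window=0.05):
    """Estimate a bb-rmsd range for a 2D-PDPA score from calibration points.

    calibration is a list of (score, bb_rmsd) pairs obtained from perturbed
    structures. The range is the min and max bb-rmsd among calibration points
    whose score lies within +/- window of the query score. Returns None when
    no calibration point is close enough to the query.
    """
    nearby = [rmsd for score, rmsd in calibration
              if abs(score - query_score) <= window]
    if not nearby:
        return None
    return min(nearby), max(nearby)


# Hypothetical (score, bb-rmsd in angstroms) calibration points.
calibration = [(0.40, 0.5), (0.57, 1.2), (0.60, 2.0),
               (0.62, 1.1), (0.90, 4.5), (1.10, 6.8)]
print(rmsd_range_for_score(calibration, 0.60))
```

A windowed lookup is the crudest possible envelope; fitting smooth upper and lower boundary curves to the scatter would give estimates that also extend to sparsely sampled score regions.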

General Experimental Approach and Targeted Protein Structures
A list of protein structures that were utilized during tier 1 and tier 2 of this study is shown in Table 9. These structures range in size from 53 to 364 residues. Table 9 also shows the CATH [16] classification code for the selected structures. For structures 1NCX and 3FIB, CATH has split the sequence into two separate domains. Since 2D-PDPA uses experimental data taken from an entire structure and not an individual domain, the separate CATH domains of 1NCX and 3FIB have not been split into separate structure files. NMR experimental data were retrieved from the BMRB [51,56] for structures 1RWD [44], 1D3Z [54], and 1P7E [53]. Note that structures 1BRF [57] and 1RWD are considered structural homologues with a structural similarity of 1.79 Å measured over the backbone atoms. 1BRF and 1RWD are practically the X-ray and NMR structures of the same protein respectively.

Simulated and Experimental RDC Data of Target Proteins
Simulated RDC data are important for validating the fundamentals and theory of any analysis. For the first phase of evaluation, synthetic RDCs for N-H vectors from two alignment media have been generated for each of the candidate structures shown in Table 9. Table 10 contains the order tensors used to generate the RDC data. Columns 2-6 of this table list the individual elements of the order tensors, and columns 7-8 list their corresponding axial and rhombic components of anisotropy. These order tensors have been selected to reflect alignment properties similar to other experimentally observed alignment tensors. A uniformly distributed error of ±1 Hz was added to each individual data point in order to better simulate experimental conditions. The atomic coordinates of the test proteins shown in Table 9 were downloaded from the PDB and the N-H vectors were extracted and stored in REDCAT format [58]. X-ray structures were protonated using the program Reduce [59]. NMR structures that are normally reported as an ensemble of converged structures were reduced to one representative by selecting the first model in the ensemble.
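The generation of noisy synthetic RDCs can be sketched as follows, using the standard quadratic form D = Dmax · vᵀSv for a unit bond vector v and a symmetric, traceless order tensor S with five independent elements. The function names are hypothetical, and the Dmax value of 24350 Hz for N-H vectors and the element ordering (Sxx, Syy, Sxy, Sxz, Syz) are assumptions made for illustration; sign conventions for Dmax vary between programs.

```python
import random

D_MAX_NH = 24350.0  # Hz; assumed N-H maximum dipolar coupling, not from the text


def back_calculate_rdc(vector, tensor, d_max=D_MAX_NH):
    """Compute one RDC from a bond vector and a five-element order tensor.

    tensor is (Sxx, Syy, Sxy, Sxz, Syz); Szz follows from tracelessness.
    The vector is normalized first, so any nonzero length is accepted.
    """
    x, y, z = vector
    norm = (x * x + y * y + z * z) ** 0.5
    x, y, z = x / norm, y / norm, z / norm
    sxx, syy, sxy, sxz, syz = tensor
    szz = -(sxx + syy)
    quadratic = (sxx * x * x + syy * y * y + szz * z * z
                 + 2.0 * (sxy * x * y + sxz * x * z + syz * y * z))
    return d_max * quadratic


def simulate_rdcs(vectors, tensor, noise_hz=1.0, seed=0):
    """Back-calculate RDCs and add uniform noise in [-noise_hz, +noise_hz]."""
    rng = random.Random(seed)
    return [back_calculate_rdc(v, tensor) + rng.uniform(-noise_hz, noise_hz)
            for v in vectors]


# Hypothetical order tensor and N-H vectors, for illustration only.
tensor = (3e-4, 5e-4, 0.0, 0.0, 0.0)
vectors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(simulate_rdcs(vectors, tensor))
```

Seeding the random generator keeps a simulated data set reproducible, which is convenient when the same noisy data must be reused across different candidate structures.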
The BMRB database [51,52] was screened for proteins with experimentally acquired RDC data from two or more alignment media. The list of potential protein structures was further filtered based on availability of a homologous X-ray structure in order to establish the true applicability of our approach. The final list consisted of three proteins: 1RWD [44], 1D3Z [54] and 1P7E [53]. 1P7E is the third IgG-binding domain of protein G (GB3) (57 residues), which was refined from an X-ray structure (1IGD) using residual dipolar couplings. 1RWD is a mutant of rubredoxin from P. furiosus (53 residues), which was determined entirely from residual dipolar couplings; it is structurally similar to the protein 1BRF [57] that has been characterized by X-ray crystallography. The two structures 1BRF and 1RWD exhibit a structural similarity of 1.79 Å measured over the backbone atoms. 1D3Z is a 76-residue ubiquitin protein from Homo sapiens and its structure was determined with carbonyl chemical shifts that were acquired by NMR spectroscopy. The structure of this protein has also been determined by X-ray crystallography, 1UBQ [57], which exhibits 0.5-1.5 Å of bb-rmsd with respect to the ensemble of NMR structures 1D3Z. Concerns can be expressed over the degree of anti-correlation that is required by 2D-PDPA between the two alignment media. Although two orthogonal and non-correlated sets of data are always desirable, in practice they may not be available. To address this issue the scatter plots of RDC data in two alignment media for proteins 1P7E and 1D3Z are shown in Figure 10(a,b), respectively. The RDC data sets for proteins 1P7E and 1D3Z exhibit R² correlations of 0.83 and 0.67 respectively, indicating significant linear dependence between the two observed alignment media. These two datasets are presented as examples of RDC data with a high degree of correlation as test cases for the 2D-PDPA method.
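The degree of linear dependence between two RDC sets can be quantified with the squared Pearson correlation, as in the minimal sketch below. It assumes that the couplings from the two media are already paired by peak (no residue assignment is needed for that pairing); the function name and the numerical values are hypothetical.

```python
def r_squared(rdc_medium1, rdc_medium2):
    """Squared Pearson correlation between two equally long RDC lists.

    Element k of each list must come from the same spectral peak observed
    in the two alignment media.
    """
    n = len(rdc_medium1)
    mean1 = sum(rdc_medium1) / n
    mean2 = sum(rdc_medium2) / n
    cov = sum((a - mean1) * (b - mean2)
              for a, b in zip(rdc_medium1, rdc_medium2))
    var1 = sum((a - mean1) ** 2 for a in rdc_medium1)
    var2 = sum((b - mean2) ** 2 for b in rdc_medium2)
    return (cov * cov) / (var1 * var2)


# Hypothetical paired couplings in Hz, for illustration only.
medium1 = [-12.1, 4.3, 8.8, -3.0, 15.2]
medium2 = [-9.5, 1.0, 10.4, -6.1, 11.7]
print(round(r_squared(medium1, medium2), 3))
```

Because the value is squared, strongly anti-correlated data sets also score close to 1; inspecting the sign of the covariance distinguishes the two cases.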

Library of Structures Representing Protein Fold Families
A library of 619 decoy structures has been used to evaluate the success of 2D-PDPA in large-scale applications. These 619 structures are the family-fold representatives of the entire PDB database in 2005 as determined by FSSP (http://www.ebi.ac.uk/) [60]. Use of this library of structures was necessary for the comparison of our results to previously published work. It is important to keep in mind that although the content of the PDB database has increased significantly since 2005, the total number of distinct families of protein folds has not. SCOP [15], CATH [16], and FSSP [60] report the total number of family folds as 1393, 1233 and 2860 respectively.
The 619 structures encompass proteins ranging in size from 45 residues to over 450 residues. Many protein structures, especially those determined by NMR spectroscopy, are reported to the PDB as an ensemble of candidate structures. In such instances, the first structure in the PDB file is used as the representative. Protein structures that had been determined by X-ray crystallography were protonated by the software package Reduce [59]. As the final step in preprocessing, atomic coordinates of the backbone N and H atoms were extracted and stored in the REDCAT [58] format. This final preprocessing step was performed to streamline our search algorithm. The 2D-PDPA software package is delivered with tools to expand the library of 619 structures, or to create a customized library of structures. In addition, the library of 619 structures will soon be updated to include the complete CATH family fold representatives.
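The preprocessing described above (first model only, backbone N and H coordinates) can be sketched with a small parser for fixed-column PDB ATOM records. The function name is hypothetical, the amide proton is assumed to be named either H or HN, and the output is a plain dictionary of N-to-H vectors rather than the REDCAT file format, which is not reproduced here.

```python
def first_model_nh_vectors(pdb_lines):
    """Return {(chain, residue_number): (dx, dy, dz)} for backbone N->H vectors.

    Only the first model of an ensemble is read: parsing stops at the first
    ENDMDL record. Residues lacking either atom (for example proline, or an
    unprotonated X-ray structure) are skipped.
    """
    n_atoms, h_atoms = {}, {}
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # ignore every model after the first
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()  # atom name, columns 13-16
        if name not in ("N", "H", "HN"):
            continue
        key = (line[21], int(line[22:26]))  # chain ID, residue number
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        if name == "N":
            n_atoms[key] = xyz
        else:
            h_atoms[key] = xyz
    return {key: tuple(h - n for h, n in zip(h_atoms[key], n_xyz))
            for key, n_xyz in n_atoms.items() if key in h_atoms}
```

Given the lines of a PDB file, for example `first_model_nh_vectors(open(path).read().splitlines())` with a user-supplied `path`, the result can be passed directly to an RDC back-calculation routine.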

NMR Sample Preparation, Data Acquisition and Data
NMR sample preparation and the procedure for alignment of Pf2048.1 with filamentous phage Pf1 have previously been described [22]. The procedure for expression and purification of the Pf2048.1 protein has also been reported previously [22]. Here we briefly highlight some of the critical aspects of this protocol and report on the data acquisition and sample preparation for a second set of RDC data in an alkyl-polyethyleneglycol (PEG) alignment medium. PF2048.1 was prepared for measurements under isotropic conditions at a concentration of 1.6 mM in 20 mM Tris and 70 mM NaCl at pH 7. All samples also contained 2 mM DTT, 0.02% azide, 1 mM DSS and 10% D2O. After isotropic data collection, the PF2048.1 sample was used to prepare two partially aligned samples, providing the two alignment media required by 2D-PDPA. A sample with Pf1 phage as the alignment medium [61] was prepared as described before [22]. A second aligned sample was prepared in 4% C12E5 (PEG, Sigma Aldrich, St. Louis, MO, USA) using previously published protocols [62]. In both cases protein samples were diluted with concentrated alignment medium in sample buffer (16% PEG, for example). Final protein concentrations in the aligned media were approximately 1.2 mM.
NMR data were collected on a Varian Unity Inova 600 MHz spectrometer at 298 K using a conventional z-gradient triple resonance probe or a z-gradient triple resonance cryogenic probe (Varian Inc., Palo Alto, CA, USA). The one-bond 1H-15N couplings for isotropic and aligned samples were measured using 15N-IPAP-HSQC experiments [63]. Data collection included 256 t1 points and 2048 t2 points collected over 12 h. Residual dipolar couplings were calculated as the difference of the couplings measured in the aligned and isotropic conditions. All data were processed using NMRPipe and visualized using NMRDraw [64] as previously described [22].
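The final arithmetic step, taking the RDC as the aligned coupling minus the isotropic coupling for each peak, can be written as a short sketch. The function name, the peak labels, and the coupling values are hypothetical; peaks are matched by an arbitrary label, not by residue assignment.

```python
def residual_dipolar_couplings(aligned_hz, isotropic_hz):
    """Return {peak: aligned - isotropic} for peaks measured in both samples.

    Both arguments map a peak label to the one-bond 1H-15N splitting in Hz.
    Peaks missing from either sample are dropped, which is one source of
    incomplete RDC data sets.
    """
    return {peak: aligned_hz[peak] - isotropic_hz[peak]
            for peak in aligned_hz if peak in isotropic_hz}


# Hypothetical splittings in Hz, for illustration only.
isotropic = {"peak1": -93.0, "peak2": -92.5, "peak3": -93.5}
aligned = {"peak1": -88.5, "peak2": -101.0, "peak4": -90.0}
print(residual_dipolar_couplings(aligned, isotropic))
```

Here peak3 and peak4 are each observed in only one sample, so only two RDCs are reported.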

Computational Modeling of Pf2048.1
PF2048.1 is a 9.16 kDa, 78-residue (including His-tag) monomeric protein with less than 26% sequence identity to any structurally characterized protein. A total of fifteen structural models were obtained from the ROBETTA and I-TASSER modeling tools available online at http://robetta.bakerlab.org [13] and http://zhang.bioinformatics.ku.edu/I-TASSER/ [5] respectively. Both servers accept the primary sequence of a protein and return a number of modeled structures. In this instance ROBETTA produced 10 structural models and I-TASSER produced 5 structural models as shown in Section 0. I-TASSER and ROBETTA (a derivative of ROSETTA) are consistently highly ranked in the CASP competitions. Both of these modeling tools leverage known structural information for homologous segments of the unknown protein and perform ab initio calculations for the remaining portions of the protein.

Outline of 2D-PDPA Method
2D-PDPA is an extension of the 1D-PDPA method [21,22] that allows simultaneous analysis of RDC data from a second alignment medium. Simultaneous consideration of RDC data from two alignment media has not previously been explored due to its computational time requirement. Our 2D-PDPA method has provided a computationally feasible approach that places a more robust system of scrutiny on the candidate structures. The overall principle that 2D-PDPA utilizes is that two similar structures must exhibit a similar distribution of RDC data as shown in Figure 11. In this figure, the distribution of RDC points is a function of the protein structure and can be used as a structural fingerprint of an unknown protein. Therefore a measure of similarity between two distributions of RDC data can be interpreted as a measure of structural similarity.

Figure 11. An example of a 2D-PDP map generated using kernel density estimation. This 2D-PDP can serve as a structural fingerprint.

Overall operations of 2D-PDPA proceed in three main stages as shown in Figure 12. During the first stage, experimental RDC data are analyzed to estimate seven of the ten needed parameters [55,65] that are used to back-calculate RDC data from any given structure in two alignment media. During this stage, the scatter of the RDC data in two alignment media is converted to a distribution function using Kernel Density Estimation [2,3,21]. This distribution is constructed through superposition of Gaussian kernels that are centered at each RDC data point. Figure 11 illustrates an example distribution map that is denoted as ePDP throughout this report. The term ePDP refers to a distribution map (or a fingerprint) that is generated from experimental data. During the second stage of 2D-PDPA, a similar map is created based on the back-calculated RDC data from each of the protein structures available in the library of structures using the same Kernel Density Estimation procedure. The computed maps are denoted as cPDPs. For each structure in the database, a cPDP is created for each possible rotation of the structure in a grid search over the Euler angles (α, β, γ) at a resolution of 5°. These three angles are the remaining parameters needed for back-calculation of the RDC data, and they essentially represent all possible orientations of any given structure. RDCs are insensitive to 180° rotations; hence the search space can be reduced to a range of [0°-180°] in increments of 5° for each parameter, which yields 46,656 alternate cPDPs (36 × 36 × 36 rotations over α, β and γ). Each of these cPDPs is compared to the ePDP, and the best matching score and its corresponding Euler angles are recorded for each structure in the database for the third and final stage of 2D-PDPA.
During the concluding stage of 2D-PDPA, all of the proteins in the library of structures are ranked based on their 2D-PDPA fitness, which was measured during the previous stage, and the results are reported.

Scoring and Interpretation of 2D-PDPA Raw Scores
In contrast to 1D-PDPA [21,22], which utilizes the χ² metric [3] of comparison, 2D-PDPA employs a more intuitive Manhattan (or City-Block) metric [3] for comparison of the cPDP and ePDP. Equation (1) describes the Manhattan distance that is computed by 2D-PDPA:

$$B = \sum_{i \in M_1} \sum_{j \in M_2} \left| \mathrm{cPDP}_{ij} - \mathrm{ePDP}_{ij} \right| \, \delta_i \, \delta_j \qquad (1)$$

In this equation B denotes the 2D-PDPA raw score (Block score), the summation indices i and j traverse the entire range of RDCs over the two alignment media M1 and M2, and δi and δj denote the step size of uniform grid sampling along each of the RDC dimensions. cPDPij and ePDPij represent the likelihood reported by each PDP at grid location (i, j). Since the cPDP and ePDP are normalized to be proper probability density functions, each integrates to one over the entire range of RDCs. Therefore the B-score has an effective range of [0, 2], where a score of 0 indicates 100% similarity and a score of 2 indicates 0% similarity between the two structures. Furthermore, when the B-score is scaled by a factor of ½, it can be interpreted as the fraction of structural dissimilarity between the query protein and the unknown target protein. This interpretation can be used to establish a threshold for acceptability of a ranked sample structure. A number of factors, such as the quality and completeness of the experimental data, need to be considered in interpreting the B-scores. However, unlike analyses that rely on assigned RDC data, normalization based on strength of alignment (such as the Q-factor) is not needed, because 2D-PDPA compares distributions of RDCs rather than individual RDCs.

In instances where meaningful bb-rmsd values can be calculated between the members of the search database and the unknown protein (such as the Pf2048 exercise), a more informative relationship between the 2D-PDPA B-score and the expected bb-rmsd can be established. Such interpretation patterns can be created based on the following observations:

1. Interpretation patterns are primarily a function of the class of protein structure (α or β protein) and protein size.
2. Interpretation patterns depend on completeness of data.
3. Interpretation patterns exhibit a dependency on the quality of experimental data, and more directly on the quality of the two estimated order tensors.

The latter dependency is intuitive and has been investigated in the literature [21,22,55,65]; it is therefore not discussed further in this report. We demonstrate the first and second dependencies by generating a scatter plot of bb-rmsd versus the corresponding B-score for 1000 derivative structures. These derivative structures were generated by randomly altering backbone dihedral angles of the native structure of a given protein. For each altered structure, a B-score and a bb-rmsd were computed with respect to the native structure. In this exercise we have used two sample α-proteins (1A1Z and 2M67) and two sample β-proteins (1F53 and 1PMR) that are of approximately equal size. Table 11 shows the detailed information for each of these four proteins. It is important to note that the two proteins in each structural class are unrelated. Figure 13 illustrates the interpretation patterns for each of the two classes. The two patterns are remarkably well conserved between the two proteins from the same structural class. Any noted differences can be attributed to random sampling of the space and should diminish as the number of random samples increases.

These interpretation patterns also exhibit a very predictable behavior as a function of missing data. To illustrate this point, we repeated the above exercise on the α-protein set {1A1Z, 2M67} after randomly removing 25% and 30% of the data. The final results are shown in Figure 14 and, as expected, the lowest scores correspond to the percentage of missing data [shown in Equation (2)]. This exercise was repeated for a number of other proteins with very similar results (not shown here).
Based on this observation, a corrected score can be computed by subtracting the fraction of missing data from the raw score. This correction eliminates the contribution of missing data and allows for easier comparison of 2D-PDPA scores across different analyses:

$$B_{corrected} = B - f_{missing} \qquad (2)$$

where f_missing denotes the fraction of RDC data that is missing.

Figure 14. Sensitivity of 2D-PDPA analysis as a function of bb-rmsd on two α-proteins 1A1Z and 2M67 (a) with 25% of the data randomly removed and (b) with 30% of the data randomly removed.
The noted properties of the 2D-PDPA Block scoring mechanism enable the creation of an interpretation pattern from another protein with structural attributes similar to those of the target protein. The resultant interpretation pattern can then be used to establish the quality of the structure selected by 2D-PDPA. We have utilized this mechanism to establish the quality of modeled structures for Pf2048.
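The Block score and the missing-data correction described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions: both maps are assumed to be sampled on the same uniform grid and normalized to integrate to one, and the function names `block_score` and `corrected_score` are hypothetical.

```python
import numpy as np

def block_score(cpdp, epdp, step_m1, step_m2):
    """Manhattan (City-Block) distance between two normalized 2D maps.

    step_m1 and step_m2 are the uniform grid step sizes along the two
    RDC dimensions. The result lies in [0, 2] for normalized maps.
    """
    cpdp = np.asarray(cpdp, dtype=float)
    epdp = np.asarray(epdp, dtype=float)
    return float(np.abs(cpdp - epdp).sum() * step_m1 * step_m2)

def corrected_score(raw_score, missing_fraction):
    """Subtract the fraction of missing data (0 to 1) from the raw score."""
    return raw_score - missing_fraction

# Two maps with all of their density in different grid cells are
# completely dissimilar, so the raw score reaches its maximum.
step = 0.5
a = np.zeros((4, 4))
b = np.zeros((4, 4))
a[0, 0] = 1.0 / (step * step)   # integrates to one
b[3, 3] = 1.0 / (step * step)   # integrates to one
print(block_score(a, a, step, step))        # identical maps
print(0.5 * block_score(a, b, step, step))  # fraction of dissimilarity
```

Multiplying the raw score by ½, as in the last line, gives the fractional dissimilarity interpretation discussed above.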

Computational Facilities
Large-scale applications of 2D-PDPA require execution times that exceed 24 h when implemented on a typical desktop computer. To expedite our data analysis, 2D-PDPA has been ported and utilized on a Linux cluster. This high-performance computing platform consisted of a 76-node 152-core Intel Xeon

Residual Dipolar Coupling
Residual Dipolar Couplings (RDCs) are obtained from Nuclear Magnetic Resonance (NMR) spectroscopy of weakly aligned samples. Although RDCs had been observed as early as 1963 [66] in nematic environments, they have only recently become more commonly used for direct investigation of molecular structures and internal dynamics. RDCs have been the subject of a number of reviews [46,67]. They have been applied to structure determination of proteins [33,34,[68][69][70][71][72], nucleic acids [71,[73][74][75][76] and carbohydrates [77][78][79][80]. RDCs have been utilized in structure determination of challenging proteins such as membrane proteins [81][82][83][84] and homo-oligomeric proteins [70], and in the study of dynamics [75,85,86]. RDCs have also been used in the assembly of molecular complexes from individual domains [78].

Residual Dipolar Couplings are a measurement of the dipole-dipole interaction between two nuclear spins in the external magnetic field of an NMR instrument. Equation (3) shows the time-averaged formulation of the RDC phenomenon for two spin ½ nuclei:

$$D_{ij} = D_{max} \left\langle \frac{3\cos^2\theta(t) - 1}{2} \right\rangle \qquad (3)$$

Here, i and j denote the two nuclei and θ is the angle between the inter-nuclear vector and the magnetic field. The angle brackets in Equation (3) denote time averaging, and D max is the maximum observable RDC value for a particular pair of nuclei, as defined in Equation (4):

$$D_{max} = -\frac{\mu_0 \, \gamma_i \, \gamma_j \, \hbar}{4\pi^2 r^3} \qquad (4)$$

In Equation (4), µ0 is the magnetic permeability of free space, γi and γj are the gyromagnetic ratios of the two corresponding nuclei, r is the inter-nuclear distance between i and j, and ħ is the reduced Planck's constant. For isotropically tumbling molecules, the time-averaged RDC value from Equation (3) reduces to zero. Observation of the RDC interaction, therefore, requires perturbation of the isotropic tumbling of molecules by introducing an alignment medium.
Examples of alignment media include liquid crystalline bicelles, filamentous bacteriophage (Phage), polyacrylamide gel (PAG), and alkyl-polyethylene glycol (PEG) detergents in water [65,87] to name a few. The use of such alignment media induces an anisotropic distribution of orientations for tumbling molecules, therefore allowing for non-zero RDC values to be observed. Subsuming the effects of time averaging and proper factorization of the RDC interaction produces the final formulation of the RDC interaction as shown in Equation (5).
$$D_{ij} = D_{max} \, v_{ij}^{T} \, R(\alpha,\beta,\gamma) \begin{pmatrix} s_{xx} & 0 & 0 \\ 0 & s_{yy} & 0 \\ 0 & 0 & s_{zz} \end{pmatrix} R(\alpha,\beta,\gamma)^{T} \, v_{ij} \qquad (5)$$

$$s_{xx} + s_{yy} + s_{zz} = 0 \qquad (6)$$

In Equation (5), v ij is the unit inter-nuclear vector between atoms i and j, and R(α, β, γ) is an Euler rotation (defined by the three angles α, β and γ) that describes the preferred alignment of the structure in relation to any arbitrary orientation of the protein. The preferred orientation of the molecule within the NMR magnetic field is referred to as the Principal Alignment Frame (PAF). The principal order parameters s xx , s yy and s zz describe the degree of order (or strength of alignment) along each of the main axes of alignment and, according to Equation (6), sum to zero. Equation (5) can be used in various contexts [21,33,34,55,58,65]; however, within the scope of this work it has been used to back-calculate RDC data for a given structure and to estimate the principal order parameters s xx , s yy and s zz .
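As an illustration of how the quantities in this section fit together, the sketch below computes D max from physical constants and back-calculates RDCs from unit inter-nuclear vectors. It rests on several assumptions that are not taken from this report: D max is written in its standard physical form, −µ0 γi γj ħ / (4π² r³); the order tensor is formed as R diag(s xx , s yy , s zz ) Rᵀ; a Z-Y-Z Euler convention is chosen arbitrarily; and the constants are approximate literature values. All function names are hypothetical.

```python
import math
import numpy as np

MU_0 = 4e-7 * math.pi      # vacuum permeability, T m / A (approximate)
HBAR = 1.054571817e-34     # reduced Planck constant, J s
GAMMA_1H = 2.6752e8        # 1H gyromagnetic ratio, rad / (s T), approximate
GAMMA_15N = -2.7116e7      # 15N gyromagnetic ratio, rad / (s T), approximate

def d_max(gamma_i, gamma_j, r):
    """Maximum observable RDC in Hz for two nuclei separated by r meters."""
    return -MU_0 * gamma_i * gamma_j * HBAR / (4.0 * math.pi**2 * r**3)

def euler_zyz(alpha, beta, gamma):
    """Rotation matrix Rz(alpha) @ Ry(beta) @ Rz(gamma), angles in radians.

    The Z-Y-Z convention is an assumption made for this sketch.
    """
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    rz_a = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    ry_b = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    rz_g = np.array([[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]])
    return rz_a @ ry_b @ rz_g

def back_calculate_rdcs(unit_vectors, alpha, beta, gamma, s_yy, s_zz, dmax):
    """Back-calculate one RDC per unit vector (rows of an N x 3 array)."""
    s_xx = -(s_yy + s_zz)  # traceless constraint on the order parameters
    rot = euler_zyz(alpha, beta, gamma)
    order_tensor = rot @ np.diag([s_xx, s_yy, s_zz]) @ rot.T
    v = np.asarray(unit_vectors, dtype=float)
    # Quadratic form v^T S v, evaluated for every vector at once.
    return dmax * np.einsum("ni,ij,nj->n", v, order_tensor, v)
```

With zero Euler angles, a vector along the z axis returns D max · s zz . Because the order tensor is traceless, the RDCs of any three mutually orthogonal vectors sum to zero, which is a convenient sanity check.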

Estimation of Order Tensor and Calculation of RDCs
Back-calculation of RDC data for a given structure is of central importance to the work that is presented here. Equation (5) can be used to conveniently back-calculate RDC data for a given protein structure in the presence of additional information. In Equation (5), the coordinates for any interacting vector v ij can be obtained from a given PDB file. Five additional parameters [α, β, γ, s yy , s zz ; note that s xx can be reconstructed from Equation (6)] per alignment medium need to be estimated to describe the alignment of the protein in each of the anisotropic media. Therefore, utilization of RDC data from n alignment media requires the estimation of 5n parameters (ten parameters for two media). Due to the difficulty in estimating these parameters, the utility of RDC data from multiple alignment media has been limited in the past. However, recent developments [55,65] have demonstrated the possibility of accurately reconstructing 5n-3 of the needed parameters (seven for two media) from analysis of the unassigned RDC data in the absence of structural information. The remaining three parameters, namely the α, β and γ of the molecular frame with respect to the principal alignment frame of the order tensor of the first alignment medium [22,55,65], can be obtained via a grid search. Figure 15 illustrates the results of the 2D-RDC analysis method in estimating the relative order tensors for the unknown protein Pf2048.1. Section 4.11 provides the details of the proposed approach and how the listed order tensor in Table 12 is used throughout our analyses.
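The parameter count and the grid search over the three remaining Euler angles can be sketched as follows. This is a schematic outline only: `euler_grid` and `best_orientation` are hypothetical names, and `score_fn` stands in for whatever routine builds a cPDP at a given orientation and scores it against the ePDP (lower is better).

```python
import itertools
import numpy as np

def free_parameters(n_media):
    """Return (total, estimated from unassigned RDCs, left for grid search)."""
    total = 5 * n_media
    return total, total - 3, 3

def euler_grid(step_deg=5, max_deg=180):
    """Yield (alpha, beta, gamma) in radians over [0, max_deg) per angle."""
    angles = np.deg2rad(np.arange(0, max_deg, step_deg))
    return itertools.product(angles, repeat=3)

def best_orientation(score_fn, step_deg=5):
    """Exhaustively search the grid and keep the lowest-scoring orientation."""
    best_score, best_angles = float("inf"), None
    for alpha, beta, gamma in euler_grid(step_deg):
        score = score_fn(alpha, beta, gamma)
        if score < best_score:
            best_score, best_angles = score, (alpha, beta, gamma)
    return best_score, best_angles

print(free_parameters(2))               # two alignment media
print(sum(1 for _ in euler_grid(5)))    # number of orientations at 5 degrees
```

Because every orientation is scored independently, the loop in `best_orientation` is straightforward to distribute across structures or across slices of the grid.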

Conclusions
Computational modeling tools have made significant advances in recent years, including structure determination of proteins with less than 30% sequence identity to any existing protein structure. Despite these advances, confidence in computationally modeled structures remains low, especially when the ensemble of modeled structures lacks any distinct convergence in structure. As we have shown in Section 5.3, even when the modeled structures are consistent with one another, it is likely that one of the structures is a significantly better representation of the native structure than the others. Interpretation of unassigned RDC data by 2D-PDPA can be instrumental in ranking these structures and, therefore, in increasing the value of computational modeling tools as another acceptable avenue of structure determination alongside NMR and X-ray crystallography.
Based on the data reported by the Protein DataBank [as shown in Figure 1(b)], it can be concluded that a large portion of the proteins (more than 90%) that are characterized by the general community of investigators are redundant structures. The structure determination protocols as commonly practiced do not distinguish between novel and common protein structures and therefore lead to a fixed cost of structure determination. Application of 2D-PDPA at a stage before undertaking NMR resonance assignments can identify redundant structures and give the option of halting structure determination before additional costs are incurred. It stands to reason that the cost of structure determination should be proportional to the novelty of the target protein. In Section 5.1 we have demonstrated the success of 2D-PDPA in identifying the fittest structure in a large-scale application. In addition, in Section 0 we have demonstrated the success of 2D-PDPA in identifying the modeled structure that is closest to the native structure of Pf2048.1. Based on these results, it is easy to envision a protocol in which the novelty of a structure is determined as the first step of structure determination, before commitment to the full spectrum of data acquisition or crystallization experiments. If the unknown protein is determined to be a common protein, then computational modeling tools followed by proper ranking and validation of the modeled structures can be deployed as a viable and cost-effective method of structure determination. If the unknown protein is deemed to be novel, then it can be subjected to a full experimental method of structure determination. This proposed approach could lead to a significant reduction in the average cost of structure determination while helping to identify novel structures for initiatives such as the Protein Structure Initiative or Structural Genomics Initiatives.