This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Several efficient algorithms to conduct pairwise comparisons among large databases of protein structures have emerged in the recent literature. The central theme is the design of a measure between the _{α}

The problem of aligning protein structures to infer functional similarity is a stalwart within the arena of computational biology; see [

Classical structure alignment algorithms like Dali [

The modern cohort of fast structural alignment algorithms sacrifices accuracy with regard to the pairwise comparisons in exchange for increased speed. Many of the fast algorithms (so-called 1D algorithms [

The speed with which the most recent algorithms can compare proteins enables a wealth of numerical studies that would previously have been difficult, if not impossible. We harness this efficiency to test whether the structural alignment algorithm eigen-decomposition with the spectrum (EIGAs) is robust with regard to parametric variation and modeling uncertainty. Like many of the recent, efficient algorithms, EIGAs uses dynamic programming (DP) as its underlying computational framework. DP depends on parameters that generally need to be tuned to the application and, in some cases, to the particular problem instance. This raises the question of how sensitive the recent DP-based algorithms are to the choice of parameters. If parameters have to be selected somewhat precisely to obtain accuracy and/or have to be tuned per dataset, then the technique would lack sufficient robustness to instill confidence on much larger datasets, for which individualized parametric tuning is impractical. We show that EIGAs is robust to parameter selection by showing that it remains highly effective at identifying structurally similar proteins over a breadth of parametric values.

The optimization model solved by DP depends on a measure of the structural similarity between two proteins. However, the coordinates of a residue are not known with perfect certainty, and the structural similarity measure and, subsequently, the alignments themselves are subject to possible adjustments, as a protein's 3D structural description varies within appropriate tolerances. As with parametric variation, if an algorithm is sensitive to small variations in structure, then it will likely perform poorly, since real datasets always have a level of uncertainty. We show that EIGAs is robust against such uncertainty by showing that it identifies a preponderance of structurally similar proteins as the coordinates of the _{α}_{α}

The next section discusses how DP is used to efficiently align protein structures. This discussion points to why DP might have a natural affinity for accounting for parametric variation and structural uncertainty. Sections 3 and 4 specify EIGAs' DP model and detail the computational environment used to conduct our numerical experiments. Section 5 reports on our numerical tests showing that EIGAs is robust against parametric adjustment. Section 6 reports on the robustness of EIGAs as coordinate uncertainty is considered.

Dynamic programming originated with Bellman [

Suppose we want to align two sequences, one indexed by _{ij}

Pair Value 0.1 0.3 0.2 0.2 0.1.

An optimal solution depends on the gap penalty, _{ij}_{ij}
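The role of the gap penalty can be illustrated with a minimal sketch of a sequential alignment computed by DP. The sketch assumes that the objective is to maximize the sum of the matched pair values minus a fixed penalty for every unmatched position; the function name `align`, the argument `score` and the single linear penalty `gap` are hypothetical choices made for illustration and are not taken from the EIGAs code.

```python
def align(score, gap):
    """Global sequential alignment by dynamic programming.

    score[i][j] is the (assumed) value of pairing item i of the first
    sequence with item j of the second; gap is the penalty charged for
    each item left unmatched.  Returns (optimal value, matched pairs).
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # best[i][j] = optimal value for the first i items against the first j items
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # match i with j
                best[i - 1][j] - gap,                      # skip item i
                best[i][j - 1] - gap,                      # skip item j
            )
    # Trace back through the table to recover one optimal set of pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Hypothetical pair values: two items against three items.
value, pairs = align([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], gap=0.25)
print(value, pairs)
```

Raising `gap` in this toy instance makes skipping a position more expensive, which is the mechanism by which the penalty changes the optimal alignment.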

There are numerous variants of the simple recursion in _{o}_{c}

While the discussion above depicts DP favorably, we would be remiss to ignore some of its downsides. All uses of DP for database applications only consider sequential alignments. Non-sequential alignments are important in some, and, indeed, possibly many, cases [

A protein (chain) is a linear polymer of amino acid residues that folds into a unique 3D structure. The amino acid residues are held together with strong covalent bonds, and the covalent backbone is comprised of a repeating pattern of atoms identical for each residue. However, each residue is distinguished by a side chain of atoms that is not on the backbone, but is attached to the alpha carbon backbone atom, denoted _{α}_{α}_{α}

Let _{kt}_{kt}_{kk}_{kk}^{T}^{T}^{−1} and ^{T} R

The similarity between residues _{ij}_{i}_{j}|/|λ_{i}_{j}_{i}_{j}_{p}_{pi}_{q}_{qi}

| Size | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| Assignments | 0.42% | 4.03% | 6.36% | 9.15% | 11.41% | 18.06% | 12.18% | 9.66% | 3.93% | 24.79% |

Nearly 75% of the residues are assigned eigenvalues whose magnitude is below 90% of the largest eigenvalue, and about 31% of the residues are assigned eigenvalues whose magnitude is below 50% of the largest eigenvalue. Unlike PCA or NMA, EIGAs' assignment associates each residue with the eigenvalue of the nearest eigenspace, and as the distribution above shows, many of a protein's more minor eigenvalues are used by EIGAs.
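The two percentages quoted above can be checked by accumulating the tabulated assignment frequencies. The short sketch below assumes that each size column collects the residues whose assigned eigenvalue magnitude is at most that percentage of the largest eigenvalue; the variable names are illustrative only.

```python
# Assignment frequencies (in percent) copied from the distribution above.
sizes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
assignments = [0.42, 4.03, 6.36, 9.15, 11.41, 18.06, 12.18, 9.66, 3.93, 24.79]

# Cumulative share of residues up to and including each size column.
cumulative = {}
running = 0.0
for size, share in zip(sizes, assignments):
    running += share
    cumulative[size] = round(running, 2)

print(cumulative[50])  # share up to the 50% column, about 31%
print(cumulative[90])  # share up to the 90% column, about 75%
```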

The EIGAs code is implemented in Python, although portions, such as the DP core, are written in C as Python extensions via SWIG [

EIGAs' computational efficiency compares favorably with other published results. The Eig_7 algorithm in [

The earlier numerical results of EIGAs in [_{ij}_{i}_{j}_{ij}_{i}_{j}_{i}_{j}

Three algorithmic parameters affect alignments, those being the cutoff value, _{o}_{c}_{o}_{c}

An ROC curve is generated for each (_{o}_{c}

Illustrative values assigned to optimal alignments from dynamic programming (DP).

| | Prot. A | Prot. B | Prot. C | Prot. D | Prot. E |
|---|---|---|---|---|---|
| Prot. A | 0.00 | 0.21 | 0.43 | 0.13 | 0.68 |
| Prot. B | 0.21 | 0.00 | 0.61 | 0.26 | 0.34 |
| Prot. C | 0.43 | 0.61 | 0.00 | 0.80 | 0.08 |
| Prot. D | 0.13 | 0.26 | 0.80 | 0.00 | 0.19 |
| Prot. E | 0.68 | 0.34 | 0.08 | 0.19 | 0.00 |
| Family ID | 1 | 1 | 2 | 1 | 2 |

A perfect classification has a TPR of one and an FPR of zero, and hence, it is desirable for an ROC curve to pass as close to this point as possible. A random classifier gives identical TPR and FPR values. Therefore, if an ROC curve is above the TPR = FPR line, then the classification method is better than random classification, but if the ROC curve is below the TPR = FPR line, then the classification is worse than random. A common metric is the area under the ROC curve (AUROC), which is at best one and at worst zero. The area under the ROC curve, along with the curve's minimum distance to a perfect classification, are the metrics used to evaluate the alignments for each (_{o}_{c}
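As a concrete illustration of these metrics, the sketch below builds an ROC curve from the illustrative values for Proteins A through E and their family IDs. It assumes that a smaller value indicates greater similarity (the diagonal entries are zero), so a pair is classified as belonging to the same family when its value is at most a threshold; the function names are hypothetical, and the area is computed with the trapezoidal rule.

```python
import math

# Upper triangle of the illustrative table and the family of each protein.
values = {
    ("A", "B"): 0.21, ("A", "C"): 0.43, ("A", "D"): 0.13, ("A", "E"): 0.68,
    ("B", "C"): 0.61, ("B", "D"): 0.26, ("B", "E"): 0.34,
    ("C", "D"): 0.80, ("C", "E"): 0.08, ("D", "E"): 0.19,
}
family = {"A": 1, "B": 1, "C": 2, "D": 1, "E": 2}


def roc_points(values, family):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    labeled = [(v, family[p] == family[q]) for (p, q), v in values.items()]
    n_pos = sum(1 for _, same in labeled if same)
    n_neg = len(labeled) - n_pos
    points = [(0.0, 0.0)]
    for tau in sorted({v for v, _ in labeled}):
        tp = sum(1 for v, same in labeled if same and v <= tau)
        fp = sum(1 for v, same in labeled if not same and v <= tau)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def min_distance_to_perfect(points):
    """Smallest Euclidean distance from the curve's points to (FPR, TPR) = (0, 1)."""
    return min(math.hypot(fpr, 1.0 - tpr) for fpr, tpr in points)


points = roc_points(values, family)
print(auroc(points), min_distance_to_perfect(points))
```

In this toy example only the pair (D, E) is ranked ahead of some same-family pairs, so the curve passes within 1/6 of the perfect classification and the area is 11/12.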


Our computational results show that the AUROC exceeds 0.95, as long as 7 ≤ _{o}_{c}

The best area under the receiver operating characteristic curve (AUROC) over all (_{o}_{c}

The best AUROC over all (_{c}_{o}

The parameter tuning suggests that _{o}_{c}

Parameters _{o}_{c}

| | Best Tuned Parameters | | | | Results on | | |
|---|---|---|---|---|---|---|---|
| | AUROC | Cutoff | _{o} | _{c} | AUROC | TPR | FPR |
| Sub-Group 1 | 0.9767 | 19 | 0.3 | 0.3 | 0.9801 | 0.9389 | 0.0724 |
| Sub-Group 2 | 0.9959 | 21 | 0.0 | 0.3 | 0.9696 | 0.9067 | 0.0836 |
| Sub-Group 3 | 0.9801 | 16 | 0.7 | 0.2 | 0.9728 | 0.9100 | 0.0848 |

The best AUROC over all (_{o}_{c}

Beyond parametric variation, optimal alignments also depend on the _{α}_{ij}

There are two common methods for establishing a protein's structure: X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Each accounts for variability differently, and these differences need to be reconciled to have a consistent study of coordinate uncertainty. We reconcile the differences probabilistically and assume that the coordinates of each _{α}_{o}_{c}

The receiver operating characteristic (ROC) curve for the entire Proteus300 dataset with _{o}_{c}

NMR experiments generate multiple descriptions of a protein in aqueous solution, and pdb files from NMR experiments include numerous 3D models. An overlay of all 20 models from the NMR experiment of the protein structure 1NTR is shown in _{α}_{α}
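One way to turn a collection of NMR models into a per-atom description of uncertainty is to compute the mean and the sample variance of each coordinate across the models. The sketch below is only an illustration under that assumption; it presumes the models are already superimposed and list the same atoms in the same order, and the function name and data layout are hypothetical rather than taken from the EIGAs code.

```python
from statistics import mean, variance


def coordinate_mean_and_variance(models):
    """Per-atom coordinate statistics across NMR models.

    models[m][k] is the (x, y, z) position of atom k in model m.
    Returns two lists indexed by atom: the mean position and the
    sample variance of each coordinate.
    """
    n_atoms = len(models[0])
    means, variances = [], []
    for k in range(n_atoms):
        # Gather the x, y and z values of atom k over all models.
        xs, ys, zs = zip(*(model[k] for model in models))
        means.append((mean(xs), mean(ys), mean(zs)))
        variances.append((variance(xs), variance(ys), variance(zs)))
    return means, variances


# Two hypothetical models of a two-atom chain.
models = [
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],
    [(2.0, 0.0, 0.0), (3.8, 0.0, 2.0)],
]
means, variances = coordinate_mean_and_variance(models)
print(means[0], variances[0])
```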

X-ray crystallography requires that a protein first be crystallized, and a solution must become saturated enough to create a crystalline structure that can then be imaged with X-rays. The advantage over NMR is that X-ray crystallography more easily accommodates large proteins. The disadvantage is that the 3D descriptions are not of the proteins in a natural aqueous solution. Uncertainty for X-ray crystallography experiments is expressed by a list of B-factors, each of which assesses the variation of the spatial coordinates of a specific atom. A B-factor, often called a Debye-Waller factor, is a scaled version of the mean squared displacement of an atom. If _{i}_{α}_{i}^{2}_{α}_{α}

An unperturbed depiction of 1QMPa is shown on the left, and a sample from a random model of the same protein chain is shown in the center (

The reason for scaling the variances is that it provides a tool to adjust coordinate uncertainty, and hence, it can be used to assess an algorithm's sensitivity to coordinate uncertainty. A robust algorithm will remain effective for large values of
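A scaled perturbation of this kind can be sketched as follows. The sketch assumes the standard relation B = 8π²⟨u²⟩ between a B-factor and the mean squared displacement, treats ⟨u²⟩ as the variance of each coordinate, and draws the three coordinates independently from normal distributions; the multiplier `scale`, the function names and the choice of a per-coordinate variance are assumptions made for illustration and may differ from the model used in the study.

```python
import math
import random


def displacement_variance(b_factor):
    """Mean squared displacement implied by a B-factor, using B = 8 * pi^2 * <u^2>."""
    return b_factor / (8.0 * math.pi ** 2)


def perturb(coords, b_factors, scale=1.0, rng=None):
    """Sample perturbed coordinates with independent normal noise.

    coords[k] is the (x, y, z) position of atom k and b_factors[k] its
    B-factor.  Each coordinate is drawn with variance
    scale * displacement_variance(b_factors[k]) (an assumed convention).
    """
    rng = rng or random.Random()
    perturbed = []
    for (x, y, z), b in zip(coords, b_factors):
        sigma = math.sqrt(scale * displacement_variance(b))
        perturbed.append((rng.gauss(x, sigma), rng.gauss(y, sigma), rng.gauss(z, sigma)))
    return perturbed


# Hypothetical two-atom example; scale > 1 inflates the coordinate uncertainty.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
b_factors = [20.0, 35.0]
sample = perturb(coords, b_factors, scale=4.0, rng=random.Random(0))
print(sample)
```

Setting `scale` to zero returns the original coordinates, and increasing it widens the distributions, which is how a single multiplier can be used to probe sensitivity to coordinate uncertainty.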

Before continuing, we address an important criticism of using uncorrelated distributions for coordinate perturbations. Atomic coordinates are not uncorrelated, and for this reason, the uncorrelated assumption will overestimate coordinate variability. However, should an algorithm remain stable for large

To illustrate the effect of randomizing the coordinates, we consider 1QMPa with each coordinate being distributed as in

We initially varied

On the left, AUROC (top) and the best TPR (middle) and FPR (bottom) are shown as

The success of EIGAs for

The immediate conclusion of our numerical work is that EIGAs is robust against variations that could alter the DP algorithm. Moreover, our adapted three-fold cross-validation suggests that EIGAs can be tuned on moderate-sized datasets and then applied to larger collections. A secondary conclusion is that the modern DP-based algorithms are likely to exhibit similar robustness, since DP can be stable over wide ranges of input data.

The results for large values of

A near-term research question is whether several of the other DP algorithms designed for fast alignments share EIGAs' robustness. If so, then a logical conclusion would be that DP is the reason why efficient algorithms are finding success for a variety of different similarity models.

The authors thank Dr. Eric Reyes for his suggestion on adapting a three-fold cross-validation test, Dr. Joseph Eichholz for his programming aid and Dr. Mark Brandt for his guidance on biochemistry. We also thank three anonymous referees for their comments on an earlier version.

The authors declare no conflict of interest.

The chain d1fvqa, which is copper-transporting ATPase, has a single model and no B-factor. Hence, this chain was constant throughout the study.