# Dynamic Programming Used to Align Protein Structures with a Spectrum Is Robust


## Abstract


… $C_{\alpha}$ atoms of two protein chains, from which dynamic programming is used to compute an alignment. The efficiency and efficacy of these algorithms allow large-scale computational studies that would previously have been impractical. The computational study herein shows that the structural alignment algorithm eigen-decomposition alignment with the spectrum (EIGAs) is robust against both parametric and structural variation.

## 1. Introduction

… $C_{\alpha}$ atoms are perturbed randomly. We account for uncertainty by assuming that the atomic coordinates of the $C_{\alpha}$ atoms are probability distributions whose standard deviations are scaled B-factors.

## 2. Dynamic Programming and Robustness

… $S_{ij}$, to be the similarity between element $i$ of the first sequence and element $j$ of the second, we have that a common recursion defining an optimal alignment is:

Protein A | 1 | _ | 2 | 3 | 4 |
---|---|---|---|---|---|
Protein B | 1 | 2 | 3 | 4 | 5 |
Pair Value | 0.1 | 0.3 | 0.2 | 0.2 | 0.1 |
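A recursion of this kind can be sketched in code. The following is a minimal illustration only, assuming a minimizing objective and a single penalty ρ charged for every residue paired with a gap. The function name `align` and the matrix `S_example` are hypothetical: only the four matched entries (0.1, 0.2, 0.2, 0.1) come from the table above, and every other entry is set to an arbitrary 0.9 so that the tabulated alignment is recovered with ρ = 0.3.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]  # None marks a gap


def align(S: List[List[float]], rho: float) -> Tuple[float, List[Pair]]:
    """Minimize the sum of pair values plus rho per gapped residue."""
    n, m = len(S), len(S[0])
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Boundaries are built cumulatively so the traceback comparisons are exact.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + rho
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + rho
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = min(
                V[i - 1][j - 1] + S[i - 1][j - 1],  # pair residue i with j
                V[i - 1][j] + rho,                  # residue i against a gap
                V[i][j - 1] + rho,                  # residue j against a gap
            )
    # Trace back one optimal alignment from the final cell.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + S[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or V[i][j] == V[i - 1][j] + rho):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return V[n][m], pairs


# Hypothetical 4 x 5 matrix: matched entries from the table, 0.9 elsewhere.
S_example = [
    [0.1, 0.9, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.2, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.2, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.1],
]
total, pairing = align(S_example, rho=0.3)
print(total, pairing)
```

Indices in the returned pairing are zero-based, so `(None, 1)` corresponds to the gap placed against the second residue of Protein B.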

… $S_{ij}$. However, solutions are not necessarily sensitive to changes in these values. Indeed, a simple, albeit tedious, calculation shows that the example's unique solution remains optimal over the substantial range $0.15 < \rho < \infty$. Similar calculations show that each individual $S_{ij}$ may vary, with $\rho = 0.3$, over the intervals noted below:

… $\rho_{o}$ and $\rho_{c}$ be the penalties for initiating and continuing a gap, respectively.
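Separate opening and continuation penalties are usually handled with three coupled tables, one per alignment state. The sketch below uses this standard (Gotoh-style) formulation and is not taken from the EIGAs implementation. It assumes a minimizing objective and that a gap of length L costs ρ_o + (L − 1)ρ_c; the name `affine_cost` is hypothetical.

```python
import math
from typing import List


def affine_cost(S: List[List[float]], rho_o: float, rho_c: float) -> float:
    """Optimal (minimum) alignment value under an affine gap model."""
    n, m = len(S), len(S[0])
    inf = math.inf
    # M: residue i paired with residue j
    # X: residue i of the first chain against a gap
    # Y: residue j of the second chain against a gap
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = rho_o + (i - 1) * rho_c
    for j in range(1, m + 1):
        Y[0][j] = rho_o + (j - 1) * rho_c
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = S[i - 1][j - 1] + min(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            X[i][j] = min(
                M[i - 1][j] + rho_o,  # open a gap after a pairing
                Y[i - 1][j] + rho_o,  # open a gap after a gap in the other chain
                X[i - 1][j] + rho_c,  # continue the current gap
            )
            Y[i][j] = min(
                M[i][j - 1] + rho_o,
                X[i][j - 1] + rho_o,
                Y[i][j - 1] + rho_c,
            )
    return min(M[n][m], X[n][m], Y[n][m])


# One residue against three: pairing the first two residues and leaving a
# length-two gap costs 0.1 + (0.5 + 0.1).
print(affine_cost([[0.1, 0.9, 0.9]], rho_o=0.5, rho_c=0.1))
```

When ρ_o = ρ_c, this reduces to the single-penalty model, which gives a convenient sanity check.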

## 3. Eigen-Decomposition with the Spectrum

… $C_{\alpha}$, for each residue. We represent each residue by the coordinates of its $C_{\alpha}$ atom. An alignment between two protein chains is a pairing of residues between the proteins, which reduces to a pairing between the $C_{\alpha}$ atoms along the two backbones.

… $d_{kt}$ be the distance between residues $k$ and $t$ of a single protein. The smooth contact matrix for the protein is:

… $d_{kt}$. We note that $C_{kk} = 1$, since $d_{kk} = 0$. This property implies that $C$ is positive definite for appropriately selected $\kappa$ [37], and hence, $C$ can be factored as $C = UDU^{T}$, where $U^{T} = U^{-1}$ and $D$ is a diagonal matrix of positive eigenvalues. If we let $R=\sqrt{D}U^{T}$, then $C = R^{T}R$. The column vectors of $R$ are called intrinsic contact vectors, and they support a geometric perspective of the alignment problem [8].
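The factorization can be checked numerically with a short NumPy sketch. Note the assumption: the Gaussian contact function below is only a stand-in chosen because it also satisfies $C_{kk} = 1$ and is positive definite for distinct points; it is not claimed to be the smooth contact function used by EIGAs. The function names and the four coordinates are hypothetical.

```python
import numpy as np


def gaussian_contact_matrix(coords: np.ndarray, kappa: float) -> np.ndarray:
    """Stand-in contact matrix with C[k, k] = 1 (assumed form, not EIGAs')."""
    diffs = coords[:, None, :] - coords[None, :, :]
    d = np.linalg.norm(diffs, axis=2)  # pairwise distances d_kt
    return np.exp(-((d / kappa) ** 2))


def intrinsic_contact_vectors(C: np.ndarray):
    """Return the eigenvalues of C and R = sqrt(D) U^T, so that C = R^T R."""
    eigvals, U = np.linalg.eigh(C)  # symmetric eigen-decomposition, ascending
    if eigvals[0] <= 0.0:
        raise ValueError("C is not positive definite for this kappa")
    R = np.sqrt(eigvals)[:, None] * U.T  # scale row p of U^T by sqrt(lambda_p)
    return eigvals, R


# Hypothetical C-alpha coordinates in angstroms, for illustration only.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [3.8, 3.8, 0.0],
    [0.0, 3.8, 3.8],
])
C = gaussian_contact_matrix(coords, kappa=6.0)
eigvals, R = intrinsic_contact_vectors(C)
print(np.allclose(R.T @ R, C))
```

Column $i$ of `R` is the intrinsic contact vector of residue $i$, and row $p$ is associated with eigenvalue `eigvals[p]`.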

… $S_{ij} = |\lambda_{i} - \lambda_{j}| / |\lambda_{i} + \lambda_{j}|$, where $\lambda_{i}$ and $\lambda_{j}$ are the eigenvalues associated with the two residues. Specifically, residue $i$ is assigned $\lambda_{p}$ if $|R_{pi}| = \max_{q} |R_{qi}|$; see [8] for additional details. This eigenvalue assignment is not that of principal component analysis (PCA) or normal mode analysis (NMA), as it includes many smaller eigenvalues, which are typically more sensitive to deviations in the input data. As an example, the distribution of EIGAs' eigenvalue assignments for the Proteus300 dataset is tabulated below.
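The assignment rule and the normalized similarity can be illustrated as follows. This is a small sketch under the definitions just given; the function names are hypothetical, and the matrix `R_toy` is a made-up array used only to exercise the rule, not a true factor of any contact matrix.

```python
from typing import List, Sequence


def assign_eigenvalues(R: Sequence[Sequence[float]],
                       eigvals: Sequence[float]) -> List[float]:
    """Give residue i the eigenvalue lambda_p with |R[p][i]| = max_q |R[q][i]|."""
    n_rows, n_cols = len(R), len(R[0])
    assigned = []
    for i in range(n_cols):
        p = max(range(n_rows), key=lambda q: abs(R[q][i]))
        assigned.append(eigvals[p])
    return assigned


def similarity(lam_i: float, lam_j: float) -> float:
    """Normalized value |lam_i - lam_j| / |lam_i + lam_j| (positive inputs)."""
    return abs(lam_i - lam_j) / abs(lam_i + lam_j)


# Toy input: row p of R_toy is associated with toy_eigvals[p].
R_toy = [
    [0.9, 0.1, -0.2],
    [0.3, -0.8, 0.5],
]
toy_eigvals = [2.0, 0.5]
lams = assign_eigenvalues(R_toy, toy_eigvals)
print(lams, similarity(lams[0], lams[1]))
```

Because the eigenvalues of a positive definite matrix are positive, the value always lies in [0, 1), with 0 meaning the two residues were assigned the same eigenvalue.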

size (smallest to largest) | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|
assignments | 0.42% | 4.03% | 6.36% | 9.15% | 11.41% | 18.06% | 12.18% | 9.66% | 3.93% | 24.79% |

## 4. The Computational Setting

… $S_{ij} = |\lambda_{i} - \lambda_{j}|$. EIGAs now uses an affine gap model, and similarity values are normalized as $S_{ij} = |\lambda_{i} - \lambda_{j}| / |\lambda_{i} + \lambda_{j}|$, which ensures that the score is always between zero and one. Lastly, the previous benchmark had been a simple count of the proteins whose nearest neighbor shared the same family classification. Assessments are now made from a receiver operating characteristic (ROC) curve, which is an improved evaluative tool that parameterizes the trade-off between positive and negative results.

## 5. Robustness Against Parametric Variation

… $\rho_{o}$, and the gap continuation penalty, $\rho_{c}$. We incremented $\kappa$ from 4 to 23 Å with a step size of 1 Å and incremented each of the gap penalties from zero to one with a step size of 0.1. The 30 families of the Proteus300 dataset were randomly divided into three sub-groups of 10 families (100 proteins) each. We adapted a standard three-fold cross-validation test in an attempt to quantify how parametric tuning on a small collection of proteins might foreshadow success on a larger, and possibly disjoint, dataset. Instead of tuning on two of the three sub-groups and then validating on the third, parameters were tuned on a single sub-group and then validated on the remaining two. The tuning on each sub-group was over the space of all 20 × 10 × 10 = 2,000 triples $(\kappa, \rho_{o}, \rho_{c})$. Hence, each of the three tunings required $2{,}000 \times \binom{100}{2} = 9{,}900{,}000$ applications of DP, which, at 0.0041 seconds per solve, is about 11.5 hours.
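The size of the tuning experiment can be reproduced with a few lines of arithmetic. The sketch takes the 20 × 10 × 10 grid and the 0.0041-second solve time exactly as stated; the variable names are illustrative.

```python
from math import comb

n_triples = 20 * 10 * 10          # (kappa, rho_o, rho_c) combinations, as stated
n_pairs = comb(100, 2)            # pairwise alignments among 100 proteins
n_solves = n_triples * n_pairs    # DP applications per sub-group tuning
seconds_per_solve = 0.0041        # rounded per-solve time quoted in the text
hours = n_solves * seconds_per_solve / 3600.0

print(n_pairs, n_solves, round(hours, 2))
```

With the rounded per-solve time, the product comes to roughly 11.3 hours per sub-group, in line with the approximate figure quoted.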

… $(\kappa, \rho_{o}, \rho_{c})$, which is a parametric plot of the false positive rate (FPR) against the true positive rate (TPR) as the threshold deciding structural similarity varies. As an example, suppose the optimal alignments among five proteins have the optimal DP values in Table 1. The best (minimum) agreement is between proteins C and E, which are aligned by EIGAs' DP algorithm with a globally optimal value of 0.08. The worst agreement is between proteins A and E, which, even when optimally aligned by EIGAs, receive a score of 0.68. Let $\tau$ satisfy $0.08 \le \tau \le 0.68$, and for the moment, consider $\tau = 0.20$. Only three alignments have a value below $\tau$, those being between pairs (A,D), (C,E) and (D,E), but only the first two correctly pair proteins within the same family. There are a total of 10 possible pairs in this example, but only four correctly pair proteins of the same family. Therefore, for $\tau = 0.20$, the TPR is 2/4, i.e., we correctly identified two of the four pairings that share the same family. Since (D,E) was the only false positive out of six possible, the FPR for $\tau = 0.20$ is 1/6. Varying $\tau$ over the interval [0.08, 0.68] gives a parametric plot of FPR versus TPR. Note that TPR and FPR are both zero for $\tau \le 0.08$, since no alignment then has a value below $\tau$, and are both one once $\tau$ exceeds the largest value in Table 1.

 | Prot. A | Prot. B | Prot. C | Prot. D | Prot. E |
---|---|---|---|---|---|
Prot. A | 0.00 | 0.21 | 0.43 | 0.13 | 0.68 |
Prot. B | 0.21 | 0.00 | 0.61 | 0.26 | 0.34 |
Prot. C | 0.43 | 0.61 | 0.00 | 0.80 | 0.08 |
Prot. D | 0.13 | 0.26 | 0.80 | 0.00 | 0.19 |
Prot. E | 0.68 | 0.34 | 0.08 | 0.19 | 0.00 |
Family ID | 1 | 1 | 2 | 1 | 2 |
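The worked example for τ = 0.20 can be reproduced directly from Table 1. The sketch below counts a pair as a predicted match when its DP value is strictly below τ, as in the text. The area under the ROC curve is computed with the rank (Mann-Whitney) formulation, which is a choice made here for brevity and not necessarily how the authors computed it; the function names are hypothetical.

```python
from typing import Dict, Tuple

# Off-diagonal DP values and family labels copied from Table 1.
scores: Dict[Tuple[str, str], float] = {
    ("A", "B"): 0.21, ("A", "C"): 0.43, ("A", "D"): 0.13, ("A", "E"): 0.68,
    ("B", "C"): 0.61, ("B", "D"): 0.26, ("B", "E"): 0.34,
    ("C", "D"): 0.80, ("C", "E"): 0.08,
    ("D", "E"): 0.19,
}
family = {"A": 1, "B": 1, "C": 2, "D": 1, "E": 2}


def rates(tau: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when pairs with value below tau are called similar."""
    tp = fp = positives = negatives = 0
    for (p, q), value in scores.items():
        same_family = family[p] == family[q]
        predicted = value < tau
        if same_family:
            positives += 1
            tp += predicted
        else:
            negatives += 1
            fp += predicted
    return tp / positives, fp / negatives


def auroc() -> float:
    """Probability that a same-family pair scores below a mixed-family pair."""
    pos = [v for (p, q), v in scores.items() if family[p] == family[q]]
    neg = [v for (p, q), v in scores.items() if family[p] != family[q]]
    wins = sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


print(rates(0.20), auroc())
```

Evaluating `rates` over a sweep of thresholds traces out the ROC curve for this toy example.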

… $(\kappa, \rho_{o}, \rho_{c})$.

… $\rho_{o}$ and $\rho_{c}$, the best possible AUROC over the other two parameters is plotted. The sensitivity is greatest with respect to $\kappa$, and while AUROC exceeded 0.9 in all cases, the sharp rise to well over 0.95 as $\kappa$ increases demonstrates that $\kappa$ should not be small. The best values were $\kappa$ = 19, 21 and 16 over the three sub-groups for respective best AUROCs of 0.9767, 0.9959 and 0.9801. The alignments were nearly insensitive to $\rho_{o}$, as illustrated by the nearly constant graphs in Figure 2. Indeed, each $\rho_{o}$ yielded an AUROC exceeding 0.97. From Figure 3, we see that $\rho_{c}$ should not be selected too small, since, if so, the alignments do not accurately predict family classification. However, quality alignments with AUROCs over 0.95 are possible, as long as $0.1 \le \rho_{c} \le 0.9$. The best values of $\rho_{c}$ over the three sub-groups are 0.3, 0.3 and 0.2 with corresponding AUROCs of 0.9767, 0.9959 and 0.9801.

… $\rho_{o} \le 1$ and $0.1 \le \rho_{c} \le 0.9$. These parametric ranges are wide and account for over 67% of the tested parameters. We conclude that EIGAs is robust with respect to parametric tuning. This outcome suggests that it might be possible to tune parameters for general use on small datasets. To test this concept, we used the parameters with the best AUROC from each sub-group on the remaining proteins of Proteus300. The results are shown in Table 2, and they demonstrate that it is reasonable to assume that parameter tuning on smaller datasets will result in quality parameters for larger datasets. On average, AUROC degraded from 0.9842 to 0.9742 over the three sub-group tests, although AUROC increased on the first sub-group. The TPR and FPR values listed in Table 2 are those closest to perfect classification, and they show that over 90% of EIGAs' alignments correctly pair proteins of the same family, assuming that an appropriate threshold, $\tau$, is selected. Likewise, fewer than 9% of family associations predicted by EIGAs are incorrect.

**Figure 1.** The best area under the receiver operating characteristic curve (AUROC) over all $(\rho_{o}, \rho_{c})$ for each $\kappa$. Each of the three curves corresponds with one of the random sub-groups of the Proteus300 dataset.

**Figure 2.** The best AUROC over all $(\kappa, \rho_{c})$ for each $\rho_{o}$. Each of the three curves corresponds with one of the random sub-groups of the Proteus300 dataset.

… $\rho_{o} = 0.5$ and $\rho_{c} = 0.3$ will generally result in alignments of sufficient quality to identify a high percentage of proteins with the same family classification. Using these parameters on the entire Proteus300 dataset gives an AUROC of 0.9787 and a best TPR and FPR of 0.9267 and 0.0704. The ROC curve is shown in Figure 4. As a point of observation, the EIGAs scoring method consistently resulted in the best TPR and FPR for $0.12 \le \tau \le 0.15$. The best TPR and FPR for the entire Proteus300 dataset occurred at $\tau = 0.1458$.

**Table 2.** Parameters $\kappa$, $\rho_{o}$ and $\rho_{c}$ were tuned to give the best possible AUROC for three randomly selected sub-groups of Proteus300. The results of using these parameters on the remaining, disjoint set of proteins are listed in the last three columns. TPR, true positive rate; FPR, false positive rate.

 | Best tuned AUROC | $\kappa$ | $\rho_{o}$ | $\rho_{c}$ | AUROC on remaining proteins | TPR on remaining proteins | FPR on remaining proteins |
---|---|---|---|---|---|---|---|
Sub-Group 1 | 0.9767 | 19 | 0.3 | 0.3 | 0.9801 | 0.9389 | 0.0724 |
Sub-Group 2 | 0.9959 | 21 | 0.0 | 0.3 | 0.9696 | 0.9067 | 0.0836 |
Sub-Group 3 | 0.9801 | 16 | 0.7 | 0.2 | 0.9728 | 0.9100 | 0.0848 |
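The averages quoted in the discussion of Table 2 can be verified from the tabulated values. The following is a simple check; the variable names are illustrative.

```python
# AUROC values copied from Table 2, in sub-group order.
tuned_auroc = [0.9767, 0.9959, 0.9801]       # on the sub-group used for tuning
remaining_auroc = [0.9801, 0.9696, 0.9728]   # on the remaining proteins

mean_tuned = sum(tuned_auroc) / len(tuned_auroc)
mean_remaining = sum(remaining_auroc) / len(remaining_auroc)

# Sub-groups whose AUROC went up when moving to the remaining proteins.
improved = [k + 1 for k, (t, r) in enumerate(zip(tuned_auroc, remaining_auroc)) if r > t]

print(round(mean_tuned, 4), round(mean_remaining, 4), improved)
```

The means round to 0.9842 and 0.9742, and only the first sub-group shows an increase, matching the text.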

**Figure 3.** The best AUROC over all $(\kappa, \rho_{o})$ for each $\rho_{c}$. Each of the three curves corresponds with one of the random sub-groups of the Proteus300 dataset.

## 6. Robustness Against Coordinate Uncertainty

… $C_{\alpha}$ coordinates, since they decide the similarity values, $S_{ij}$, for a fixed value of $\kappa$. As far as we are aware, no computational evaluation has been proposed to measure an algorithm's ability to account for 3D variability, which is not surprising, since earlier alignment algorithms required lengthy computations. However, EIGAs' speed supports the repeated database-wide comparisons needed to study how it reacts as protein descriptions vary over experimental tolerances.

… $C_{\alpha}$ are random variables that can be modeled from either experiment. We then draw a sample dataset from these distributions and align them with EIGAs. Each run of EIGAs on a sample dataset is evaluated by AUROC, with $\kappa = 19$, $\rho_{o} = 0.5$ and $\rho_{c} = 0.3$, as determined in Section 5.

**Figure 4.** The receiver operating characteristic (ROC) curve for the entire Proteus300 dataset with $\kappa = 19$, $\rho_{o} = 0.5$ and $\rho_{c} = 0.3$.

… $C_{\alpha}$ coordinates over the different models and assume that each $C_{\alpha}$ coordinate is normally distributed as:

…_{i} is the mean of the $i$-th residue's $C_{\alpha}$ coordinates over the crystalline structure and $r_{i}$ is the random position of the atom, then the B-factor of the $i$-th residue is:

…^{2} is the variance. To assess EIGAs' sensitivity to coordinate perturbation, we assume that the coordinates of the $C_{\alpha}$ atoms are uncorrelated normals whose variances are scalar multiples of a third of ${\sigma}_{i}^{2}$. Hence, for each X-ray crystallography experiment, the coordinates of the $C_{\alpha}$ atoms are the random variables:
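Drawing one perturbed model under this assumption might look like the sketch below. Two assumptions are made loudly: first, the usual isotropic crystallographic convention $B = 8\pi^{2}\langle u^{2}\rangle$, so that the per-axis variance (a third of the total positional variance) is $B / (8\pi^{2})$; second, that the scale factor $s$ multiplies the variance and not the standard deviation. The function name and the example coordinates and B-factors are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def perturb_coordinates(means: Sequence[Coord],
                        b_factors: Sequence[float],
                        s: float,
                        rng: random.Random) -> List[Coord]:
    """Sample each C-alpha coordinate from an independent normal distribution."""
    sample = []
    for (x, y, z), b in zip(means, b_factors):
        # Assumed convention: per-axis variance = B / (8 * pi^2).
        per_axis_variance = b / (8.0 * math.pi ** 2)
        sd = math.sqrt(s * per_axis_variance)  # assumption: s scales the variance
        sample.append((rng.gauss(x, sd), rng.gauss(y, sd), rng.gauss(z, sd)))
    return sample


# Hypothetical three-residue fragment (angstroms) with hypothetical B-factors.
means = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.2, 3.5, 0.0)]
b_factors = [15.0, 22.5, 40.0]
print(perturb_coordinates(means, b_factors, s=1.0, rng=random.Random(0)))
```

Setting `s = 0` returns the mean structure unchanged, and larger values of `s` spread each atom further from its reported position.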

**Figure 5.** An unperturbed depiction of 1QMPa is shown on the left, and a sample from a random model of the same protein chain is shown in the center (s = 1). The image on the right is an overlay of all nuclear magnetic resonance (NMR) renderings of 1NTR.

**Figure 6.** On the left, AUROC (top) and the best TPR (middle) and FPR (bottom) are shown as s ranges from 0.1 to 1. On the right, AUROC (top) and the best TPR (middle) and FPR (bottom) are shown as s ranges from 0.1 to 100. In both graphs, 95% bootstrap confidence intervals are shown for all simulations.
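One common way to form bootstrap intervals of this kind is the percentile bootstrap of the mean. Whether the authors used this variant is not stated here, so the sketch below is an assumption, and the function name is hypothetical. Since the per-simulation AUROC values are not given in the text, the three AUROCs from Table 2 are reused purely to exercise the function.

```python
import random
from typing import Optional, Sequence, Tuple


def bootstrap_ci(values: Sequence[float],
                 n_boot: int = 2000,
                 alpha: float = 0.05,
                 rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    rng = rng or random.Random()
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int((alpha / 2.0) * n_boot)]
    upper = means[min(n_boot - 1, int((1.0 - alpha / 2.0) * n_boot))]
    return lower, upper


# Illustration only: AUROCs on the remaining proteins from Table 2.
aurocs = [0.9801, 0.9696, 0.9728]
print(bootstrap_ci(aurocs, rng=random.Random(0)))
```

With only three inputs the interval is of little statistical value; in a study of this kind the inputs would be the AUROCs of the repeated runs at a given value of s.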

## 7. Conclusions and Future Research

## Acknowledgments

## Conflicts of Interest

## References and Notes

1. Andonov, R.; Malod-Dognin, N.; Yanev, N. Maximum contact map overlap revisited. J. Comput. Biol. **2011**, 18, 27–41.
2. Andonov, R.; Yanev, N.; Malod-Dognin, N. An Efficient Lagrangian Relaxation for the Contact Map Overlap Problem. Proceedings of the 8th International Workshop on Algorithms in Bioinformatics, Karlsruhe, Germany, 15–19 September 2008; Springer-Verlag: Berlin/Heidelberg, Germany, 2008; pp. 162–173.
3. Hasegawa, H.; Holm, L. Advances and pitfalls of protein structural alignment. Curr. Opin. Struct. Biol. **2009**, 19, 341–348.
4. Li, S.C.; Ng, Y.K. On protein structure alignment under distance constraint. Theor. Comput. Sci. **2011**, 412, 4187–4199.
5. Menke, M.; Berger, B.; Cowen, L. Matt: Local flexibility aids protein multiple structure alignment. PLoS Comput. Biol. **2008**, 4, e10.
6. Poleksic, A. Algorithms for optimal protein structure alignment. Bioinformatics **2009**, 25, 2751–2756.
7. Prlic, A.; Bliven, S.; Rose, P.W.; Bluhm, W.F.; Bizon, C.; Godzik, A.; Bourne, P.E. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics **2010**, 26, 2983–2985.
8. Shibberu, Y.; Holder, A. A spectral approach to protein structure alignment. IEEE/ACM Trans. Comput. Biol. Bioinform. **2011**, 8, 867–875.
9. Bonnel, N.; Mareau, P. LNA: Fast Protein Classification Using A Laplacian Characterization of Tertiary Structure; Technical Report for IRISA: Vannes, France, 2012.
10. Kifer, I.; Nussinov, R.; Wolfson, H.J. GOSSIP: A method for fast and accurate global alignment of protein structure. Bioinformatics **2011**, 27, 925–932.
11. Bhattacharya, S.; Bhattacharyya, C.; Chandra, N.R. Projections for fast protein structure retrieval. BMC Bioinform. **2006**, 7 Suppl. 5, S5.
12. Lena, P.D.; Fariselli, P.; Margara, L.; Vassura, M.; Casadio, R. Fast overlapping of protein contact maps by alignment of eigenvectors. Bioinformatics **2010**, 26, 2250–2258.
13. Liu, W.; Srivastava, A.; Zhang, J. A mathematical framework for protein structure comparison. PLoS Comput. Biol. **2011**, 7, e1001075.
14. Lu, Z.; Zhao, Z.; Fu, B. Efficient protein alignment algorithm for protein search. BMC Bioinform. **2010**, 11 Suppl. 1, S34.
15. Mavridis, L.; Ritchie, D.W. 3d-blast: 3d protein structure alignment, comparison, and classification using spherical polar fourier correlations. Pac. Symp. Biocomput. **2010**.
16. Mirceva, G.; Cingovska, I.; Dimov, Z.; Davcev, D. Efficient approaches for retrieving protein tertiary structures. IEEE/ACM Trans. Comput. Biol. Bioinform. **2012**, 9, 1166–1179.
17. Novosád, T.; Snášel, V.; Abraham, A.; Yang, J.Y. Searching protein 3-D structures for optimal structure alignment using intelligent algorithms and data structures. IEEE Trans. Inf. Technol. Biomed. **2010**, 14, 1378–1386.
18. Poleksic, A. Optimal pairwise alignment of fixed protein structures in subquadratic time. J. Bioinform. Comput. Biol. **2011**, 9, 367–382.
19. Shibuya, T.; Jansson, J.; Sadakane, K. Linear-time protein 3-D structure searching with insertions and deletions. Algorithms Mol. Biol. **2010**, 5, 7.
20. Holm, L.; Sander, C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. **1993**, 233, 123–138.
21. Shindyalov, I.N.; Bourne, P.E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. **1998**, 11, 739–747.
22. Zhang, Y.; Skolnick, J. TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. **2005**, 33, 2302–2309.
23. Orengo, C.A.; Taylor, W.R. SSAP: Sequential structure alignment program for protein structure comparison. Meth. Enzymol. **1996**, 266, 617–635.
24. Pang, B.; Zhao, N.; Becchi, M.; Korkin, D.; Shyu, C.R. Accelerating large-scale protein structure alignments with graphics processing units. BMC Res. Notes **2012**, 5, 116.
25. Holm, L.; Kääriäinen, S.; Rosenström, P.; Schenkel, A. Searching protein structure databases with DaliLite v.3. Bioinformatics **2008**, 24, 2780–2781.
26. Pascual-García, A.; Abia, D.; Ortiz, A.R.; Bastolla, U. Cross-over between discrete and continuous protein structure space: Insights into automatic classification and networks of protein structures. PLoS Comput. Biol. **2009**, 5, e1000331.
27. Redfern, O.C.; Harrison, A.; Dallman, T.; Pearl, F.M.G.; Orengo, C.A. CATHEDRAL: A fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput. Biol. **2007**, 3, e232.
28. Budowski-Tal, I.; Nov, Y.; Kolodny, R. FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. Proc. Natl. Acad. Sci. USA **2010**, 107, 3481–3486.
29. Bellman, R. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957.
30. Needleman, S.B.; Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. **1970**, 48, 443–453.
31. Goonesekere, N.C.W.; Lee, B. Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function. Nucleic Acids Res. **2004**, 32, 2838–2843.
32. Madhusudhan, M.S.; Marti-Renom, M.A.; Sanchez, R.; Sali, A. Variable gap penalty for protein sequence-structure alignment. Protein Eng. Des. Sel. **2006**, 19, 129–133.
33. Chen, L.; Wu, L.Y.; Wang, Y.; Zhang, S.; Zhang, X.S. Revealing divergent evolution, identifying circular permutations and detecting active-sites by protein structure comparison. BMC Struct. Biol. **2006**, 6, 18.
34. Salem, S.; Zaki, M.J.; Bystroff, C. FlexSnap: Flexible non-sequential protein structure alignment. Algorithms Mol. Biol. **2010**, 5, 12.
35. Schmidt-Goenner, T.; Guerler, A.; Kolbeck, B.; Knapp, E.W. Circular permuted proteins in the universe of protein folds. Proteins **2010**, 78, 1618–1630.
36. Poleksic, A. On complexity of protein structure alignment problem under distance constraint. IEEE/ACM Trans. Comput. Biol. Bioinform. **2012**, 9, 511–516.
37. Shibberu, Y.; Holder, A.; Lutz, K. Fast Protein Structure Alignment; LNCS (LNBI); Springer-Verlag: Berlin/Heidelberg, Germany, 2010; Volume 6053, pp. 152–165.
38. Homepage of SWIG. Available online: www.swig.org (accessed on 22 October 2013).
39. Cock, P.J.A.; Antao, T.; Chang, J.T.; Chapman, B.A.; Cox, C.J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; de Hoon, M.J.L. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics **2009**, 25, 1422–1423.
40. Andreeva, A.; Howorth, D.; Chandonia, J.M.; Brenner, S.E.; Hubbard, T.J.P.; Chothia, C.; Murzin, A.G. Data growth and its impact on the SCOP database: New developments. Nucleic Acids Res. **2008**, 36, D419–D425.
41. Andreeva, A.; Murzin, A.G. Structural classification of proteins and structural genomics: New insights into protein folding and evolution. Acta Crystallogr. Sect. F Struct. Biol. Cryst. Commun. **2010**, 66, 1190–1197.
42. Conte, L.L.; Ailey, B.; Hubbard, T.J.; Brenner, S.E.; Murzin, A.G.; Chothia, C. SCOP: A structural classification of proteins database. Nucleic Acids Res. **2000**, 28, 257–259.
43. Ritchie, D.W.; Ghoorah, A.W.; Mavridis, L.; Venkatraman, V. Fast protein structure alignment using Gaussian overlap scoring of backbone peptide fragment similarity. Bioinformatics **2012**, 28, 3274–3281.
44. DeLano, W.L. The PyMOL Molecular Graphics System; DeLano Scientific: San Carlos, CA, USA, 2002.
45. The chain d1fvqa, which is copper-transporting ATPase, has a single model and no B-factor. Hence, this chain was constant throughout the study.

© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Holder, A.; Simon, J.; Strauser, J.; Taylor, J.; Shibberu, Y.
Dynamic Programming Used to Align Protein Structures with a Spectrum Is Robust. *Biology* **2013**, *2*, 1296-1310.
https://doi.org/10.3390/biology2041296
