Article

deepBBQ: A Deep Learning Approach to the Protein Backbone Reconstruction

Faculty of Chemistry, University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland
*
Author to whom correspondence should be addressed.
Biomolecules 2024, 14(11), 1448; https://doi.org/10.3390/biom14111448
Submission received: 2 August 2024 / Revised: 10 October 2024 / Accepted: 1 November 2024 / Published: 14 November 2024
(This article belongs to the Special Issue Artificial Intelligence (AI) in Biomedicine)

Abstract

Coarse-grained models have provided researchers with greatly improved computational efficiency in modeling structures and dynamics of biomacromolecules, but, to be practically useful, they need fast and accurate methods for conversion back to the all-atom representation. Reconstruction of atomic details may also be required for some experimental methods, such as electron microscopy, which may provide Cα-only structures. In this contribution, we present a new method for recovery of all backbone atom positions from just the Cα coordinates. Our approach, called deepBBQ, uses a deep convolutional neural network to predict a single internal coordinate per peptide plate, based on geometric features of the Cα trace, and then recalculates the Cartesian coordinates under the assumption that the peptide plate atoms lie in the same plane. Extensive comparison with similar programs shows that our solution is accurate and cost-efficient. The deepBBQ program is available as part of the open-source bioinformatics toolkit BioShell; it is free to download, and its documentation is available online.

1. Introduction

With the concept of coarse-grained molecular modeling introduced in the 1970s [1], molecular simulations became significantly more efficient. With this achievement, however, the need to recover the all-atom structure arose. Since the 1980s [2], multiple approaches have been proposed to the problem of recalculating atomic details from a coarse-grained representation of a protein conformation, typically from just the Cα positions. This task is generally solved in two steps. First, the Cartesian coordinates of all backbone atoms are calculated. Then, amino acid side chains are reconstructed based on the backbone conformation.
For the past few decades, many algorithms have been devised to solve the backbone reconstruction problem. Among these approaches, one can find analytical solutions [3], the Dead End Elimination algorithm [4], a Dynamic Programming method [5], Deep Machine Learning [6], energy minimisation [2,7,8], Gaussian Mixture Models [9] and prediction of Φ, Ψ dihedral angles [10]. In some methods, the reconstruction process is followed by energy optimization to improve the result. A distinct category of algorithms utilizes fragment libraries [11,12,13,14] derived from known structures to locate possible structures that do not violate a specified Cα trace. The most favorable fragments to construct the entire backbone are selected using energy-based [15], homology-based [16], or geometric [5,17,18] criteria, or a combination of them [19]. Another prevalent approach relies on the observation that the local internal geometry of a Cα trace typically determines the location of the remaining atoms of a protein backbone. This is especially true for regular conformations, i.e., helices and sheets, which are stabilised by a network of hydrogen bonds. In this approach, first introduced by Milik et al. [20], the internal geometry of the four Cα atoms of a tetrapeptide is uniquely described by three degrees of freedom; traditionally, the $r_{i,i+2}$, $r_{i+1,i+3}$ and $r_{i,i+3}$ distances are utilised. A three-dimensional grid is constructed based on these three internal distances, so that each element of the grid aggregates tetrapeptides that are structurally similar to each other. Local Cartesian coordinates of the N, C and O backbone atoms are calculated in a Local Coordinate System (LCS) and averaged separately for each element of the grid. The main advantage of this method is the simplicity of its application.
To reconstruct a protein backbone, one has to calculate the $r_{i,i+2}$, $r_{i+1,i+3}$ and $r_{i,i+3}$ internal distances and then retrieve the local N, C and O Cartesian coordinates from the correct bin of the grid. In the last step, these local coordinates are transformed to the global coordinate system. Due to its ease of implementation and computational efficiency, Milik's approach has been prevalent in the field and has been implemented by popular backbone reconstruction methods such as BBQ (Backbone Building from Quadrilaterals) [21], Pulchra [22] and REMO [23].
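The grid lookup described above can be illustrated with a minimal Python sketch. It only shows how the three internal distances of a tetrapeptide could be turned into a grid index; the function name `quadrilateral_key` and the `bin_width` parameter are hypothetical and are not taken from BBQ, Pulchra or REMO, whose actual binning schemes may differ.

```python
import numpy as np

def quadrilateral_key(ca, i, bin_width=0.5):
    """Return a 3D grid index for the tetrapeptide starting at residue i.

    ca        : (K, 3) array-like of C-alpha coordinates in Angstroms
    bin_width : hypothetical grid spacing in Angstroms (an assumption)
    """
    ca = np.asarray(ca, dtype=float)
    r_i_i2 = np.linalg.norm(ca[i + 2] - ca[i])        # r_{i,i+2}
    r_i1_i3 = np.linalg.norm(ca[i + 3] - ca[i + 1])   # r_{i+1,i+3}
    r_i_i3 = np.linalg.norm(ca[i + 3] - ca[i])        # r_{i,i+3}
    # Tetrapeptides with similar internal distances share the same index.
    return tuple(int(d // bin_width) for d in (r_i_i2, r_i1_i3, r_i_i3))
```

In such a scheme, the returned tuple would serve as a key into a table of averaged local N, C and O coordinates, which are then transformed to the global coordinate system.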
Since the original publication of the BBQ software, it has been extensively used in various modeling scenarios. The experience we have gained over these years has shown that a significant weakness of the algorithm results directly from its design. Besides alpha carbons (given as the program input), there are three other heavy backbone atoms: the carbonyl carbon, the carbonyl oxygen and the amide nitrogen. The reconstruction process assumes that their three Cartesian coordinates are independent variables and thus treats the problem as 9-dimensional. However, this assumption is incorrect and may result in stereochemical errors such as incorrect bond lengths and planar angles far from their equilibrium values. In addition, the BBQ method makes incorrect predictions for conformations rarely observed in the PDB, such as loops. This contribution presents a novel protein backbone reconstruction method that alleviates most of these problems. We assume that the peptide plate atoms ($C_i$, $O_i$, $N_{i+1}$ and optionally the amide hydrogen of the $(i+1)$-th residue) do, in fact, always lie exactly in the same plane; in other words, that the ω dihedral angle assumes either −180 or 180 degrees. Under this assumption, the 9-dimensional problem can be reduced to just one dimension. In the new approach, the only degree of freedom that has to be established for each amino acid residue is the dihedral angle between a peptide plate and a reference plane. In this work, we follow the convention of Purisima et al. [2], where the $\lambda_i$ dihedral angle is defined for the i-th residue as the angle between two planes: the peptide plate between $C^{\alpha}_{i-1}$ and $C^{\alpha}_{i}$, and the plane defined by $C^{\alpha}_{i-1}$, $C^{\alpha}_{i}$ and $C^{\alpha}_{i+1}$ (shown in Figure 1).
Knowing the Cα positions and the λ angle values, Cartesian coordinates of all backbone atoms can be easily recovered, making λ prediction a viable candidate for a backbone reconstruction method. In this contribution, which stems from our previous BBQ method, we engineered a deep neural network to predict these λ values; hence the name of the new program: deepBBQ.
This manuscript is organised as follows. Section 2 describes the algorithm in detail and discusses the architecture, training, and validation of the ML model used in this study. The convolutional neural network we devised takes several geometric features computed solely from Cα positions, together with the amino acid sequence, and predicts the λ angle introduced above. Section 3 provides a thorough test of the method and a comparison with existing approaches, conducted on standard benchmark sets used in the field. To avoid a situation where a protein in the test set is a homolog of an element of the training set, we also provide a test on de novo designed proteins. These tests show the deepBBQ method to be superior to all traditional (i.e., not ML-based) algorithms. Finally, in Section 4 we summarise our findings and provide prospects for future development of backbone reconstruction algorithms.

2. Methods

2.1. Protein Backbone Reconstruction

The deepBBQ reconstruction algorithm is quite simple. A peptide plate of ideal planar geometry is transformed from a local to the global coordinate system so that its $C^{\alpha}_{1}$–$C^{\alpha}_{2}$ vector overlays the $C^{\alpha}_{i}$–$C^{\alpha}_{i+1}$ axis. Then, the peptide plate is rotated around the $C^{\alpha}_{i}$–$C^{\alpha}_{i+1}$ axis by the $\lambda_i$ dihedral angle; the angle itself is predicted by a deep neural network based on the Cα trace geometry. The method uses idealised alanine geometry to reconstruct a peptide plate of any amino acid type except proline, for which an idealised proline peptide plate is employed. Moreover, alanine and proline peptide plates may be in either the cis or the trans conformation. Below, we describe the λ prediction network architecture and training. Validation and comparison between deepBBQ and other methods are given in Section 3.
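The placement step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BioShell implementation. The function name `place_plate_atoms`, the two-column layout of `plate_local` and the sign convention of the rotation are all hypothetical. The idealised plate coordinates themselves are not given in the text and must be supplied by the caller.

```python
import numpy as np

def place_plate_atoms(ca_a, ca_b, ca_ref, plate_local, lam):
    """Place planar peptide-plate atoms between two consecutive C-alphas.

    ca_a, ca_b  : the two C-alpha atoms spanning the plate (rotation axis)
    ca_ref      : third C-alpha defining the reference plane
    plate_local : (n, 2) array; column 0 is the offset along the axis,
                  column 1 the signed in-plane distance from the axis
                  (idealised values, assumed to be provided by the caller)
    lam         : rotation of the plate about the axis, in radians,
                  measured from the reference plane (sign is an assumption)
    """
    ca_a, ca_b, ca_ref = (np.asarray(p, dtype=float) for p in (ca_a, ca_b, ca_ref))
    plate_local = np.asarray(plate_local, dtype=float)

    # Local frame: x along the axis, y in the reference plane, z normal to it.
    x = ca_b - ca_a
    x /= np.linalg.norm(x)
    v = ca_ref - ca_a
    y = v - np.dot(v, x) * x
    y /= np.linalg.norm(y)
    z = np.cross(x, y)

    # Rotating the in-plane direction about x by lam keeps the plate flat.
    y_rot = np.cos(lam) * y + np.sin(lam) * z
    return ca_a + np.outer(plate_local[:, 0], x) + np.outer(plate_local[:, 1], y_rot)
```

Because every plate atom is expressed as an axis offset plus a multiple of one rotated direction, the reconstructed atoms stay coplanar with the two Cα atoms for any value of λ.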

2.2. deepBBQ Neural Network

We devised a deep convolutional neural network to obtain λ i values. The prediction is based on 37 values per residue as input. For a given residue i, the input features are:
(a)
21 binary values for one-hot-encoded residue type representation, corresponding to 20 canonical amino acids and the 'X' symbol representing unknown residue types
(b)
6 floating point values corresponding to distances between $C^{\alpha}_{i}$ and $C^{\alpha}_{i \pm j}$ for $j \in \{3, 4, 5\}$
(c)
4 integer values for the number of Cα atoms present within 4, 4.5, 5 and 6 Å from $C^{\alpha}_{i}$
(d)
3 binary values for one-hot-encoded secondary structure information (either helix, sheet or loop)
(e)
One binary value as a flag for the cis/trans classification of the peptide bond between amino acids i and ( i + 1 )
(f)
2 integer values corresponding to the number of hydrogen bonds involved in helices and strands
All these features are obtained from the input Cα atom coordinates and have been used for the past few decades to represent a protein structure at a coarse-grained level [24]. Local distances along a Cα trace (b) have traditionally been employed to distinguish between compact (helical-like) and extended local conformations. The number of spatial neighbors (c) indicates regions of dense packing, where backbone conformations are more rigid. The HECA [25] (H-E-C Assigner) method assigns the secondary structure classification (d) for each amino acid residue. The number of hydrogen bonds (f) is computed for each Cα atom according to the coarse-grained H-bond potential [26]. Due to its mean-field design, this scoring function detects H-bonds only in regular secondary structure elements. We believe this feature allows the network to differentiate between extended loops and strand conformations, which are hard to distinguish based on any other feature we use. Finally, the cis/trans classification of a peptide bond (e) is based on the $C^{\alpha}_{i}$–$C^{\alpha}_{i+1}$ distance: pseudo-bonds shorter than 3.5 Å are classified as cis [27,28].
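The purely geometric features (b), (c) and (e) can be sketched directly from a Cα trace. The Python example below is illustrative only. The function name `ca_geometric_features`, the zero padding at chain termini, the exclusion of the central atom from the neighbor counts and the use of an inclusive cutoff are assumptions, since the text does not specify how deepBBQ handles these details.

```python
import numpy as np

def ca_geometric_features(ca):
    """Compute features (b), (c) and (e) from a (K, 3) C-alpha trace.

    Returns:
      local     : (K, 6) distances to residues i-5, i-4, i-3, i+3, i+4, i+5
                  (zero where the partner residue does not exist: assumption)
      neighbors : (K, 4) counts of other C-alphas within 4, 4.5, 5 and 6 A
      cis       : (K,) flag, 1 if the C-alpha(i)-C-alpha(i+1) distance < 3.5 A
    """
    ca = np.asarray(ca, dtype=float)
    k = len(ca)
    # Full pairwise distance matrix between C-alpha atoms.
    dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)

    local = np.zeros((k, 6))
    for col, j in enumerate((-5, -4, -3, 3, 4, 5)):
        for i in range(k):
            if 0 <= i + j < k:
                local[i, col] = dist[i, i + j]

    cutoffs = (4.0, 4.5, 5.0, 6.0)
    # Subtract 1 so that an atom does not count itself as a neighbor.
    neighbors = np.stack([(dist <= c).sum(axis=1) - 1 for c in cutoffs], axis=1)

    cis = np.zeros(k, dtype=int)
    cis[:-1] = np.diagonal(dist, offset=1) < 3.5
    return local, neighbors, cis
```

Together with the 21 residue-type values, the 3 secondary structure values and the 2 hydrogen bond counts, these 6 + 4 + 1 values add up to the 37 input features per residue.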
To reconstruct the backbone of a protein structure of K residues, the user must provide the Cartesian coordinates of K Cα atoms. Based on this input, the deepBBQ program calculates a matrix of 37 × K input features. The neural network consists of five one-dimensional convolutional layers connected sequentially. The first four layers use 1024 kernels of sizes 11, 9, 5 and 3, respectively, and the last layer consists of two kernels of size one, providing the two output values corresponding to the sine and cosine of the λ angle of the given residue. To train the network, we used the mean squared error loss function based on these two values. While a single parameter λ per residue is required for coordinate reconstruction, the periodic nature of a dihedral angle imposes considerable difficulties in devising a loss function. To remedy this, our neural network predicts the values of $\sin(\lambda)$ and $\cos(\lambda)$ instead, and the value of λ in the range $[-\pi, \pi]$ is then calculated as $\arctan\frac{\sin\lambda}{\cos\lambda}$. The network architecture is depicted in Figure 2.
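The benefit of the sine/cosine encoding can be shown with a short numerical sketch. The helper names below are hypothetical. The quadrant-aware `numpy.arctan2` is used here as one way to recover an angle over the full $[-\pi, \pi]$ range; whether deepBBQ evaluates the arctangent in exactly this way is an assumption.

```python
import numpy as np

def encode_angle(lam):
    """Map angles (radians) to (sin, cos) pairs, the network's two outputs."""
    lam = np.asarray(lam, dtype=float)
    return np.stack([np.sin(lam), np.cos(lam)], axis=-1)

def decode_angle(sin_cos):
    """Recover angles in [-pi, pi] from (sin, cos) pairs.

    arctan2 uses the signs of both arguments, so it also works when the
    predicted pair is not exactly normalised to unit length.
    """
    sin_cos = np.asarray(sin_cos, dtype=float)
    return np.arctan2(sin_cos[..., 0], sin_cos[..., 1])

def mse_loss(pred, target):
    """Mean squared error over the (sin, cos) outputs."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

def angular_error(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)
```

For example, the angles 3.1 and −3.1 rad differ by 6.2 as raw numbers but by less than 0.1 rad on the circle; the loss computed on their (sin, cos) pairs is correspondingly small, which is the behaviour a loss defined directly on λ would lack.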

2.3. Training Data and Tools

We used a non-redundant protein structure dataset provided by the PISCES server [29,30] with sequences culled at 40% identity and resolution of 1.6 Å, containing 6695 proteins. The dataset was filtered using the BioShell [31] software. We removed proteins with incomplete or incorrect fragments, e.g., with missing residues, missing backbone atoms, or significant stereochemical errors. After the filtering, 6396 chains remained and were used to train the network. The training was performed in Python using the TensorFlow library [32]. Finally, the method was implemented in C++ as part of the BioShell [31] suite, with frugally-deep [33], a simple header-only library, providing an interface to the neural network.

2.4. Testing Set

The set of protein structures used for testing was compiled by selecting one remote homolog for each structure of the training set. We used the Jackhmmer [34] program of the HMMER package to search through protein sequences from the PDB database [35] using each amino acid sequence from the training set as a query. We attempted to select a sequence with an e-value close to $10^{-7}$ for each query. This was not possible for every query, since some PDB deposits have no homologous structures in this database, or the protein sequences are too similar to one another (e.g., point mutants). The e-value range we assumed as a selection criterion was manually adjusted to provide us with remote but still homologous hits, typically at the edge of detectability by sequence identity (below 40%). We subsequently filtered this set by removing close homologs that had inadvertently entered the test set, as well as structures with missing residues, alternative locations and other structural errors, as detected by the BioShell package. Finally, we obtained a test set comprising 2882 protein chains. The complete list is provided in the Supplementary Materials. To benchmark the new method presented in this study, we isolated Cα coordinates from PDB files and ran deepBBQ on such input files.

2.5. De Novo Testing Set

To ensure our tests have not been biased by homolog contamination, we decided to compile a de novo testing set. We collected all the de novo designed protein structures that are currently available in the PDB database. We found around 1500 CIF files that have been classified as De Novo. However, many of these contain only a designed peptide bound to a natural protein. Therefore, we restricted the set by selecting only chains at least 30 amino acids long and chose deposits of resolution 2.5 Å or better. Subsequently, we removed any chain that shared 30% or more sequence identity with any protein from the training set, which might have happened accidentally. Finally, we used the clust program of the BioShell package to cluster the set of de novo proteins, with the distance between two given sequences defined as:
$$d(\mathrm{sequence}_1, \mathrm{sequence}_2) = 1 - \mathrm{sequence\ identity}(\mathrm{sequence}_1, \mathrm{sequence}_2)$$
The program performed hierarchical agglomerative clustering with the single-linkage rule, producing 105 clusters selected at a 30% sequence identity level. This means that any protein sequence from a cluster shares less than 30% sequence identity with any protein sequence that belongs to another cluster. The final set of proteins therefore consists of 105 sequences, one per cluster.
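Cutting a single-linkage dendrogram at a fixed distance is equivalent to finding connected components of a graph in which two sequences are linked whenever their distance is below the cutoff. The Python sketch below illustrates this with a union-find structure. It is not the BioShell clust program; the function name and the assumption that a precomputed pairwise identity matrix is available (values between 0 and 1) are hypothetical.

```python
def single_linkage_clusters(identity, threshold=0.30):
    """Group sequences so that members of different clusters always
    share less than `threshold` sequence identity.

    identity : square matrix (list of lists) of pairwise sequence
               identities in [0, 1]; assumed to be precomputed
    Returns a sorted list of clusters, each a list of sequence indices.
    """
    n = len(identity)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # Distance d = 1 - identity; link pairs with d <= 1 - threshold.
            if identity[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Picking one member from each returned cluster then yields a representative set of the kind described above. Note that, with single linkage, two members of the same cluster may still share less than 30% identity if they are connected through intermediate sequences.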

3. Results

The algorithm presented in this contribution was able to accurately and efficiently rebuild full-atom protein backbone conformations from the respective Cα traces. The average reconstruction error, i.e., the coordinate root mean square deviation (crmsd) value measured over all heavy backbone atoms for each test protein, was 0.19 ± 0.32 Å, and the most probable value (mode) was 0.045 Å. A histogram of these error values collected for the test set is shown in Figure 3a. We have also investigated how the reconstruction error depends on the secondary structure type (see Figure 3c and Table 1). It is clear that residues involved in helices are reconstructed visibly more accurately than those in strands and loops. This can be attributed to the fact that the orientation of a peptide plate in helical conformations is very well-defined, which makes the prediction of the respective λ values much easier for the network. As shown in Figure 4, both the distribution of λ and its reconstruction error are narrower for helices than for the other structure elements. Performance for β-strands is worse, which is explained by the larger flexibility of the extended secondary structure, resulting in a broader peak on the respective histogram of λ values (cf. Figure 4a). Unsurprisingly, the results are the worst for loops, which are the least organized. In their case, the λ distribution is bimodal, since loops in proteins can adopt both helical-like and extended conformations. Figure 3b and Table 1 present the reconstruction error separately for each atom type. These results show that carbonyl oxygen atoms are the most prone to reconstruction errors. These atoms are the furthest away from the axis around which each peptide plate is rotated. Therefore, any inaccuracy in the predicted λ angles introduces the most significant error in Cartesian space for these atoms. Another important source of error is the high mobility of residues located near the chain termini and chain breaks.
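Both the error measure and the argument about carbonyl oxygen atoms can be made concrete with a short sketch. The code below assumes that crmsd is computed directly on matched atoms without prior superposition, which seems natural here because the Cα atoms are kept in place, but this is an assumption. The function names are hypothetical.

```python
import numpy as np

def crmsd(reference, rebuilt):
    """Coordinate RMSD between two matched (n, 3) sets of atom positions,
    computed without superposition (an assumption, see the lead-in)."""
    reference = np.asarray(reference, dtype=float)
    rebuilt = np.asarray(rebuilt, dtype=float)
    return float(np.sqrt(np.mean(np.sum((reference - rebuilt) ** 2, axis=1))))

def displacement_from_angle_error(distance_from_axis, delta_lambda):
    """Cartesian displacement of an atom rotated about an axis by an
    angular error delta_lambda (radians): the chord 2 * d * |sin(dl / 2)|."""
    return 2.0 * distance_from_axis * abs(np.sin(delta_lambda / 2.0))
```

The chord length grows linearly with the distance of the atom from the rotation axis, so for the same error in λ an atom twice as far from the axis is displaced twice as much.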
Overall, the deepBBQ algorithm outperforms most competing approaches (see Section 3.1). Thanks to its simple implementation in C++, it is also computationally efficient. An example of a deepBBQ reconstruction of the 4Y6W PDB deposit, superimposed on the experimental structure, is shown in Figure 5. The zoomed-in view in the figure shows that the backbone reconstructed for the helical conformation is nearly identical to the original one; loops and strands are, therefore, the main source of reconstruction inaccuracy.

3.1. Comparison with Other Methods

To compare deepBBQ with other approaches, we ran the program on the test set from Moore et al. [9]. We decided to rely on this benchmark because it covers the most exhaustive set of reconstruction tools we could find in the literature. Some of these methods are no longer available online, and repeating such a study would pose serious technical difficulties. Therefore, we ran our calculations only with deepBBQ and cg2all [6], which, according to its authors, is the best-performing algorithm. This resulted in two additional columns added to the values reported by Moore et al. [9]; the results are shown in Table 2 and Figure 6.
Indeed, deepBBQ outperforms most of the other methods; the only program that obtains better accuracy is cg2all, which, in addition to the backbone atoms, also rebuilds protein side chains. It uses, however, a much more complex machine learning model, which makes it significantly slower. While deepBBQ uses the frugally-deep library, which does not implement parallelism directly, one can run multiple program instances simultaneously for different input files. Running deepBBQ this way for the protein test set from Table 2 took 10.31 s, while cg2all took 189.35 s, i.e., roughly 18 times longer (both calculations were run on the same Intel Xeon E5649 CPU). Even when run on a single core, deepBBQ is faster than cg2all utilising the entire processor.

3.2. Reconstruction of De Novo Proteins

We devised a testing set comprising 105 de novo designed structures to further investigate the reconstruction accuracy. Such human-made proteins, by definition, have no homologs among naturally occurring proteins. Moreover, we ensured that each test protein from this set was at most 30% identical to any protein from the training set. The results are shown in Figure 7, which provides a histogram of the reconstruction error in Å for both the deepBBQ and cg2all methods.
To our surprise, deepBBQ performed significantly better in this test: the mean reconstruction error was 0.14 ± 0.27 Å and the most probable value (mode) was 0.045 Å (the values reported above for the first test set were 0.19 ± 0.32 Å and 0.045 Å, respectively). We explain this result by the fact that de novo designed proteins are mostly α-helical. It is much easier to design an α-bundle than a β-barrel, which has resulted in the high popularity of α-bundles in the field of protein design. This, however, introduced a considerable bias to our de novo test.

3.3. Accuracy of the Two Steps of the Reconstruction Process

As described in Section 2, the backbone reconstruction process consists of two stages: λ prediction and conversion to Cartesian coordinates, and an error can be introduced at either of them. Errors in the first step arise from imperfect λ prediction, while errors in the second step stem primarily from the assumption that a peptide plate is flat and that its bond angles equal their idealised values. To find out how much of the reconstruction error can be attributed to each of these two factors, we considered the case in which λ values are predicted exactly (i.e., with no error). For this comparison, we used a set of 43 test proteins; for each of these, we calculated the real values of λ based on the actual experimental structure. Then we ran the backbone Cartesian coordinate reconstruction using these true (experimental) λ angles and calculated crmsd. For comparison, we also reconstructed the backbone atoms of each of the 43 test cases as described above, i.e., based on λ values predicted by the deepBBQ neural network. A comparison between these two sets of results is given in Table 3. As expected, the reconstruction error for the true λ values is significantly lower, suggesting that the error is primarily caused by the imperfect λ prediction rather than by the geometry assumptions present in the second step.
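Extracting a "true" λ from an experimental structure amounts to measuring a dihedral angle about the axis joining two consecutive Cα atoms. The sketch below shows one way to do this in Python. The choice of which plate atom represents the peptide plate and the sign convention of the angle are assumptions; the convention of Purisima et al. [2] used by deepBBQ may differ in these details.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians, in [-pi, pi]) between the plane
    (p0, p1, p2) and the plane (p1, p2, p3), about the p1 -> p2 axis."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    axis = p2 - p1
    axis /= np.linalg.norm(axis)
    # Components of the two outer vectors perpendicular to the axis.
    v = (p0 - p1) - np.dot(p0 - p1, axis) * axis
    w = (p3 - p2) - np.dot(p3 - p2, axis) * axis
    return float(np.arctan2(np.dot(np.cross(axis, v), w), np.dot(v, w)))

def true_lambda(ca_prev, ca_curr, ca_next, plate_atom):
    """Angle between the C-alpha plane (ca_prev, ca_curr, ca_next) and the
    peptide plate spanned by ca_prev, ca_curr and one plate atom
    (e.g. the carbonyl oxygen; which atom to use is an assumption)."""
    return dihedral(ca_next, ca_prev, ca_curr, plate_atom)
```

Feeding such experimentally derived angles into the coordinate reconstruction isolates the error caused by the planarity and ideal-geometry assumptions from the error of the network itself.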

4. Discussion

A few important conclusions can be drawn from the analysis of the results. First, machine learning approaches, represented here by cg2all and deepBBQ, provide much better accuracy than traditional solutions, such as BBQ, PULCHRA, and the others shown in Figure 6. Second, one can obtain better results within the machine learning framework by increasing the network size and complexity. Indeed, the two ML-based approaches compared in this study differ significantly in the input features they utilize. Both methods rely on the local Cα trace geometry; cg2all, however, also includes an explicit map of spatial neighbors, i.e., a contact map computed from Cα positions. This type of 2D information certainly gives cg2all an advantage over deepBBQ, which works solely on a 1D protein structure representation. The two machine learning models also differ in their architecture. While deepBBQ utilizes a simple convolutional neural network, cg2all includes a graph neural network with attention layers. The higher accuracy, however, comes at the cost of increased computation time, as shown by the comparison described in this article. Finally, the reconstruction results obtained for the true λ values (Table 3) show that the assumption about the planarity of a peptide plate is quite realistic and does not introduce significant reconstruction errors on its own. The average prediction error of λ angles from the neural network presented in this study was 0.14 radians (approximately 8°). However, the distribution of these values is quite broad, especially in loops and β-strands. We believe the accuracy of the deepBBQ approach can be further improved if λ values are predicted more accurately. Given the relatively long tail of the λ error distribution, this goal seems feasible.
In summary, deepBBQ provides a competitive solution for the reconstruction of protein backbone coordinates, combining high accuracy with good performance. It can be used as part of the BioShell package, but it is also compiled as a standalone program with a convenient command-line interface. The source code and the documentation have been made publicly available.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biom14111448/s1, S1.txt file provides the de novo test set, comprising de novo designed proteins, used in the final benchmark.

Author Contributions

Conceptualization, D.G.; Software, J.D.K., M.G. and D.G.; Validation, M.G., J.D.K. and P.Ś.; Writing—original draft, J.D.K., M.G., P.Ś. and D.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The source code is available at https://bitbucket.org/dgront/bioshell (accessed on 10 October 2024). The website https://bioshell.readthedocs.io (accessed on 10 October 2024) contains full reference documentation.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Levitt, M.; Warshel, A. Computer Simulation of Protein Folding. Nature 1975, 253, 694–698.
  2. Purisima, E.O.; Scheraga, H.A. Conversion from a Virtual-bond Chain to a Complete Polypeptide Backbone Chain. Biopolymers 1984, 23, 1207–1224.
  3. Lubecka, E.A.; Liwo, A. ESCASA: Analytical estimation of atomic coordinates from coarse-grained geometry for nuclear-magnetic-resonance-assisted protein structure modeling. I. Backbone and Hβ Protons. J. Comput. Chem. 2021, 42, 1579–1589.
  4. Adcock, S.A. Peptide Backbone Reconstruction Using Dead-End Elimination and a Knowledge-Based Forcefield. J. Comput. Chem. 2004, 25, 16–27.
  5. Holm, L.; Sander, C. Database Algorithm for Generating Protein Backbone and Side-Chain Co-Ordinates from a Cα Trace. J. Mol. Biol. 1991, 218, 183–194.
  6. Heo, L.; Feig, M. One Bead per Residue Can Describe All-Atom Protein Structures. Structure 2024, 32, 97–111.e6.
  7. Payne, P.W. Reconstruction of Protein Conformations from Estimated Positions of the Cα Coordinates. Protein Sci. 1993, 2, 315–324.
  8. Kaźmierkiewicz, R.; Liwo, A.; Scheraga, H.A. Energy-based Reconstruction of a Protein Backbone from Its α-carbon Trace by a Monte-Carlo Method. J. Comput. Chem. 2002, 23, 715–723.
  9. Moore, B.L.; Kelley, L.A.; Barber, J.; Murray, J.W.; MacDonald, J.T. High-Quality Protein Backbone Reconstruction from Alpha Carbons Using Gaussian Mixture Models. J. Comput. Chem. 2013, 34, 1881–1889.
  10. Iwata, Y.; Kasuya, A.; Miyamoto, S. An Efficient Method for Reconstructing Protein Backbones from α-Carbon Coordinates. J. Mol. Graph. Model. 2002, 21, 119–128.
  11. Etchebest, C.; Benros, C.; Hazout, S.; De Brevern, A.G. A Structural Alphabet for Local Protein Structures: Improved Prediction Methods. Proteins Struct. Funct. Bioinform. 2005, 59, 810–827.
  12. Rooman, M.J.; Rodriguez, J.; Wodak, S.J. Automatic Definition of Recurrent Local Structure Motifs in Proteins. J. Mol. Biol. 1990, 213, 327–336.
  13. Pandini, A.; Fornili, A.; Kleinjung, J. Structural Alphabets Derived from Attractors in Conformational Space. BMC Bioinform. 2010, 11, 97.
  14. Park, B.H.; Levitt, M. The Complexity and Accuracy of Discrete State Models of Protein Structure. J. Mol. Biol. 1995, 249, 493–507.
  15. Maupetit, J.; Gautier, R.; Tuffery, P. SABBAC: Online Structural Alphabet-based Protein BackBone Reconstruction from Alpha-Carbon Trace. Nucleic Acids Res. 2006, 34, W147–W151.
  16. Jones, T.; Thirup, S. Using Known Substructures in Protein Model Building and Crystallography. EMBO J. 1986, 5, 819–822.
  17. Claessens, M.; Van Cutsem, E.; Lasters, I.; Wodak, S. Modelling the Polypeptide Backbone with 'Spare Parts' from Known Protein Structures. Protein Eng. Des. Sel. 1989, 2, 335–345.
  18. Reid, L.S.; Thornton, J.M. Rebuilding Flavodoxin from Cα Coordinates: A Test Study. Proteins Struct. Funct. Bioinform. 1989, 5, 170–182.
  19. Levitt, M. Accurate Modeling of Protein Conformation by Automatic Segment Matching. J. Mol. Biol. 1992, 226, 507–533.
  20. Milik, M.; Kolinski, A.; Skolnick, J. Algorithm for Rapid Reconstruction of Protein Backbone from Alpha Carbon Coordinates. J. Comput. Chem. 1997, 18, 80–85.
  21. Gront, D.; Kmiecik, S.; Kolinski, A. Backbone Building from Quadrilaterals: A Fast and Accurate Algorithm for Protein Backbone Reconstruction from Alpha Carbon Coordinates. J. Comput. Chem. 2007, 28, 1593–1597.
  22. Rotkiewicz, P.; Skolnick, J. Fast Procedure for Reconstruction of Full-Atom Protein Models from Reduced Representations. J. Comput. Chem. 2008, 29, 1460–1465.
  23. Li, Y.; Zhang, Y. REMO: A New Protocol to Refine Full Atomic Protein Models from C-Alpha Traces by Optimizing Hydrogen-Bonding Networks. Proteins Struct. Funct. Bioinform. 2009, 76, 665–676.
  24. Kmiecik, S.; Gront, D.; Kolinski, M.; Wieteska, L.; Dawid, A.E.; Kolinski, A. Coarse-Grained Protein Models and Their Applications. Chem. Rev. 2016, 116, 7898–7936.
  25. Saqib, M.N.; Kryś, J.D.; Gront, D. Automated Protein Secondary Structure Assignment from Cα Positions Using Neural Networks. Biomolecules 2022, 12, 841.
  26. Kryś, J.D.; Gront, D. Coarse-Grained Potential for Hydrogen Bond Interactions. J. Mol. Graph. Model. 2023, 124, 108507.
  27. Liljas, A.; Liljas, L.; Piskur, J.; Lindblom, G.; Nissen, P.; Kjeldgaard, M. Textbook of Structural Biology; World Scientific: Singapore, 2009.
  28. Godzik, A.; Kolinski, A.; Skolnick, J. Lattice Representations of Globular Proteins: How Good Are They? J. Comput. Chem. 1993, 14, 1194–1202.
  29. Wang, G.; Dunbrack, R.L. PISCES: A Protein Sequence Culling Server. Bioinformatics 2003, 19, 1589–1591.
  30. Wang, G.; Dunbrack, R.L. PISCES: Recent Improvements to a PDB Sequence Culling Server. Nucleic Acids Res. 2005, 33, W94–W98.
  31. Macnar, J.M.; Szulc, N.A.; Kryś, J.D.; Badaczewska-Dawid, A.E.; Gront, D. BioShell 3.0: Library for Processing Structural Biology Data. Biomolecules 2020, 10, 461.
  32. Developers, T. TensorFlow. Zenodo. 2024.
  33. Hermann, T. Frugally-Deep. Available online: https://github.com/Dobiasd/frugally-deep (accessed on 10 October 2024).
  34. Johnson, L.S.; Eddy, S.R.; Portugaly, E. Hidden Markov Model Speed Heuristic and Iterative HMM Search Procedure. BMC Bioinform. 2010, 11, 431.
  35. wwPDB Consortium. Protein Data Bank: The Single Global Archive for 3D Macromolecular Structure Data. Nucleic Acids Res. 2019, 47, D520–D528.
Figure 1. λ angle parameterizing the backbone conformation.
Figure 2. Architecture of the deepBBQ neural network.
Figure 3. Density histograms of rmsd values [Å] between original backbone positions and ones reconstructed by deepBBQ for the test set. Histograms were cut off at 1.0 Å for readability.
Figure 4. Density histograms of real λ values and λ reconstruction errors for the test set, grouped by secondary structure element. Accounting for λ periodicity, the reconstruction error of λ lies between 0 and π, but its histogram was cut off at 1.0 for readability.
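The statement that the λ error lies between 0 and π follows from treating λ as a periodic angle. A minimal Python sketch of such a wrapped error is given below, assuming that λ is expressed in radians with a period of 2π; the function name `angular_error` is hypothetical and the code is not taken from deepBBQ.

```python
import math

def angular_error(true_angle, predicted_angle):
    """Smallest absolute difference between two angles in radians.

    The result always falls in [0, pi] because the difference is
    wrapped around the 2*pi period (an assumption of this sketch).
    """
    diff = abs(true_angle - predicted_angle) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)

# Angles of 3.0 and -3.0 rad lie close together across the period boundary.
print(angular_error(3.0, -3.0))
```

Without the wrap, the two angles in the example would appear to differ by 6.0 rad, whereas their wrapped difference is 2π − 6.0 rad.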
Figure 5. Example protein (4Y6W) rebuilt using deepBBQ. The rebuilt structure is shown in blue, while the superimposed native structure is in green.
Figure 6. Comparison of backbone reconstruction accuracy for different methods. Bar heights represent the RMSD between real and reconstructed backbone positions. The methods are listed chronologically by publication date. Methods labeled “+ EM” include an additional energy minimization step. Blue bars indicate methods using PDB structure fragments, orange bars indicate Milik-type peptide plate insertion methods, red bars indicate methods using torsion angle prediction, gray bars indicate machine learning methods, and light yellow bars indicate other methods. Protein test sets come from Table 2 for all methods except those labeled with “*” (Milik et al., Iwata et al., and Adcock), for which the test sets were taken from the corresponding articles [4,10,20].
Figure 7. Reconstruction error measured for the deepBBQ and cg2all methods on the de novo test set.
Table 1. Mean and mode of the distribution ρ of distance d between original backbone atoms and their reconstruction by deepBBQ (i.e., reconstruction error): (left) grouped by element and (right) grouped by Secondary Structure Element (SSE) type.
| Atom | Mean [Å] | Mode [Å] | SSE | Mean [Å] | Mode [Å] |
|---|---|---|---|---|---|
| C | 0.12 ± 0.18 | 0.045 | Helix | 0.12 ± 0.20 | 0.042 |
| N | 0.11 ± 0.15 | 0.039 | Strand | 0.18 ± 0.23 | 0.057 |
| O | 0.33 ± 0.46 | 0.069 | Coil | 0.23 ± 0.38 | 0.056 |
Table 2. RMSD values [Å] between the real backbone positions and the reconstructions produced by each method. Values for deepBBQ and cg2all were calculated by us, while all the others come from [9].
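| PDB Code | PD2 + Min | BBQ | MaxSprout | PULCHRA | SABBAC | REMO | deepBBQ | cg2all |
|---|---|---|---|---|---|---|---|---|
| 4EO0 | 0.211 | 0.203 | 0.392 | 0.348 | 0.286 | 0.437 | 0.229 | 0.147 |
| 4F7H | 0.293 | 0.356 | 0.505 | 0.471 | 0.480 | 0.551 | 0.318 | 0.232 |
| 4EXO | 0.264 | 0.289 | 0.447 | 0.459 | 0.436 | 0.530 | 0.235 | 0.190 |
| 4F7V | 0.444 | 0.410 | 0.436 | 0.617 | 0.450 | 0.643 | 0.377 | 0.199 |
| 4FAK | 0.285 | 0.325 | 0.461 | 0.461 | 0.366 | 0.514 | 0.184 | 0.207 |
| 4ANN | 0.238 | 0.292 | 0.405 | 0.452 | 0.445 | 0.514 | 0.320 | 0.221 |
| 4FFK | 0.355 | 0.447 | 0.390 | 0.604 | 0.441 | 0.607 | 0.361 | 0.264 |
| 4FD5 | 0.318 | 0.319 | 0.426 | 0.421 | 0.406 | 0.515 | 0.253 | 0.131 |
| 4AVX | 0.227 | 0.314 | 0.360 | 0.402 | 0.318 | 0.474 | 0.228 | 0.169 |
| 4EV1 | 0.239 | 0.304 | 0.396 | 0.411 | 0.385 | 0.504 | 0.219 | 0.159 |
| 4EG9 | 0.317 | 0.427 | 0.478 | 0.504 | 0.350 | 0.531 | 0.318 | 0.209 |
| 4EIU | 0.334 | 0.339 | 0.468 | 0.587 | 0.555 | 0.556 | 0.282 | 0.205 |
| 4F78 | 0.364 | 0.424 | 0.552 | 0.558 | 0.477 | 0.616 | 0.317 | 0.253 |
| 4FCU | 0.298 | 0.342 | 0.455 | 0.458 | 0.501 | 0.502 | 0.266 | 0.144 |
| 4FBR | 0.404 | 0.399 | 0.528 | 0.586 | 0.552 | 0.622 | 0.256 | 0.149 |
| 4FB7 | 0.248 | 0.335 | 0.344 | 0.461 | 0.359 | 0.506 | 0.183 | 0.193 |
| 4FIK | 0.374 | 0.356 | 0.288 | 0.568 | 0.575 | 0.563 | 0.326 | 0.134 |
| 4FAT | 0.227 | 0.413 | 0.359 | 0.540 | 0.431 | 0.632 | 0.337 | 0.124 |
| 4FE3 | 0.275 | 0.318 | 0.489 | 0.449 | 0.450 | 0.507 | 0.207 | 0.119 |
| 4FCS | 0.329 | 0.326 | 0.458 | 0.504 | 0.452 | 0.533 | 0.198 | 0.128 |
| 4E9L | 0.267 | 0.315 | 0.403 | 0.546 | 0.544 | 0.540 | 0.229 | 0.118 |
| 4F8X | 0.330 | 0.380 | 0.538 | 0.562 | 0.488 | 0.542 | 0.301 | 0.117 |
| 4FHG | 0.318 | 0.350 | 0.397 | 0.549 | 0.425 | 0.506 | 0.298 | 0.126 |
| 4EYO | 0.271 | 0.368 | 0.384 | 0.569 | 0.344 | 0.558 | 0.233 | 0.201 |
| 4F8J | 0.279 | 0.354 | 0.388 | 0.510 | 0.370 | 0.458 | 0.235 | 0.214 |
| 3VTF | 0.310 | 0.342 | 0.468 | 0.551 | 0.401 | 0.538 | 0.183 | 0.127 |
| 4FE9 | 0.392 | 0.406 | 0.433 | 0.573 | 0.534 | 0.568 | 0.254 | 0.152 |
| 4AVZ | 0.352 | 0.369 | 0.480 | 0.568 | 0.489 | 0.602 | 0.244 | 0.145 |
| mean | 0.306 | 0.351 | 0.433 | 0.510 | 0.440 | 0.542 | 0.264 | 0.171 |
| σ | 0.058 | 0.052 | 0.062 | 0.070 | 0.076 | 0.052 | 0.055 | 0.044 |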
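The summary rows of Table 2 can be rechecked with a short Python sketch that recomputes the mean of the deepBBQ and cg2all columns from the per-protein values listed above. The table does not state whether σ is a population or a sample standard deviation, so the sketch prints both; the variable names are illustrative only.

```python
from statistics import mean, pstdev, stdev

# Per-protein RMSD values [Å] copied from the deepBBQ and cg2all columns of Table 2.
deepbbq = [
    0.229, 0.318, 0.235, 0.377, 0.184, 0.320, 0.361, 0.253, 0.228, 0.219,
    0.318, 0.282, 0.317, 0.266, 0.256, 0.183, 0.326, 0.337, 0.207, 0.198,
    0.229, 0.301, 0.298, 0.233, 0.235, 0.183, 0.254, 0.244,
]
cg2all = [
    0.147, 0.232, 0.190, 0.199, 0.207, 0.221, 0.264, 0.131, 0.169, 0.159,
    0.209, 0.205, 0.253, 0.144, 0.149, 0.193, 0.134, 0.124, 0.119, 0.128,
    0.118, 0.117, 0.126, 0.201, 0.214, 0.127, 0.152, 0.145,
]

for name, values in (("deepBBQ", deepbbq), ("cg2all", cg2all)):
    # Mean, population standard deviation, and sample standard deviation.
    print(name, round(mean(values), 3), round(pstdev(values), 3), round(stdev(values), 3))
```

The recomputed means round to 0.264 Å for deepBBQ and 0.171 Å for cg2all, in agreement with the mean row of the table.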
Table 3. Comparison of deepBBQ results obtained with the real λ angle (the True column) and with the λ angle predicted by the neural network (the Predicted column).
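| PDB Code | Nres | True [Å] | Predicted [Å] |
|---|---|---|---|
| 1CRN | 46 | 0.051 | 0.615 |
| 6PTI | 58 | 0.057 | 0.234 |
| 1CTF | 68 | 0.165 | 0.344 |
| 1UBQ | 76 | 0.055 | 0.432 |
| 2OZ9 | 104 | 0.043 | 0.143 |
| 4EO0 | 115 | 0.049 | 0.205 |
| 2MHR | 118 | 0.165 | 0.422 |
| 4F7H | 135 | 0.160 | 0.352 |
| 2FOX | 138 | 0.032 | 0.355 |
| 5NLL | 138 | 0.112 | 0.351 |
| 4EXO | 144 | 0.127 | 0.300 |
| 4F7V | 161 | 0.157 | 0.393 |
| 4FAK | 163 | 0.115 | 0.205 |
| 2ALP | 198 | 0.103 | 0.297 |
| 4ANN | 210 | 0.149 | 0.317 |
| 4FFK | 214 | 0.113 | 0.387 |
| 4FD5 | 216 | 0.110 | 0.252 |
| 4AVX | 223 | 0.109 | 0.250 |
| 4EV1 | 229 | 0.062 | 0.260 |
| 4EG9 | 232 | 0.141 | 0.308 |
| 4EIU | 241 | 0.134 | 0.299 |
| 4F78 | 254 | 0.127 | 0.323 |
| 4FCU | 262 | 0.120 | 0.315 |
| 4FBR | 273 | 0.061 | 0.291 |
| 4FB7 | 274 | 0.055 | 0.150 |
| 4FIK | 278 | 0.125 | 0.339 |
| 2PRK | 279 | 0.094 | 0.258 |
| 4FAT | 280 | 0.146 | 0.339 |
| 4FE3 | 295 | 0.060 | 0.235 |
| 5CPA | 307 | 0.172 | 0.391 |
| 4FCS | 315 | 0.079 | 0.195 |
| 4E9L | 318 | 0.085 | 0.234 |
| 3APP | 323 | 0.113 | 0.301 |
| 4F8X | 335 | 0.132 | 0.309 |
| 9WGA | 340 | 0.316 | 0.367 |
| 4FHG | 342 | 0.138 | 0.306 |
| 4EYO | 358 | 0.088 | 0.225 |
| 4F8J | 365 | 0.088 | 0.249 |
| 3VTF | 432 | 0.068 | 0.205 |
| 2CTS | 437 | 0.096 | 0.308 |
| 4FE9 | 450 | 0.097 | 0.250 |
| 1TIM | 494 | 0.412 | 0.465 |
| 4AVZ | 608 | 0.080 | 0.246 |
| Average | | 0.115 | 0.303 |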
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Kryś, J.D.; Głowacki, M.; Śmieja, P.; Gront, D. deepBBQ: A Deep Learning Approach to the Protein Backbone Reconstruction. Biomolecules 2024, 14, 1448. https://doi.org/10.3390/biom14111448