Using AlphaFold Predictions in Viral Research

Elucidation of the tertiary structure of proteins is an important task for biological and medical studies. AlphaFold, a modern deep-learning algorithm, enables the prediction of protein structure to a high level of accuracy. It has been applied in numerous studies in various areas of biology and medicine. Viruses are biological entities infecting eukaryotic and procaryotic organisms. They can pose a danger for humans and economically significant animals and plants, but they can also be useful for biological control, suppressing populations of pests and pathogens. AlphaFold can be used for studies of molecular mechanisms of viral infection to facilitate several activities, including drug design. Computational prediction and analysis of the structure of bacteriophage receptor-binding proteins can contribute to more efficient phage therapy. In addition, AlphaFold predictions can be used for the discovery of enzymes of bacteriophage origin that are able to degrade the cell wall of bacterial pathogens. The use of AlphaFold can assist fundamental viral research, including evolutionary studies. The ongoing development and improvement of AlphaFold can ensure that its contribution to the study of viral proteins will be significant in the future.


Introduction
Proteins play a crucial role both in building biological structures and in managing biochemical processes in living organisms. Proteins are linear unbranched polymers of amino acid residues. To possess biological activity, proteins adopt unique three-dimensional structures (folds), which is known as the "native state" [1,2]. The folded structure is determined by the amino acid sequence of the protein ("primary structure") [3,4], and the formation of the folded native conformation ("tertiary structure") starts with rapid folding into a "secondary structure", which is a local spatial conformation of the polypeptide backbone, stabilised by intramolecular hydrogen bonds [5]. The most common elements of the secondary structure are α-helices and β-sheets. The so-called "quaternary structure" is the result of the assembly of folded proteins or protein subunits into complexes that constitute the fully functional protein [6]. Thus, the protein structure can be described using four levels of organisation: a primary, secondary, tertiary and, for some proteins, quaternary structure (Figure 1).
Knowledge of the three-dimensional structure of proteins is important for understanding their functions. A detailed knowledge of three-dimensional structure is crucial for protein structure-based drug design [8]. The main techniques for determining protein structures are X-ray crystallography [9], NMR spectroscopy [10] and cryo-electron microscopy [11]. Experimentally determined protein structures are stored in databases, the largest of them being the publicly available Protein Data Bank (PDB) (https://www.rcsb.org/, accessed on 1 March 2023). As of March 2023, the PDB database contained about 202,000 experimentally determined structures, most of which belonged to proteins. This is, however, just a small fraction of all proteins for which the primary sequences are known. The UniProtKB/TrEMBL database alone contains over 200 million sequence records (database release 2022_05 of 14 December 2022 contained 229,580,745 sequence entries, https://www.ebi.ac.uk/uniprot/TrEMBLstats, accessed on 1 March 2023). Thus, the prediction of the three-dimensional structure of a protein is an urgent problem that aims to fill the gap between the large number of known primary sequences and the relatively small number of known structures.
Curr. Issues Mol. Biol. 2023, 3, FOR PEER REVIEW
Figure 1. Three-dimensional structure of SARS-CoV-2 trimeric spike glycoprotein, determined with electron microscopy (PDB code #7DF3 [7]). (a) Monomeric subunit coloured based on a rainbow gradient scheme, where the N-terminus of the polypeptide chain is coloured blue, and the C-terminus is coloured red. (b) Monomeric subunit coloured based on the secondary structure, where α-helices are coloured cyan, β-sheets are coloured magenta, and loops are coloured wheat. (c) Quaternary structure of functional trimer, where each monomer is coloured in a different colour.
Prediction of the three-dimensional structure of proteins is a difficult task. For a long time, the main prediction methods included comparative modelling (homology modelling), threading, ab initio and machine-learning approaches [12,13]. The development of end-to-end machine-learning approaches in recent years has resulted in the emergence of new techniques that can often outperform other methods [2,14]. Moreover, recent progress associated with deep-learning methods enables speculation about a revolution in protein-structure prediction [15]. One of the most popular deep-learning techniques is Alphabet-Google DeepMind's neural network-based end-to-end solution AlphaFold2 (AlphaFold, AF2), which was presented in the CASP14 competition [16], the second iteration of the AlphaFold system entered in CASP13 [17]. AlphaFold employs a deep-learning approach and a conventional neural network. This technique is able to predict the distance and torsion distribution of proteins, using training schemes of experimentally determined PDB structures, protein primary sequences and the multiple sequence alignment (MSA) of proteins. In CASP14, AlphaFold2 structures had a median backbone accuracy of 0.96 Å RMSD95 (Cα root-mean-square deviation at 95% residue coverage) and an all-atom accuracy of 1.5 Å RMSD95. The corresponding values for the prediction of the best alternative method were 2.8 Å and 3.5 Å [16]. The high level of accuracy of AlphaFold2 predictions boosted the popularity of this technique. One might even talk about "AlphaFold mania", given the astonishing increase in the number of journal articles and preprints citing AlphaFold2 AI software [18].
As of the beginning of March 2023, the original paper [16], published in July 2021, which described AlphaFold2's release together with its source code, had been accessed about a million times and, according to the Web of Science metric, had been cited about 5000 times (https://www.nature.com/articles/s41586-021-03819-2/metrics, accessed on 1 March 2023).
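The RMSD95 measure quoted above for CASP14 can be illustrated with a short calculation. The sketch below is a simplified reading of "Cα RMSD at 95% residue coverage": it superposes the model on the reference once with the Kabsch algorithm and then averages over the 95% of residues with the smallest deviations. The function names are hypothetical, and the single-pass superposition is an assumption of this sketch; the published evaluation may re-superpose iteratively on the retained residues, which is not reproduced here.

```python
import numpy as np


def kabsch_superpose(mobile: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return `mobile` (N x 3) optimally rotated and translated onto `target` (N x 3)."""
    mobile_centroid = mobile.mean(axis=0)
    target_centroid = target.mean(axis=0)
    mobile_centred = mobile - mobile_centroid
    target_centred = target - target_centroid
    # SVD of the covariance matrix yields the optimal rotation (Kabsch algorithm).
    covariance = mobile_centred.T @ target_centred
    u, _, vt = np.linalg.svd(covariance)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return mobile_centred @ rotation.T + target_centroid


def rmsd_at_coverage(model_ca: np.ndarray, ref_ca: np.ndarray, coverage: float = 0.95) -> float:
    """C-alpha RMSD over the best-fitting `coverage` fraction of residues (single-pass sketch)."""
    fitted = kabsch_superpose(model_ca, ref_ca)
    deviations = np.linalg.norm(fitted - ref_ca, axis=1)
    n_keep = max(1, int(np.ceil(coverage * len(deviations))))
    kept = np.sort(deviations)[:n_keep]  # discard the worst-fitting residues
    return float(np.sqrt(np.mean(kept ** 2)))
```

Calling `rmsd_at_coverage(model_ca, ref_ca, coverage=1.0)` gives the ordinary RMSD after superposition, so the two values can be compared to see how much a few poorly modelled residues (for example, flexible termini) contribute.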
The updated version of AlphaFold2, called AlphaFold-Multimer, also developed by DeepMind, was released several months after AlphaFold2 [19]. AlphaFold-Multimer was designed to predict the three-dimensional structure of protein complexes. AlphaFold-Multimer was benchmarked on a large dataset of 4446 protein complexes, successfully predicting the interface in 70% of cases of heteromeric interfaces and in 72% of cases of homomeric interfaces. A high level of predictive accuracy was demonstrated in 26% of cases of heteromeric interfaces and 36% of cases of homomeric interfaces.
The level of accuracy of AlphaFold (and other AI protein-folding methods, such as RoseTTAFold [20]) makes it tempting to use AlphaFold predictions in various fields of biological and medical research. In particular, virology, the importance of which has become especially evident in the light of the recent COVID-19 pandemic, has received a new tool that can solve a number of problems requiring the knowledge of three-dimensional protein structures. Virology studies viruses, probably the most widespread entities on Earth [21]. Viruses infect various cellular organisms, including eukaryotes, archaea and bacteria. In the latter case, they are called "bacteriophages", or "phages". Phages and their proteins that are harmful to bacteria can be used to fight bacterial infection in humans, animals and plants [22,23]. So-called "phage therapy", or the use of bacteriophages to treat bacterial infections, can assist in the context of the rise of antimicrobial resistance [24]. This review describes different cases of the use of AlphaFold for the purposes of viral research. It summarizes the results of the studies involving AlphaFold predictions, analyses the possible advantages and disadvantages of AlphaFold for predictions of viral proteins and discusses corresponding studies (Table 1).

Table 1.
Authors, Year | Virus or Viral Group | Study Aim(s) | Results and AF2 Usage
[...] | [...] | to study the mechanisms of influence of HSV-1 on sphingolipid metabolism | Using AF2 predictions, the residues essential for the binding of involved proteins were identified and experiments demonstrating that HSV-1 modifies the sphingolipid metabolism via specific protein-protein interactions were conducted.
[...] | [...] | [...] | Distinctive features of phage RBPs, the use of RBPs as antibacterial agents and the application of AlphaFold for the prediction of RBPs' structure were described.
[...] | [...] | to explore the mechanism of formation of the phage nucleus | The ability of Phage Nuclear Enclosure (PhuN) protein to spontaneously assemble into 2D sheets with p2 and p4 symmetry was shown. The p2 symmetric state was resolved by Cryo-EM. AF2 was used to build a model of the 2D array.
Šiborová et al., 2022 [50] | Escherichia phage SU10 | to study the mechanism of phage genome delivery | Cryo-EM and Cryo-ET characterisation of the attachment of the phage to the host cell was presented. The formation of a tail nozzle after rearrangement was shown. AF2 was used to build tail models.
Conners et al., 2021 [51] | Klebsiella phage f1 | to study the structural bases of the mechanism of phage egress and its practical application | Cryo-EM structure of phage-encoded pIV secretin was determined, and the mechanism for phage egress was proposed. AF2 was used to predict the structure of the N0 domain of pIV.
Eskenazi et al., 2022 [52] | Klebsiella phage M1 | to investigate the effectiveness of combined pre-adapted bacteriophage therapy and antibiotics for the treatment of fracture-related infection | The therapy resulted in an objective improvement in the patient's wounds and overall condition. The combination of phage and antibiotic therapy was demonstrated to be highly effective against the patient's K. pneumoniae strain. AlphaFold was used for the modelling of original and mutated phage proteins.
McGinnis et al., 2022 [53] | Mycobacterium phage TipsytheTRex | to study the mechanism of the interaction of the immunity repressor and DNA | A Dual DNA binding domains model of the repressor was proposed. An AlphaFold model of the repressor protein was used to significantly improve the structure obtained using single-wavelength anomalous diffraction phasing.
[...] | [...] | to propose an approach that addresses the absence of cofactors and co- or post-translational modifications in AF2 models | This approach combines sequence and structure data to transfer protein glycosylation from a library of structurally balanced glycan blocks to the AlphaFold model. The algorithm was integrated into the Privateer software.
Van Breugel et al., 2022 [77] | | to assess the quality of AF2 models in the study of centrosome and centriole biogenesis | AF2 models can reveal important insights into the structural features of two key proteins in centrosome and centriole biogenesis, CEP192 and CEP44. The AF2 algorithm was used to predict, with subsequent experimental validation, previously unknown primary features in the structure of TTBK2 associated with CEP164, as well as the Chibby1-FAM92A complex.
Lane 2023 [78] | | to discuss AF2 restrictions concerning structural distribution and other issues | As deep-machine-learning algorithms develop, they require more and more experimental data. In the author's opinion, experimental methods such as time-resolved crystallography, cryo-EM data and others can provide information that enables researchers to penetrate the essence of protein functioning.
Bertoline et al., 2023 [79] | | to provide an overview of changes in protein structure prediction before and after the advent of AF2 | The advent of AF2 has taken the protein folding prediction problem to the next step; however, it has several limitations. AF2 instigated the emergence of new tools, such as ESMfold, which, although inferior in accuracy, use different approaches, which enable very fast predictions.
Buel et al., 2022 [80] | | to study the ability of AF2 to predict the effect of missense mutations on structure | AF2 seems not to be able to predict the effect of missense mutations on the 3D structure of proteins. Differences between mutated and wild-type structures predicted by AlphaFold were extremely small.
Pak et al., 2021 [81] | | to evaluate the ability of AlphaFold to predict the impact of single mutations on protein stability | It seems impossible to obtain a reliable evaluation of the impact of mutation on protein stability with the direct application of AI predictions.

Application of AlphaFold for SARS-CoV-2 Research
The outbreak of disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, realm Riboviria, class Pisoniviricetes, order Nidovirales, family Coronaviridae, genus Betacoronavirus) and the spread of the associated infection boosted research on coronaviruses. The structure of the SARS-CoV-2 spike (S) glycoprotein, the main target of antibodies, has been determined by cryo-electron microscopy and was used in the development of vaccines and inhibitors [82,83]. The S glycoprotein promotes entry of the virus into the cell. Another target of drug design is the main protease, which cleaves the initially translated polyprotein into functional viral proteins. The crystal structure of the SARS-CoV-2 main protease was also obtained experimentally [84].
To assist with tasks related to general research and drug design, different structure prediction techniques, including AlphaFold, were used for the prediction of SARS-CoV-2 proteins [25-29,85]. The main task was probably the investigation of the mechanism of interaction between the SARS-CoV-2 receptor-binding protein (RBP), which is the SARS-CoV-2 spike, and the angiotensin-converting enzyme 2 (ACE2) receptor. AF2 predictions enabled clarification of the structural features of monomeric and multimeric formulations of the vaccine and suggested that the monomeric formulation presents more antigenic epitopes [27]. The emergence of new immune-escaping variants of SARS-CoV-2, such as Omicron BA.1, made it important to study potential mutation sites that do not yet exist in nature but could increase the binding affinity between the receptor-binding domain (RBD) of the spike and the receptor [29]. AF2 predictions were successfully used to find an explanation for the observed reduction in the neutralisation of SARS-CoV-2 variants of concern compared with other variants [28]. AF2 predictions can be combined with molecular dynamics simulations to improve modelling accuracy [86] and to predict the physical properties of proteins. Such models can be used for studies of both qualitative and quantitative aspects of the formation of the quaternary structure of proteins [85]. AlphaFold models are useful for revealing possible ligand binding sites. Together with virtual screening and in silico validation, these approaches provide the basis for the biological testing of new drugs and for the repurposing of natural products [25].
The accuracy of predicted structures can be assessed using computational techniques [87] and via experimental methods, e.g., optical spectroscopy or measurement of residual dipolar couplings (RDCs) in solution [30,88]. A meticulous evaluation of the concordance of AF2 models of the SARS-CoV-2 homodimeric 3C-like protease (Mpro) with RDCs measured in solution for 15N-1HN and 13C-1HN atom pairs indicated close agreement of the AlphaFold predictions with experimental data (Figure 2) [30].
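The comparison of a model with measured RDCs can be made concrete with a small numerical sketch. It assumes one common formulation: each RDC is a linear function of the five independent elements of a traceless, symmetric alignment tensor, the tensor is obtained by an SVD-based least-squares fit to the bond-vector orientations taken from the model, and agreement is summarised by a Q-factor defined as rms(D_obs - D_calc) / rms(D_obs). The function names are hypothetical, the coupling constant is absorbed into the fitted parameters, and other Q-factor normalisations exist, so this is not necessarily the exact procedure of Ref. [30].

```python
import numpy as np


def rdc_design_matrix(unit_vectors: np.ndarray) -> np.ndarray:
    """Linear basis relating bond-vector orientations (N x 3, unit length) to RDCs.

    Columns multiply (Sxx, Syy, Sxy, Sxz, Syz); Szz = -Sxx - Syy is implied.
    """
    x, y, z = unit_vectors[:, 0], unit_vectors[:, 1], unit_vectors[:, 2]
    return np.column_stack([x**2 - z**2, y**2 - z**2, 2 * x * y, 2 * x * z, 2 * y * z])


def fit_alignment_tensor(unit_vectors: np.ndarray, rdc_obs: np.ndarray):
    """Least-squares fit (NumPy's lstsq is SVD-based) of the alignment tensor elements."""
    design = rdc_design_matrix(unit_vectors)
    params, *_ = np.linalg.lstsq(design, rdc_obs, rcond=None)
    return params, design @ params  # fitted elements and back-calculated RDCs


def q_factor(rdc_obs: np.ndarray, rdc_calc: np.ndarray) -> float:
    """Q = rms(obs - calc) / rms(obs); lower values mean better agreement."""
    return float(np.sqrt(np.sum((rdc_obs - rdc_calc) ** 2) / np.sum(rdc_obs ** 2)))
```

In use, the unit vectors would be the normalised N-H (or other) bond vectors extracted from the predicted or experimental structure, and the same fit repeated for each structure allows their Q-factors to be compared.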
Interestingly, the high level of accuracy of AF2 predictions makes it possible to use AlphaFold predictions to determine a macromolecular structure from crystallographic diffraction experiments. It has been shown that a template-free AF2 model, generated by the AlphaFold2 group, was of sufficient quality to phase the native SARS-CoV-2 ORF8 dataset by molecular replacement, overcoming the limitations of the crystallographic phasing problem [26]. However, a comparison of the laboratory-derived structure of the SARS-CoV-2 spike RBD with trRosetta-generated models [89] and with models generated by AlphaFold v2.1.0, based on RMSD (root-mean-square deviation of atomic positions) values, indicated a high level of accuracy for both methods, with the better results obtained with trRosetta.
Figure 2. [...] Excluded residues (red) illustrated on a ribbon diagram (PDB code 5R8T; only a single chain is shown, for clarity); residues with missing RDCs are shown in grey and the catalytic dyad is shown in yellow. (d) Q-factors from SVD fits of 1D_NH and 2D_CH RDCs to the included region of all available Mpro X-ray structures, plotted as a histogram, with the top-ranked (Amber-relaxed) AF2 models obtained using full, date-limited and sequence-limited implementations marked. (e) Q-factors of all Amber-relaxed models. (f) X-ray structure resolution vs. Q-factor and (g) Cα RMSD (relative to 5R8T) vs. Q-factor. (h) Cα wireframe of all 352 PDB structures. Images courtesy of Dr. Adriaan Bax. Reprinted/adapted with permission from Ref. [30]. Not subject to U.S. Copyright.
The DNA polymerase (DNAP) of monkeypox virus (MPXV) is a very important antiviral drug target. The laboratory-derived structure of MPXV DNAP was deposited in the RCSB PDB database (PDB code 8HG1) in mid-November 2022, and a paper describing this structure was published in January 2023 [92]. Before that, an AF2-derived structure had been obtained and used in the search for and design of new inhibitors of MPXV DNAP. The molecules found were predicted to bind to the MPXV DNAP with a binding energy comparable to that of brincidofovir and cidofovir. New MPXV DNAP inhibitors are important in the context of possible drug resistance, which can arise due to mutations in proteins of the DNA replication complex (RC). Studies of the effect of mutations in the MPXV RC using AF2-generated models have suggested similar mechanisms of drug resistance to cidofovir in monkeypox and vaccinia viruses [32]. It appears that the use of highly accurate AlphaFold predictions can assist in forecasting the emergence of drug-resistant variants of concern and thus improve preparedness for them.
The molecular mechanism of interaction of tecovirimat with the monkeypox phospholipase D (F13) was studied using AlphaFold models and molecular dynamics simulations [33]. The results suggested a detailed mechanism of inhibition of F13 by tecovirimat (Figure 3) and supported the efficacy of tecovirimat against monkeypox virus, emphasising the importance of the availability of precise modelling for revealing molecular mechanisms of drug action.
The development of new drugs is barely possible without an understanding of the mechanisms of viral infection. This knowledge can often require robust structural analysis, which can make use of modern deep-learning structure prediction methods. AlphaFold can facilitate the elucidation of the functionality of viral proteins.
Herpesviruses constitute an important group of pathogens that infect animals, including humans. Herpesviruses infect most vertebrates, causing a lifelong latent infection [93]. Herpesviruses belong to the realm Duplodnaviria, class Herviviricetes, order Herpesvirales, and comprise the families Alloherpesviridae, Herpesviridae and Malacoherpesviridae [94]. Human herpesviruses belong to the family Herpesviridae. Herpes simplex virus 1 (HSV-1) (subfamily Alphaherpesvirinae, genus Simplexvirus), residing in sensory or sympathetic neurons, has been shown to severely modify infected cells and to remodel the composition and architecture of cellular membranes [35,95,96]. One of the HSV-1 proteins, phosphatase adaptor UL21, mediates dephosphorylation and accelerates the rate of ceramide to sphingomyelin conversion, altering cell membranes and influencing viral replication [35]. AlphaFold-Multimer modelling has revealed the details of the interaction of UL21 with the viral protein UL16 and has made it possible to suggest functions for the domains of the latter protein from its structural features. Specific protein-protein interactions have been shown to be essential for lipid metabolism [35]. The use of AlphaFold has also shown that another HSV-1 protein, the tegument protein UL37, interacts with the cytoplasmic surface of the lipid membrane, suggesting that UL37 can be a peripheral membrane protein [36]. AlphaFold predictions have suggested the domain organisation of UL37 and, together with experimental studies and molecular dynamics simulations, have clarified the structural features and molecular mechanisms of UL37 interactions.
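One simple way to shortlist candidate binding residues from a predicted complex, such as an AlphaFold-Multimer model of two interacting viral proteins, is a distance cutoff between the chains. The sketch below is illustrative only and is not the procedure of the cited studies: the function name, the input layout (per-atom coordinates with a parallel array of residue numbers for each chain) and the 5 Å cutoff are all assumptions.

```python
import numpy as np


def interface_residues(coords_a, resids_a, coords_b, resids_b, cutoff=5.0):
    """Return residue numbers of chain A and chain B that have any atom within `cutoff` Å
    of the other chain.

    coords_*: (N, 3) atom coordinates; resids_*: (N,) residue number of each atom.
    """
    coords_a = np.asarray(coords_a, dtype=float)
    coords_b = np.asarray(coords_b, dtype=float)
    resids_a = np.asarray(resids_a)
    resids_b = np.asarray(resids_b)
    # All pairwise inter-chain atom distances (N_a x N_b).
    distances = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=2)
    in_contact = distances < cutoff
    residues_a = sorted(set(resids_a[in_contact.any(axis=1)].tolist()))
    residues_b = sorted(set(resids_b[in_contact.any(axis=0)].tolist()))
    return residues_a, residues_b
```

Residues selected in this way are only candidates; as in the studies discussed above, their importance for binding still has to be tested experimentally, for example by mutagenesis.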
Fundamentally similar tasks concerning research on other viral pathogens of animals, including humans, and plants can be made easier by the use of AlphaFold predictions. These tasks include mechanisms that are crucial for viral attachment, penetration, replication, release and other steps in the viral infection cycle. They can include the investigation of viral proteins and membranes [38,41,43], viral proteins and DNA [39] and studies of viral proteins, glycoproteins and their mutations [37,40,42]. It is noteworthy that AlphaFold predictions are often used as part of an integrated approach, making the planning of experiments easier and improving understanding of the results obtained.

Application of AlphaFold for Research on Bacteriophages
Bacteriophages (a.k.a. phages) are viruses that infect and replicate in bacterial cells alone. Bacteriophages are ubiquitous: they can be found in water, soil and various living organisms [97]. The total number of bacteriophages can be estimated at 10^31 viral particles, which is 10-100 times the number of cells [98]. The total mass of these particles is about a trillion tons [99]. Phages are also members of plant and animal microbiomes, including those of humans. For example, the human gastrointestinal tract contains more than 10^12 phage virions [100]. The ability of bacteriophages to destroy the cells of pathogenic bacteria attracted the attention of scientists as early as the beginning of the 20th century. In recent decades, interest in bacteriophage therapy has begun to grow, primarily due to the spread of antibiotic resistance. Phage therapy has important advantages [101], including sustained bactericidal activity and "autodosing", wherein the number of phages positively correlates with the number of host bacteria. Furthermore, phages have low intrinsic toxicity, and phage therapy is characterised by minimal disruption of normal flora and the lack of cross-resistance with antibiotics.
The practical use of phages for phage therapy requires an understanding of the structural bases of interactions of the host receptor and phage receptor-binding proteins (RBPs); the latter can include tail fibre and tail spike proteins (TFP and TSP). In addition, phage RBPs, as well as endolysins and ectolysins, the proteins that cause cell lysis, can be used as antibacterial agents by themselves [45,102]. The analysis of the structural features of phage RBPs and lysins can use modern deep-learning techniques, including AlphaFold. Together with experimental studies, AlphaFold predictions can be used to elucidate the domain organisation of TFP, TSP and cell-wall degrading enzymes, to reveal the sites of phage particle binding and enzymatic domains (Figure 4) [45-47,52].
As in the case of the eukaryotic viruses mentioned above, AlphaFold predictions can contribute to building a model of the viral particle [48,103] or of virion parts, including the attachment apparatus [46,50] and phage egress machinery [51]. All the steps of phage infection are accompanied by macromolecular interactions that include proteins, so AlphaFold's highly accurate structural predictions can assist in the elucidation of the mechanisms of the formation of the phage nucleus [49], lysogeny maintenance [53] or antiphage defence [44,54]. AlphaFold can also be useful in the trivial but relevant task of phage genome annotation, assisting the prediction of gene function. As of January 2023, 19,499 GenBank sequences assigned to the class Caudoviricetes contained 1,731,815 coding regions, 67% of which were annotated as hypothetical proteins. In some cases, BLAST search and HMM-HMM motif comparisons fail to assign a function to proteins encoded in phage genomes, but analysis of the fold of AF2-derived structures can help to clarify this function [55].
It seems that no large-scale studies have been published on the accuracy of modelling using AF2 compared with the predictions of other algorithms. However, the predicted average local distance difference test (lDDT) scores of 54 AF2-derived models of the major capsid protein and of the ATPase subunit of phage terminase indicated an impressive level of accuracy of the predictions [55]. Interestingly, structural predictions of the more conserved terminase were more accurate than those of the major capsid protein (terminase lDDT mean: 0.988, median: 0.996; major capsid protein lDDT mean: 0.907, median: 0.929). The average lDDT of the ATPase domains extracted from the models of the ATPase subunit of phage terminase was even higher (mean: 0.998, median: 0.999). An evaluation of models of the same major capsid proteins, carried out using a different deep-learning algorithm, RoseTTAFold, showed a lower accuracy of prediction (lDDT mean: 0.634, median: 0.649) than with the AlphaFold models (Figure 5).
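For readers unfamiliar with the lDDT scale used above, the following sketch computes a simplified, Cα-only global lDDT between a model and a reference structure. It follows the usual definition (the fraction of reference distances within a 15 Å inclusion radius that are preserved in the model, averaged over tolerance thresholds of 0.5, 1, 2 and 4 Å), but the function name and the restriction to Cα atoms are assumptions of this sketch. It also differs from the values quoted above, which were estimated by an accuracy predictor (DeepAccNet) without a reference structure.

```python
import numpy as np


def ca_lddt(model_ca, ref_ca, inclusion_radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT on C-alpha atoms; both inputs are (N, 3) arrays in the same order."""
    model_ca = np.asarray(model_ca, dtype=float)
    ref_ca = np.asarray(ref_ca, dtype=float)
    ref_dist = np.linalg.norm(ref_ca[:, None, :] - ref_ca[None, :, :], axis=2)
    model_dist = np.linalg.norm(model_ca[:, None, :] - model_ca[None, :, :], axis=2)
    # Each unordered residue pair once, excluding self-pairs.
    i, j = np.triu_indices(len(ref_ca), k=1)
    considered = ref_dist[i, j] < inclusion_radius
    if not considered.any():
        raise ValueError("no residue pairs within the inclusion radius")
    deviation = np.abs(ref_dist[i, j][considered] - model_dist[i, j][considered])
    # Fraction of preserved distances at each tolerance, then the average over tolerances.
    preserved = [float((deviation < t).mean()) for t in thresholds]
    return float(np.mean(preserved))
```

Because the score is built from internal distances, it does not require superposition of the two structures, and a value of 1.0 means that every considered distance is reproduced within 0.5 Å.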
Figure 5. Comparison of the overall accuracy of predictions made with the Local Distance Difference Test (lDDT), using the DeepAccNet accuracy predictor. MCP_RoseTTAFlold: RoseTTAFold models of the MCP; MCP_AF2: AlphaFold models of the MCP; Ter_AF2: terminase ATPase subunit models predicted with AlphaFold; ATPase_AF2: models of the ATPase domain of terminase ATPase subunits predicted with AlphaFold. Reprinted/adapted with permission from Ref. [55]. © 2023 by the authors.

Application of AlphaFold for Evolutionary and Taxonomic Studies
Comparing structural similarity and specific structural features can clarify the evolutionary relationships between proteins. Furthermore, the emergence of new high-precision algorithms for predicting the structure of proteins, including AlphaFold, can enable the identification of evolutionary relationships between highly divergent proteins, using the results of structural modelling. The evolution of proteins may be accompanied by the appearance of new domains, and comparative analysis of AF2-derived structures can help reveal patterns of protein evolution. Studies of bacteriophage tail sheath proteins, an important part of phages' contractile injection system, have enabled the identification of the common core domain, including both N-terminal and C-terminal parts. The remaining variable parts, consisting of one or more moderately conserved domains, have presumably been added during phage evolution (Figure 6) [58].
Figure 6. Examples of the structural architecture of AF2-derived contractile phage sheath proteins [58]. Proteins consisting of two or more domains are superimposed with the modelled structure of the Burkholderia phage BEK tail sheath protein, depicted in red. The schemes on the left show the structural architecture of proteins. The main domain is depicted as a circle, with additional domains represented as squares with rounded corners. The direction of the polypeptide chain, from the N- to the C-terminus, is shown with arrows. Reprinted/adapted with permission from Ref. [58]. © 2022 by the authors.
Structural similarity is widely used to evaluate evolutionary relationships between proteins whose amino acid sequence similarity is low or cannot be determined at all [104,105]. The structural similarity between two proteins can be assessed using root-mean-square deviation (RMSD) or other metrics such as template modelling score (TM-score) and DALI Z-score; the latter two metrics have a number of advantages over RMSD [84,106]. Clustering of experimentally determined structures of major capsid proteins using the DALI Z-score has already been used to illustrate the common origin of some viral groups and to cluster prokaryotic viruses [56,104]. The integrated use of experimental and AF2-derived structures can help elucidate evolutionary relationships and support the taxonomic classification of bacteriophages and eukaryotic viruses [57,59,107]. AlphaFold modelling and subsequent clustering have been used in taxonomic studies of archaeal viruses [56]. Clustering using structures predicted by AlphaFold showed interesting and often biologically meaningful results [55] (Figure 7). It should also be noted that the native state of viral proteins can change according to the state of the viral particle (e.g., empty, full, expanded capsids) and according to the stage of viral particle assembly [108][109][110][111]. The correlation between structural similarity and sequence identity is not absolute due to conformational plasticity, solvent effects and ligand binding [112]. Most of these limitations apply to studies that involve experimentally determined structures, but, hypothetically, they could be exacerbated by structural prediction errors. Therefore, predicting the effectiveness of using AlphaFold for the analysis of structural similarity and evolutionary history, based only on the similarity of the predicted structures, seems to be a difficult task [55].
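As an illustration of the metrics mentioned above, the following Python sketch computes the RMSD after optimal least-squares (Kabsch) superposition and a simplified TM-score for two sets of Cα coordinates. It assumes that the residues have already been paired one-to-one, a step that dedicated alignment tools perform themselves, and it evaluates the TM-score at the RMSD-optimal superposition rather than searching for the superposition that maximises the score. The function names are illustrative and are not taken from any of the cited studies.

```python
import numpy as np


def kabsch_superpose(mobile, target):
    """Rotate and translate `mobile` (N x 3) onto `target` (N x 3) by least squares."""
    mobile_centred = mobile - mobile.mean(axis=0)
    target_centred = target - target.mean(axis=0)
    covariance = mobile_centred.T @ target_centred
    u, _, vt = np.linalg.svd(covariance)
    # Correct for a possible reflection so that the result is a proper rotation.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return (rotation @ mobile_centred.T).T + target.mean(axis=0)


def rmsd(a, b):
    """Root-mean-square deviation between two already superposed coordinate sets."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def tm_score_fixed_alignment(mobile, target):
    """Simplified TM-score: fixed residue pairing, Kabsch superposition only."""
    length = len(target)
    # Length-dependent distance scale; clamped for very short chains (assumption).
    d0 = max(1.24 * np.cbrt(length - 15) - 1.8, 0.5)
    fitted = kabsch_superpose(mobile, target)
    distances = np.linalg.norm(fitted - target, axis=1)
    return float((1.0 / (1.0 + (distances / d0) ** 2)).sum() / length)
```

Each residue pair contributes at most 1 to the TM-score sum and distant pairs contribute almost nothing, so the score is less sensitive than the RMSD to a few badly placed regions. This is one reason it is often preferred when comparing divergent proteins.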
Figure 7. Heatmap (a) and dendrogram (b) based on the pairwise Z-score comparisons of 57 major capsid proteins and encapsulin AF models, using DALI. The branch lengths are measured using the DALI Z-score, and the tree was rooted to encapsulin. "A": archaeal viruses; "E": eukaryotic viruses; "+": phages infecting Gram-positive bacteria; "−": phages infecting Gram-negative bacteria. Groups correspond to clusters found as a result of structural comparison. Reprinted/adapted with permission from Ref. [55]. © 2023 by the authors.
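Clustering of this kind, starting from a matrix of pairwise structural similarity scores, can be sketched in Python as follows. This is a minimal example and not the procedure used in Ref. [55]: the Z-score matrix is invented toy data, the protein labels are hypothetical, and the conversion of Z-scores to distances, 1 − Z_ij/√(Z_ii·Z_jj), is an assumption made purely for illustration. Average-linkage clustering from SciPy is used here as a generic stand-in.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def zscores_to_distances(z):
    """Turn a square matrix of pairwise Z-scores into a distance matrix.

    Assumption: scores are normalised by the self-comparison scores on the
    diagonal, so identical structures are at distance 0.
    """
    z = np.asarray(z, dtype=float)
    z_sym = (z + z.T) / 2.0  # pairwise scores are not always symmetric
    self_scores = np.sqrt(np.outer(np.diag(z_sym), np.diag(z_sym)))
    dist = np.clip(1.0 - z_sym / self_scores, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    return dist


def cluster_structures(z, n_clusters):
    """Average-linkage clustering of structures into `n_clusters` groups."""
    condensed = squareform(zscores_to_distances(z), checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Hypothetical labels and toy Z-scores for four capsid protein models.
names = ["MCP_1", "MCP_2", "MCP_3", "MCP_4"]
toy_z = np.array([
    [30.0, 20.0, 4.0, 3.0],
    [20.0, 28.0, 5.0, 4.0],
    [4.0, 5.0, 32.0, 22.0],
    [3.0, 4.0, 22.0, 30.0],
])
labels = cluster_structures(toy_z, n_clusters=2)
for name, label in zip(names, labels):
    print(name, "cluster", label)
```

The linkage matrix built inside `cluster_structures` is the same object that `scipy.cluster.hierarchy.dendrogram` accepts, so a tree similar in spirit to panel (b) could be drawn from it.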

AlphaFold-Multimer and Prediction of Multi-Chain Protein Complexes
Originally, AF2 was designed to predict monomeric protein structures. Consequently, interactions between different proteins, subunits and domains in multimers were not described in the AlphaFold database [61]. As a result, some large multi-domain protein complexes may not have been modelled accurately enough. Several publications have, however, explored how AF2 could be used for predicting both homo- and heteromeric complexes [62][63][64]. Moreover, it has been pointed out that the AI system outperforms standard docking methods inasmuch as it does not require starting protein structures [62].
In addition, a number of approaches have been developed to make AlphaFold work well for complex protein structures with multiple binding partners. Recent versions of AF2, such as those incorporated into ColabFold, enable multimeric complexes to be submitted for prediction [63]. They include AlphaFold-Multimer, the extension developed by the DeepMind team, which significantly improves the accuracy of predicting multimeric interactions [19]. This tool is a version of the AlphaFold algorithm that has been specially modified to use multimeric data and trained on oligomeric proteins. However, there is evidence that this multimeric modification has not succeeded in predicting the key features of some protein complexes [65]. Currently, AlphaFold-Multimer does not include the self-distillation of multimer predictions, so its authors believe there is potential for future accuracy enhancements.
To overcome the limitations described above, AF2 can be combined with experimental methods, e.g., cryo-electron tomography, and/or with other computational tools such as RoseTTAFold, which provides more robust results [64,66]. Other authors have suggested combining AlphaFold models of protein complexes with differential covalent labelling mass spectrometry data by applying RosettaDock [67]. The use of cryo-electron microscopy maps, integrated with AlphaFold, for multi-chain protein complex prediction also supports the creation of accurate and reliable models [68].
Other approaches include the use of optimised multiple sequence alignment together with AF2 [69] and the application of a Monte Carlo tree search [70]. The latter works well but only with symmetric protein complexes and when the stoichiometry of the subcomponents is known.

AlphaFill
A study from the Massachusetts Institute of Technology, which mainly focused on the limitations of AF2 in the drug industry [74], showed that using AF2 together with molecular docking simulations to predict protein-ligand binding gave poor performance that, in some cases, was comparable to pure chance. At the same time, this study indicated how prediction accuracy might be improved with the integration of machine-learning-based approaches. The authors of the study expected their research to encourage the development of machine-learning methods that would complement AlphaFold.
AlphaFill is a new tool that has been developed to solve the problem with ligands and cofactors in the AlphaFold protein structure database [75]. AlphaFill uses an algorithm that employs sequence and structure similarity analysis to graft missing molecules and ions from experimental data into predicted protein structures. The algorithm has been successfully validated against experimental structures.

Critique of AlphaFold
AlphaFold has probably revolutionised the determination of protein molecular structure. Today, AF2 is a state-of-the-art deep-learning tool that demonstrates an accuracy in predicting protein folding that was previously unattainable using computational tools. The quality of its predictions is, however, not consistent. Furthermore, in some cases, Artificial Intelligence (AI) systems are unable to provide highly accurate results. As reported by the EMBL's European Bioinformatics Institute, 35% of the more than 214 million AF2 predictions have been found to be very accurate [60], which indicates that its predictions are often not inferior to those obtained experimentally. It should also be pointed out that a further 45% of these predictions could still be used for some applications, in spite of their accuracy being inferior to that of experimentally determined structures. Therefore, although AF2 is an outstanding tool, it is important to consider its limitations to ensure that investigations provide reliable results.
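In practice, the reliability of an individual model can be inspected through the per-residue confidence estimate (pLDDT, on a scale of 0 to 100) that AlphaFold reports with each prediction. The sketch below is a minimal illustration and not part of any cited workflow. It assumes a PDB-format model file that follows the AlphaFold database convention of storing pLDDT in the B-factor column, and it uses the commonly quoted confidence bands (above 90, 70 to 90, 50 to 70, below 50). The function names are hypothetical.

```python
from collections import Counter


def ca_plddt_from_pdb(pdb_text):
    """Read one pLDDT value per residue from the C-alpha ATOM records.

    Assumes fixed-column PDB format with pLDDT in the B-factor field
    (columns 61-66), as in AlphaFold database model files.
    """
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to one of four confidence bands."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def band_fractions(scores):
    """Fraction of residues falling into each confidence band."""
    counts = Counter(confidence_band(score) for score in scores)
    bands = ("very high", "confident", "low", "very low")
    return {band: counts[band] / len(scores) for band in bands}
```

A summary such as `band_fractions(ca_plddt_from_pdb(text))` gives a quick indication of how much of a model is likely to be usable before any downstream structural comparison is attempted.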

Intrinsically Disordered Proteins and Intrinsically Disordered Protein Regions
When AlphaFold encounters difficulties with obtaining highly accurate predictions, the problem very often relates to intrinsically disordered proteins (IDPs) or intrinsically disordered protein regions (IDRs) [71]. AI systems perform excellently when predicting well-folded proteins, but about a third of eukaryotic proteins are intrinsically disordered or contain disordered regions [72]. Moreover, IDPs play an important role in physiological functions, such as in protein signalling networks.
The reason for AlphaFold encountering difficulties when predicting IDRs may be that these proteins and regions are often not solved by X-ray crystallography; AF2 is mainly designed to use X-ray data [62]. There is a database, DisProt, that contains consolidated information on IDPs [72]. If AF2 or another AI system could be tailored so that it can extract conformational features from DisProt or some other experiment-based databases, then this might enable prediction of IDPs/IDRs in the future.
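Regions that AlphaFold cannot place confidently often show up as stretches of low per-residue confidence (pLDDT), and such stretches are sometimes used as a rough first indication of possible disorder. The following sketch locates them in a list of per-residue scores. It is a heuristic illustration only, not a validated disorder predictor: the threshold of 50 and the minimum segment length of 5 residues are arbitrary choices, and the function name is hypothetical.

```python
def low_confidence_segments(plddt, threshold=50.0, min_length=5):
    """Return (start, end) residue ranges, 1-based and inclusive, in which
    every pLDDT value is below `threshold` for at least `min_length` residues."""
    segments = []
    start = None
    for index, score in enumerate(plddt, start=1):
        if score < threshold:
            if start is None:
                start = index  # a new low-confidence run begins here
        elif start is not None:
            if index - start >= min_length:
                segments.append((start, index - 1))
            start = None
    # Close a run that extends to the end of the chain.
    if start is not None and len(plddt) - start + 1 >= min_length:
        segments.append((start, len(plddt)))
    return segments
```

Segments flagged in this way would still need to be checked against experimental evidence, for example entries in DisProt, since low confidence can also arise from sparse sequence alignments rather than genuine disorder.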

Protein Interactions with Metal Ions, DNA, RNA, Cofactors, Ligands and Post-Translational Modifications
Many proteins, such as hemoglobin, can function physiologically only in the form of complexes with various ions and molecules. Such interactions are especially crucial for drug discovery. It is to be expected, therefore, that much of the criticism of AlphaFold relates to the fact that it omits protein-ligand interactions in its predictions [18,73].
AlphaFold is not designed for the prediction of post-translational modifications (PTMs) of proteins, such as protein glycosylation. This fact has attracted the attention of the scientific community, with recent studies demonstrating the relevance and importance of glycosylation in the SARS-CoV-2 spike protein and in human proteins. According to research, between 50% and 70% of the 20,000 predicted human proteins are thought to be glycosylated [113]. Bagdonas et al. suggested that the use of sequence- and structure-based studies might address not only the ligand and cofactor interactions problem but also issues related to PTMs [76]. The authors presented an example of glycosylation to demonstrate the potential of their proposed approach, developing an algorithm integrated into the Privateer software. This tool 'transfers' protein glycosylation from a library of structurally balanced glycan blocks to the protein fold predicted by AlphaFold.

Protein Conformations
Proteins are not static; they take on various structures, depending on their surroundings or the stage in the functional cycle. Conformational changes in proteins are closely related to their functions and regulation. They can be caused by binding to other molecules, by PTMs or by changes in pH or temperature, for example. AlphaFold provides a static picture of protein folding and does not incorporate information about its dynamics [77]. There is also no clarity as to which conformation of the protein will be predicted by AlphaFold [61]. Consequently, this AI system offers only partial information about the key features of the relationship between protein structure and function.
The situation is complicated by the fact that data on these conformations obtained under experimental conditions also have limitations. Nevertheless, at the moment, it seems that predictions of conformations and the dynamics of protein structures are only possible using experimental methods, such as time-resolved crystallography and structural distributions from cryo-EM data [78].

Mutations
According to some studies, it appears that AF2 is unable to predict defects in protein structures caused by mutations [79]. One investigation showed that differences between mutated and wild-type structures predicted by AlphaFold were extremely small [80]. Other researchers have found that it is impossible to obtain a reliable evaluation of the impact of mutation on protein stability with the direct application of AI predictions [81]. Thus, predicting the effect of mutations on protein stability should be carried out as a specific task, although this will be hampered by the limited amount of data available for training deep-learning models.

Database Loopholes
As a deep neural network, AF2 cannot correctly predict entirely novel structures unlike those on which it was trained. It relies on multiple sequence alignments (MSAs) and on experimentally obtained structures stored in databases. Similarly, AF2 also lacks predictive accuracy where fewer sequences are available for alignment [65]. Accordingly, the quality of the AI's performance will depend on how much experimental and previous computational data have been collected and stored in databases. This may be considered less a limitation than an opportunity, given that the more data that are collected, the more accurate predictions will become.

Conclusions
Protein structure modelling is an important task that helps fundamental and applied research in the field of virology. The AlphaFold deep-learning algorithm, which has been proven to be a highly accurate prediction method, can be used in the design of new drugs and in studies of viral pathogens and mechanisms of viral infection. In bacteriophage research, AlphaFold predictions can also be used to model receptor-binding proteins and glycopolymer-degrading enzymes, helping to develop new antibacterials and biocontrol agents.