Convergent evolution of the Hedgehog/Intein fold in protein splicing

The widely used molecular evolutionary clock assumes the divergent evolution of proteins. Convergent evolution has been proposed only for small protein elements but not for an entire protein fold. We investigated the structural basis of the protein splicing mechanism of class 3 inteins, which is distinct from that of class 1 and 2 inteins. We gathered structural and mechanistic evidence supporting the notion that the Hedgehog/INTein (HINT) superfamily fold, commonly found in protein splicing and related phenomena, could be an example of convergent evolution of an entire protein fold. We propose that the HINT fold is a structural and biochemical solution for trans-peptidyl and trans-esterification reactions.

cyclization, and (4) S(O)-N acyl shift (Fig. 1a) 10 . Inteins undergoing the canonical splicing mechanism are referred to as class 1 inteins (Fig. 1a) 11 . Not all of the four steps are exploited among all Hedgehog/INTein (HINT) superfamily members, which all share the same flat horseshoe-like HINT fold and catalyze protein splicing as well as related reactions (Fig. 1b) 8,12 .
For example, the C-terminal domain of the Hedgehog protein (Hh-C or hog domain), a member of the HINT superfamily, uses the first step, the N-S acyl shift, for cholesterol modification of the N-terminal signaling domain (Hh-N) 8,12 . Bacterial Intein-Like (BIL) domains lack the nucleophilic +1 residue of inteins that is essential for the trans-esterification step in the protein-splicing reaction and produce predominantly cleaved products 13 . Some inteins do not undergo the canonical splicing reaction of class 1 inteins. Inteins without the first nucleophilic residue required for the initial N-S(O) acyl shift step were originally termed class 2 (Fig. 1a) 14 Fig. 1a) 11,15,16 . Class 3 inteins are thus classified as a class of inteins distinct from class 2 inteins.
Whereas the first residue of class 1 inteins can be cysteine or serine, the C-terminal nucleophilic residue at the +1 position of inteins is usually cysteine, serine, or threonine (Fig. 1a). Although the penultimate histidine residue and the histidine residue in block B are highly conserved among many inteins, several inteins lack them and remain capable of catalyzing protein splicing through compensatory mutations 19,20 . Inteins catalyzing protein splicing are thus unique single-turnover enzymes that tolerate high sequence variation at the active-site residues, even within the same class of inteins. Inteins do not have strict requirements for the active-site residues; instead, they use slightly different protein-splicing mechanisms enabled by compensatory mutations.
Members of the HINT superfamily have been considered to have evolved from a common ancestor by divergent evolution. Although the HINT fold can be easily detected based on sequence homology, significant deviations in the active-site-residue combinations have been observed at all critical residues 16,18 . How have inteins escaped degradation despite providing no apparent benefit to their host organisms? How did they evolve different splicing mechanisms despite the low sequence conservation and high variation of catalytic residues? In this work, we address these questions by revealing the structural basis of the protein splicing mechanism of class 3 inteins using crystal structures, molecular dynamics simulation, and structure-based protein engineering. We propose that the HINT fold could be an effective structural solution for the protein-splicing reaction and an example of protein structure convergence evolved from different ancestral proteins.

Results
To understand the class 3 intein splicing mechanism, we set out to determine the structures of class 3 inteins. We first found that the DnaB1 intein from Mycobacterium chimaera  Table 1). The MchDnaB1 intein structure shares the typical HINT fold of class 1 and class 2 inteins, which is in line with the previous report of a class 3 intein structure (Fig. 1b and 1c) 12,23 . Thus, the class 3 MchDnaB1 intein cannot be distinguished from class 1 and class 2 inteins by backbone conformation alone, because the insertions and deletions observed among inteins easily mask any differences (Fig. 1c) 21 . We found that the most striking feature in the structures of the MchDnaB1 intein is the active site, which closely resembles the catalytic triad of serine/cysteine proteases. The observed distance (5.5-5.7 Å) between the Sγ and Nδ atoms in the MchDnaB1 inteins is slightly longer than in typical cysteine proteases (3.8-4.0 Å) (Fig. 2a and 2b) 22 . The WCT motif found in the class 3 intein participates in forming the catalytic triad, in which C124, H65, and T143 could serve as nucleophilic, basic, and acidic functional groups, respectively (Fig. 2a and 2b). Importantly, we could observe a large electron density near the side chains of C124 and H65 and the backbone of residue 125 in the crystal structures of both the MchDnaB1_HN and MchDnaB1_HAA inteins (modeled as oxyanion waters in Fig. 2a and Supplemental Fig. S2). This electron density could mark the oxyanion hole, which is commonly observed in the crystal structures of serine/cysteine proteases and stabilizes the tetrahedral reaction intermediate (Fig. 2a and 2b) 22 . In the class 3 intein structure, Thr143 serves as the protonating acidic residue instead of the aspartic acid in the typical Ser-His-Asp catalytic triad of serine proteases. The weaker acidity of Thr compared to Asp might not only lower the nucleophilicity of Cys124 but also increase the distance between His65 and Cys124.
However, inteins are single turnover enzymes requiring only one splicing reaction per molecule, rendering high reactivity redundant. Thus, the Cys-His-Thr catalytic triad in MchDnaB1 intein could be sufficient for creating the acyl-enzyme intermediate similar to one found in many serine/cysteine proteases.
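The distance comparison above can be reproduced from any set of atomic coordinates. The following minimal Python sketch is not the authors' analysis; it only illustrates how an Sγ-Nδ distance could be computed and compared with the two ranges quoted in the text. The coordinates `sg` and `nd` are hypothetical placeholders, not values taken from the deposited structures, and the helper names are assumptions made for illustration.

```python
import math

# Distance ranges quoted in the text, in angstroms.
MCHDNAB1_RANGE = (5.5, 5.7)       # Sγ-Nδ distance observed in the MchDnaB1 inteins
CYS_PROTEASE_RANGE = (3.8, 4.0)   # typical cysteine proteases

def distance(a, b):
    """Euclidean distance between two (x, y, z) coordinates in angstroms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def in_range(value, bounds):
    """True if value lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high

# Hypothetical coordinates for Cys Sγ and His Nδ (illustration only).
sg = (0.0, 0.0, 0.0)
nd = (3.3, 3.3, 3.1)

d = distance(sg, nd)
print(f"Sγ-Nδ distance: {d:.2f} Å")
print("within MchDnaB1 range:", in_range(d, MCHDNAB1_RANGE))
print("within cysteine protease range:", in_range(d, CYS_PROTEASE_RANGE))
```

With real data, the two coordinates would be read from the ATOM records of the relevant Cys and His residues in the coordinate file.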

Self-cleavage activity and inhibition of class 3 inteins by protease inhibitors
Both variants of the MchDnaB1 intein were produced for crystallization as N-terminal SUMO fusion proteins, resulting in the N-terminal "SVGK" extein sequence after Ulp1 protease treatment to remove the SUMO fusion tag. However, the crystal structures of both MchDnaB1 intein variants (HN and HAA at the C-terminus) lacked electron densities for the N-extein sequences. This observation is apparently due to self-cleavage at the N terminus during crystallization (N-cleavage) 23 . We also confirmed the N-cleavage activity in vitro by incubating Due to its small size, H2O2 could easily access the oxyanion hole, thereby oxidizing Cys124, while PMSF may be sterically hindered in accessing the active-site cysteine residue, as inteins process an intramolecular substrate. These observations corroborate the notion that a class 3 intein might utilize a catalytic triad similar to that of serine/cysteine proteases for producing the acyl-enzyme intermediate. While most inteins are generally auto-catalytically spliced out immediately after protein translation, the MCM2 intein from Halorhabdus utahensis is inactive under low salinity but can be activated at a high salt concentration 26 . To further verify the class 3 splicing mechanism, we used the salt-inducible HutMCM2 intein to test the effect of H2O2 on the N-cleavage of a class 1 intein in an in vitro model 26 . We found that H2O2 did not inhibit N-cleavage of the salt-inducible class 1 intein under high-salt conditions, further supporting the protease-like acyl-enzyme intermediate for the class 3 splicing mechanism (Supplemental Fig. S4).

Conversion of a class 1 intein into a class 3 intein
BIL domains, additional members of the HINT superfamily, predominantly produce N- and C-cleaved products in contrast to protein-splicing domains. BIL domains have probably evolved from inteins divergently 13,27 . We and others previously demonstrated the reverse engineering of BIL domains into efficient cis-splicing domains 27,28 . This simple conversion from a BIL domain into a protein-splicing domain implies divergent evolution of BIL domains from an ancestral intein by genetic mutations. Likewise, class 2 inteins lacking Ser or Cys at the N terminus could also efficiently splice after replacement of Ala at the +1 position by Cys or Ser, suggesting a clear evolutionary connection to class 1 inteins 14 .
To examine the divergent evolution of class 3 inteins from class 1 inteins as previously demonstrated with class 2 intein and BIL, we tested the conversion of a class 1 intein into a class 3 intein. Grafting the unique WCT motif found in class 3 inteins to a class 1 intein with the first Cys/Ser to Ala mutation could result in a functional cis-splicing intein if they were related by a divergently evolved lineage. We chose the class 1 gp41-1 intein as a model because it already has Thr at the position corresponding to the WCT motif of class 3 inteins and the 1.0 Å-resolution crystal structure is available, facilitating the WCT motif engineering 29 . We grafted the WCT motif to the gp41-1 intein based on the amino-acid sequence alignment (Fig. 3a).
However, the engineered class 3 gp41-1 intein (gp41-1_WCT) produced predominantly the C-cleaved product and only a minute amount of the possible splicing product. This result shows that a class 3 intein requires compensatory mutations in addition to the WCT motif for productive protein splicing (Fig. 3b). To better understand the structural basis for the nonproductive splicing of the engineered class 3 intein, we solved the crystal structure of gp41-1_WCT at 1.85 Å resolution (Fig. 3c and 3d). Unlike in the crystal structures of in which protein-splicing variants were created by simple mutations. This reverse engineering suggests that a class 3 intein requires compensatory mutations in addition to the WCT motif to be proficient in protein splicing. Such simultaneous compensatory mutations in class 1 or 2 inteins together with the WCT motif are improbable according to the current survival model of inteins, which are usually inserted near the active site of enzymes essential for host organisms 14,27,28 . A plausible alternative explanation for the emergence of class 3 inteins is that they have gone through a unique evolutionary pathway different from other HINT members.

The active site of the MchDnaB1 class 3 intein
Despite sharing the same HINT fold, class 3 inteins appear to utilize a very different approach for the same protein splicing reaction in contrast to other members of the HINT superfamily 10,12,16 . Available intein structures containing the extein sequences, except for the two coordinate sets of SceVMA and PhoRadA inteins, typically have large distances (∼8-9 Å) between the N-scissile peptide and the nucleophilic side chain of the +1 residue that is responsible for the second step, namely trans-esterification 30,31,32 . These longer distances suggest the necessity of substantial conformational changes for class 1 inteins during protein splicing. We observed electron density for both the gauche+ and trans-like conformations of intein variants might suggest that it could be a driving force for the splicing reaction in class 3 inteins.
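Side-chain conformations such as those mentioned above are usually described by the χ1 dihedral angle, which for cysteine is defined by the atoms N-Cα-Cβ-Sγ. The following minimal Python sketch, which is not part of the original analysis, shows how χ1 could be computed from four atomic positions and assigned to the nearest staggered angle. All coordinates are hypothetical, and because the naming of gauche+ and gauche- differs between conventions, the helper reports only the nearest staggered angle (-60°, 60°, or 180°, with 180° corresponding to trans) rather than a rotamer name.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points.

    For chi1 of cysteine the points would be N, CA, CB, SG.
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

def nearest_staggered(chi_deg):
    """Return the staggered angle (-60, 60, or 180) closest to chi_deg."""
    def gap(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    return min((-60.0, 60.0, 180.0), key=lambda c: gap(chi_deg, c))

# Hypothetical positions (illustration only): a 90-degree dihedral.
n, ca, cb, sg = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)
chi1 = dihedral_deg(n, ca, cb, sg)
print(f"chi1 = {chi1:.1f} deg, nearest staggered angle = {nearest_staggered(chi1):.0f} deg")
```

Applied to each alternate conformation of a residue, the same function would give one χ1 value per conformer.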

The catalytic mechanism of class 3 inteins
Based on biochemical and structural data as well as MD simulations, we propose the catalytic mechanism of class 3 inteins, as depicted in Fig. 4c. At the pre-splicing state, Cys124 is at the high-energy (unfavorable) trans-like conformation and is weakly deprotonated by His65. The

Discussion
One protein fold may serve as a common scaffold for many functions. For example, the eight-fold (βα) barrel structure, known as the TIM barrel, is the most common protein fold and is utilized by many different enzymes with very diverse amino-acid sequences 33 . Whereas a specific protein fold might not be a prerequisite for the function of a protein, the catalytic triad found in proteases is often considered a prime example of convergent evolution 2 . This convergent evolution is assumed because it is unlikely that two proteins evolving from a common ancestor could have retained similar active-site structures while other structural features have completely changed 1 . Many serine/cysteine proteases, such as chymotrypsin/trypsin, share the two-barrel motif as the core, presumably a result of gene duplication (Fig. 5) 34 . The acid-histidine-nucleophile catalytic triad motif of serine/cysteine proteases is located at the interface of the two β-barrels and is considered to be the result of convergent evolution 3 . Even though the common horseshoe-like fold of the HINT superfamily members, including inteins, does not have two distinct β-barrels, the HINT fold contains two subdomains related by pseudo-C2 symmetry 12 . This symmetry relation is considered to be the result of gene duplication, fusion, and loop-swapping events 12,34 . The catalytic triad formed by Cys124-His65-Thr143 in the class 3 MchDnaB1 intein is analogously split between the two subdomains and located at the interface. The catalytic triad at the interface of the two subdomains of the HINT fold resembles the common catalytic triad of serine/cysteine proteases, including the oxyanion hole that stabilizes the tetrahedral intermediate during catalysis (Fig. 2a). Since peptide bond formation is the reverse reaction of peptide hydrolysis, it is not surprising that protein splicing uses the same mechanism as cysteine proteases, involving a tetrahedral intermediate.
Indeed, several peptidases have been used for trans-peptidase reactions 35,36 . Inteins tolerate a vast array of variations at the active site for protein splicing, leaving the N-terminal Ser/Cys and C-terminal Asn/Gln/Asp as the only omnipresent amino-acid residues among class 1 inteins, because even the highly conserved histidine in block B and the penultimate histidine are substituted in several inteins 19,20 . These conserved residues can be further reduced to the C-terminal Asn for class 2 inteins, while protein splicing activity is retained through different combinations of the catalytic residues and compensatory mutations. One way to explain the extremely high tolerance of the active sites of inteins is that the HINT fold is the critical structural solution enabling peptidyl transfer reactions. In the HINT fold, the enzymes (inteins) and substrates (exteins) are covalently connected as single precursor molecules, thereby working as single-turnover enzymes. Inteins do not involve any substrate-association step. The covalent linkage to their substrates could also facilitate the accommodation of different amino-acid types at the active-site residues among the HINT superfamily compared with other enzymes.
The critical role of the HINT fold is to bring the acyl-(thio)ester intermediate and the nucleophilic residue from the C-extein close together, at the precise position and timing required for protein splicing. We gathered evidence suggesting that class 3 inteins might have evolved through a different pathway than class 1 and 2 inteins, possibly related to serine/cysteine proteases originating from prophages, because class 3 inteins have a clear monophyletic distribution and an inactive class 3 intein sequence was found within a pseudogene 16,17 . We revisited the question of what the possible common ancestral protein of other members of the HINT superfamily would be. We searched the Protein Data Bank (PDB) with the BIL coordinates (2lwy) 27 using the DALI server 37 and identified possible ancestral domains corresponding to the C2-related pseudo-symmetry subdomain in the HINT fold (Supplemental Table S2). Despite their low Z-scores (2.5-2.7), we noticed structural homology to translation initiation factor 5A (1bkb) 39 , eukaryotic translation initiation factor 5A2 (3hks) 40 , and elongation factor P (1ueb) 41
In summary, we identified possible convergent evolution of the HINT fold in protein splicing by deconvoluting the existing intein structures and their reaction mechanisms into possible ancestral proteins with distantly related origins. Despite the identical HINT fold, the protein-splicing mechanisms seem to have widely diverged, which cannot be explained by a divergent evolution model based on random mutations, because inteins would require several concurrent compensatory mutations for their survival. The extremely high diversity of the active-site residue combinations found in protein splicing could be reminiscent of independent evolutionary pathways originating from distantly related ancestral proteins shared with proteases and translation initiation factors, yet leading to the same structural solution, i.e., the HINT fold.
We propose that the HINT fold is an effective structural and biochemical solution for trans-peptidyl reactions and the first example of structural convergence of a whole protein.

Deconvoluting functional mechanisms and ancestral structural protein domains might assist in identifying further examples of structural convergence of various other protein folds.

Cloning of class 3 intein expression vectors
The gene encoding the MchDnaB1 intein () was amplified from the genomic DNA of PCR product was ligated into pBHDuet37 29   During the two IMAC purification steps, proteins were dialyzed against the following buffers:

Protein cis-splicing tests
To assay protein cis-splicing, the vectors encoding the inteins MsmDnaB1 (pHBDuet060), DraSnf2 Δ131 (pHBDuet057) were expressed in E. coli strain T7 Express (New England Biolabs, Ipswich, USA) as described in section "Proteolytic inhibition assays". Proteins were purified and analyzed as described above.  Table 1). The structure was solved by molecular replacement using PHASER with the MsmDnaB1 intein (6bs8) as a search model 44,23 . The model was built using PHENIX AutoBuild, manually corrected with COOT, and refined using PHENIX 46 . The final model consists of two molecules in the asymmetric unit. The four residues of the sequence SVGK preceding Ala1 of the intein were clearly missing in the electron density. A loop region between residues Ser91-Leu104 (chain A) and Gly90-Leu104 (chain B) was not modeled due to insufficient electron density. The electron density for the side chain of Cys124 suggested that it was oxidized, and it was therefore modeled as S-oxy cysteine (Csx). Alternate conformations were modeled for Thr15, Asp19, Arg46, and Csx124 (chain A) and Cys124 and His144 (chain B). The final model includes one Cl- ion originating from the crystallization buffer. The structure was validated using MolProbity (score 1.07, 100th percentile) 47 .
MchDnaB1_HAA crystals were obtained as described above using concentrated protein (13 mg/mL) after adjusting the DTT concentration to 10 mM and mother liquor (100 mM Tris-HCl  Table S1). The structure was solved by molecular replacement using PHASER with the MchDnaB1_HN structure as a search model 44 . The structure model was built using ARP/wARP, manually corrected using COOT, and refined using PHENIX 45,46,49 47 .
Data were collected at beamline I04 at Diamond Light Source (Didcot, UK) equipped with a Pilatus detector and 1.85 Å (Supplemental Table S1). The structure was solved by molecular replacement using PHASER with the gp41-1 intein (6qaz) as a search model 44 . The structure model was built using PHENIX AutoBuild, manually corrected with COOT and refined using PHENIX 46,45 . The entire protein chain (one molecule in the asymmetric unit) could be traced in the electron density without breaks for all 128 residues except for the first Ser residue. A noncanonical cis peptide bond was modeled between Lys87 and Glu88, which is also found in the search model. Alternate conformations were modeled for Leu25, Ser28, Val38, and Ser46.
Additional density was observed for the side chain of Cys83, indicating oxidation, and the residue was modeled as 3-sulfinoalanine (Csd). The structure was validated using MolProbity (score 1.28, 99th percentile) 47 .

Molecular Dynamics Simulation
We residues in a loop region (see above). After modeling the missing residues with MODELLER software 49 , the crystal structures were used as the starting structure for the simulation without the N-terminal residues. The four-residue N-extein ("SVGK") was modeled on the structure to generate the initial structure for the MD simulation with the N-terminal residues using the MODELLER software 49 . The crystal structure of gp41-1_WCT (6riz) contained all the residues, including the N-extein part of the "GG" sequence, and it was used as the starting structure for the simulation with the N-extein part. The initial structure of the gp41-1_WCT simulation without the N-extein part was derived by removing the first two glycine residues from the crystal structure.
The MD simulations were performed using Gromacs 2018 software 51

Figure 5 (labels recovered from the figure): Path B; Common ancestor; Gene duplication; Domain swapping; Lethal to Host & Low probability