Comparative Analysis and Ancestral Sequence Reconstruction of Bacterial Sortase Family Proteins Generates Functional Ancestral Mutants with Different Sequence Specificities

Abstract: Gram-positive bacteria are some of the earliest known life forms, diverging from gram-negative bacteria 2 billion years ago. These organisms utilize sortase enzymes to attach proteins to their peptidoglycan cell wall, a structural feature that distinguishes the two types of bacteria. The transpeptidase activity of sortases makes them an important tool in protein engineering applications, e.g., in sortase-mediated ligations or sortagging. However, due to relatively low catalytic efficiency, there are ongoing efforts to create better sortase variants for these uses. Here, we use bioinformatics tools, principal component analysis and ancestral sequence reconstruction, in combination with protein biochemistry, to analyze natural sequence variation in these enzymes. Principal component analysis on the sortase superfamily distinguishes previously described classes and identifies regions of relatively high sequence variation in structurally conserved loops within each sortase family, including those near the active site. Using ancestral sequence reconstruction, we determined sequences of ancestral Staphylococcus and Streptococcus Class A sortase proteins. Enzyme assays revealed that the ancestral Streptococcus enzyme is relatively active and shares similar sequence variation with other Class A Streptococcus sortases. Taken together, we highlight how natural sequence variation can be utilized to investigate this important protein family, arguing that these and similar techniques may be used to discover or design sortases with increased catalytic efficiency and/or selectivity for sortase-mediated ligation experiments.


Introduction
Gram-positive bacteria accounted for 76% of all bloodstream infections in 2000, up from 62% in 1995 [1]. Although varied by region and over time, these numbers have stayed relatively consistent for the past 20 years [2-4]. These organisms are defined in part by their thick peptidoglycan layer as compared to gram-negative bacteria, from which they diverged roughly 2 billion years ago [1,5,6]. Sortase enzymes are critical for the ability of gram-positive bacteria to attach proteins to the cell exterior, as well as to build pili [7-10]. Due to this activity, sortases are a potential therapeutic target for antibiotic development, and they are actively used tools for protein engineering [11,12]. Several of the infections mentioned above are caused by pathogenic Staphylococci and Streptococci, e.g., Staphylococcus aureus and epidermidis, and Streptococcus pneumoniae, pyogenes, and agalactiae [1]. Therefore, a greater understanding of proteins from these organisms may prove valuable in the fight against gram-positive bacterial infection.
There are six main classes of sortase (Classes A-F); the first-characterized and best-studied bacterial sortase is the Class A sortase from Staphylococcus aureus (saSrtA) [13]. This enzyme recognizes the Cell Wall Sorting Signal (CWSS) sequence LPXTG, where X = any amino acid. Following cleavage of the initial protein target, an acyl-enzyme intermediate is formed. A secondary substrate then acts as a nucleophile, and a final ligation product is generated [9]. Peptidase activity occurs between the Thr and Gly residues, and positions are defined as P4 = Leu, P3 = Pro, P2 = X, P1 = Thr, and P1′ = Gly. Other Class A sortases, e.g., Streptococcus pyogenes SrtA (spySrtA), are predicted to contain a closely related recognition mechanism, and our group recently showed that recognition of the P1′ residue is partially mediated by residues in the β4-β5 and β7-β8 loops, highlighting the importance of these conserved structural features [14,15].
The catalytic activity of sortases makes them an exciting tool in protein engineering, where sortase-mediated ligation (SML) or sortagging techniques are commonly employed to create a variety of products. Recent examples, amongst many others, include the development of an in vivo assay using engineered saSrtA to label amyloid-β protein in human cerebrospinal fluid and the implementation of ligation site switching to allow assembly of multiple fragments using a single sortase enzyme [11,16-18]. Despite their uses, sortagging applications are hindered by the poor relative enzymatic efficiency of saSrtA and other naturally occurring sortases studied to date [19-21]. Directed evolution studies performed in 2011 were successful in generating a saSrtA pentamutant (P94R/D160N/D165A/K190E/K196T) with an overall catalytic efficiency increase >100-fold [21]. Engineering of additional variants of saSrtA and other Class A sortases is an area of ongoing work. An example includes the incorporation of two additional mutations to the saSrtA pentamutant at the calcium-binding site, which led to a calcium-independent saSrtA heptamutant [20,22-24]. Other studies use directed evolution or other engineering techniques to alter the substrate specificity of saSrtA, e.g., a recent study that reported an saSrtA variant which recognizes an LMVGG substrate motif in the amyloid-β protein [17].
Variation in substrate selectivity also naturally exists amongst bacterial sortases. Although saSrtA is selective for the LPXTG target sequence, this is not true of all Class A sortases. Work from our group and others revealed that other Class A sortases can recognize a variety of amino acids at multiple positions [14,15,25,26]. A complete understanding of the selectivity determinants of these alternate preferences is lacking. Furthermore, there are six known classes of sortases (A-F). Many of these classes share a similar recognition motif with Class A sortases, including Classes C-F (Class C: [I/L][P/A]XTG; Class D: LPNTA; Class E: LAXTG; Class F: less is known, but it is likely similar to SrtA, LPXTG) [27,28]. However, the recognition motif of Class B sortases is NP[Q/K]TN [27]. Taken together, we hypothesize that investigating sequence variation of individual classes of sortases, as well as the sortase superfamily, may identify sortases with improved catalytic efficiency and/or unique recognition motifs.
Ancestral sequence reconstruction (ASR) is a powerful technique that combines our growing knowledge of the proteomes of extant organisms with statistical methods in order to predict the sequences of ancestral proteins [29]. These ancestral proteins can then be characterized, providing evolutionary clues to sequence-function relationships in a growing number of protein systems, including classic models, e.g., recent work on the origin of cooperativity in hemoglobin [30]. A number of studies suggest that ancestral proteins are less selective for target ligands and more thermostable than extant sequences [31]. Therefore, we propose that ASR can be used as a method for identifying improved sortase sequences for protein engineering.
Here, we used principal component analysis (PCA) and ASR to study the sortase superfamily and Class A sortase sequence variation, respectively. Using PCA, we show that the main source of natural variation within sortase families occurs in a number of structurally conserved loops near the active site. Using ASR, we characterized ancestral proteins of the genera Staphylococcus and Streptococcus. While our ancestral Staphylococcus protein revealed lower relative activity than saSrtA, the ancestral Streptococcus enzyme had the second-highest activity of the four Streptococcus SrtA proteins studied in similar experiments [14,15]. Interestingly, the ancestral Streptococcus SrtA showed markedly increased activity and P1′ promiscuity, as compared to its extant S. pneumoniae relative [14,15]. Although ancestral sortases from nodes that included multiple genera were expressed and purified, these enzymes were catalytically inactive, due to a number of potential factors. Overall, our work suggests that the ancestral Streptococcus protein was relatively more active as compared to its extant relatives and that the ASR technique provides a viable approach for exploring sequence variation in sortases from the same genera.

Principal Component Analysis (PCA) of Bacterial Sortases
In order to gain a better understanding of global sequence patterns in the sortase superfamily, we used PCA to group and analyze 39,188 sortase sequences from all classes. This work builds on recent studies that utilized a sequence similarity network to classify sortases [27]. Briefly, we downloaded all sequences annotated as "sortase" from UniProt and aligned them by MAFFT [32,33]. The amino acids in each sequence were then classified by five parameters: hydrophobicity, disorder propensity, molecular weight, charge, and occupancy (defined as a binary value, where 1 = amino acid and 0 = insertion or deletion (indel) at this position) [34,35]. PCA was then performed on the resulting matrix. For visualization purposes, this data was projected onto the first three principal components, which describe 42.7% of the total variance (Figure S1a). Additionally, we performed Hierarchical Gaussian Mixture Model clustering of the sortase superfamily, as described in the Materials and Methods. On the entire principal component space, we hierarchically fit a two-Gaussian mixture model to the data until each subcluster reached a minimum size or the Gaussian mixture modeling process failed to identify two distinct Gaussians [36]. The resulting tree from this process can accurately distinguish the known sortase classes, as well as extract small subclusters of sortases and present them in a readable manner (Figure 1a). We also plotted our PCA using the top three principal components (Figure S1b). For visualization, we ran PCA on a subset of the data, including 9427 sequences that were filtered for low numbers of indels and manually verified (Figure 1b).
This analysis verified previous classifications of sortases based on sequence alignment, network, and phylogenetic tree analyses [27,28,37]. For example, principal component 1 (PC1) separates the sortase F proteins from the rest of the superfamily and PC2 captures the separation between sortase B and the other sortase families, as well as sortase E and sortase A. These analyses allowed us to identify the regions of highest variability within each class based on the parameters defined above. We plotted our data onto previously determined sortase A structures by taking the distance from the centroid for each position in the multiple sequence alignment (Figure S1c). Consistent with expectations, we found that secondary structure elements are highly conserved, including the "sortase fold" β-barrel core and class-specific α-helices (Figures 1c-d and S1d). Additionally, PCA revealed that the highest degree of variability occurs in structurally conserved loops adjacent to the substrate recognition pocket (Figures 1c-d and S1d).
Given that the 6-7 loop has been shown to be intimately involved in sortase substrate recognition in Staphylococcus aureus SrtA (saSrtA), we were intrigued that PCA revealed similar levels of variability in the 4-5 and 7-8 loops [38].In the case of 7-8, we were also motivated by previously reported mutations in the 7-8 loop of saSrtA that have been shown to dramatically modulate sortase reaction rates [8,21,39].Indeed, our work confirms that the 7-8 loop dramatically affects the activity and substrate specificity of a sortase with narrow substrate tolerance, e.g., saSrtA, versus those that are more promiscuous, e.g., the Streptococcous SrtA proteins from S. pneumoniae, S. pyogenes, and S. agalactiae [14,15].

Ancestral Sequence Reconstruction of Class A Sortases
Building off our PCA analysis of sortase families, we wanted to further explore sequence space in these enzymes by performing ancestral sequence reconstruction (ASR) on Class A sortases. As detailed in the Materials and Methods, ultimately 400 sequences, including 7 SrtB sequences used to anchor the resulting phylogenetic tree, were used for the sequence reconstruction (Figure 2). Initially, we chose to characterize ancestral proteins at ancestral nodes for two SrtA genera with well-characterized family members, Staphylococcus and Streptococcus (Figure 2). We will refer to these proteins as ancStaphSrtA and ancStrepSrtA, respectively. Multiple sequence alignments of our ancestral proteins with representative extant sequences reveal approximate values of 78.3% identity for ancStaphSrtA with another Staphylococcus SrtA sequence, and 69.5% identity for ancStrepSrtA with another Streptococcus SrtA sequence (Table 1, Figure 3). These are similar values to pairwise sequence alignment identities for most of the other representative extant sequences chosen (Table 1, Figure 3). These values represent an average of 30.67 and 49.67 mutations over the aligned regions for ancStaphSrtA and ancStrepSrtA with representative extant sequences, respectively. To characterize the ancestral proteins, we recombinantly expressed and purified these enzymes and ran activity assays, as described in the Materials and Methods. Wild-type proteins were expressed and purified as previously described [14]. All protein sequences are included in the Supplemental Information. Briefly, activity was assessed using a FRET-based assay utilizing probes with a 2-aminobenzoyl fluorophore (Abz) and a 2,4-dinitrophenyl quencher (Dnp) on either side of a substrate motif, e.g., Abz-LPATGG-K(Dnp), where the first Gly following Thr is the P1′ residue [8,14,15,26,40]. We varied the P1′ position to all amino acids and tested the two proteins (Figure 4). Similar data for S. pneumoniae SrtA (spSrtA) was previously reported [14]. Consistent with known target sequences and our previous work, our results revealed that although its activity was reduced >2-fold, ancStaphSrtA was similar to saSrtA and remained selective for a P1′ Gly residue [14,25,26]. In contrast, ancStrepSrtA could recognize several amino acids at the P1′ position, showing increased promiscuity, specifically Ala, Cys, Asn, and Ser, in addition to Gly [14,15,25,26]. As compared to spSrtA, the activity of ancStrepSrtA was increased ~2-3-fold. In comparison to our previous work on the in vitro activity of other Streptococcus SrtA proteins, ancStrepSrtA also appeared to be more active than S. agalactiae SrtA (sagSrtA), but less active than S. pyogenes SrtA (spySrtA) [15].
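The sequence identities quoted in this section and in Table 1 are ratios of identical residues to aligned positions. A minimal Python sketch of that arithmetic follows. The function name `pairwise_identity` and the toy alignment are hypothetical, and skipping columns that contain a gap in either sequence is an assumed convention that may differ from the one used for Table 1.

```python
def pairwise_identity(aligned_a, aligned_b):
    """Return (identical, compared, percent) for two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped, so the
    denominator is the number of aligned residue pairs (assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    identical = sum(1 for a, b in pairs if a == b)
    percent = 100.0 * identical / len(pairs) if pairs else 0.0
    return identical, len(pairs), percent


# Hypothetical toy alignment, not taken from the sequences in this study.
ident, total, pct = pairwise_identity("LPATG-KV", "LPSTGAKI")
```

The number of mutations over the aligned region is then `total - ident`, which is the quantity averaged over representative extant sequences in the text.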

Structural Analyses of Ancestral SrtA Proteins
In order to better interpret our biochemical data, we used homology modeling to predict the structures of our ancestral proteins [41-43]. The template structure used for ancStaphSrtA was saSrtA-LPAT* (PDB ID 2KID), as this is the only known saSrtA structure in the active conformation, i.e., including Ca2+, to our knowledge [44]. For ancStrepSrtA, we used a structure of spySrtA bound to the LPATS target sequence (PDB ID 7S40) as the template. Next, we compared our ancestral proteins to the wild-type SrtA enzymes used in our activity assays, highlighting variant residues (Figure 5a,b). Comparison of ancStaphSrtA to saSrtA (PDB ID 2KID) revealed very few changes near the active sites of these proteins (Figure 5a). The most concentrated amino acid variations in the vicinity of the LPAT* peptidomimetic are in the β6-β7 loop, with 6 differences amongst 16 residues. Notably, the saSrtA loop is 17 residues in length, so the ancStaphSrtA loop is truncated by one amino acid. This loop has previously been implicated in selectivity differences for saSrtA, suggesting that this loop variation may contribute to the over two-fold lower activity we see in ancStaphSrtA as compared to saSrtA (Figure 4) [38].
In our analysis of ancStrepSrtA and spSrtA, we used a previously generated homology model of spSrtA, as the only available structures are of a domain-swapped dimer whose activity has yet to be confirmed (Figure 5b) [14]. Notably, alignment of our homology model with the predicted structure from the AlphaFold Protein Structure Database reveals a root-mean-square deviation (RMSD) for main chain atoms of 0.501 Å (489 atoms), with the largest amount of variation in the β6-β7 and β7-β8 loops (Figure S2) [45,46]. Although these sequences are less similar than the Staphylococcus proteins, we again observe relatively few amino acid substitutions in residues that directly interact with the ligand (Figure 5b). Here, we used the LPATS peptide from spySrtA-LPATS (PDB ID 7S40) for reference. We do observe amino acid variants in the β7-β8 loop residues of ancStrepSrtA as compared to spSrtA that may explain the increased activity of the ancestral protein. As we have previously described, an interaction between the β6 −2 (or two residues from the C-terminus of the β6 strand) R184 and two residues in the spSrtA β7-β8 loop, β7-β8 +1 (or 1 residue C-terminal to the catalytic Cys) E208 and β7-β8 −1 (or 1 residue N-terminal to the catalytic Arg) E214, weakens the overall activity of spSrtA [14]. In contrast, spySrtA does not contain this interaction and shows much higher relative activity [15]. AncStrepSrtA contains a β7-β8 +1 Thr, β7-β8 −1 Gln, and β6 −2 Val, suggesting that this interaction is also not present in this protein (Figure 5c). We do, however, observe that ancStrepSrtA likely conserves the two favorable interactions previously described that are mediated by β7-β8 loop residues, including an intra-loop hydrogen bond between β7-β8 +2 Asp and β7-β8 +6 Thr, as well as a hydrophobic interaction between the β7-β8 +3 Tyr residue and β4-β5 +3 Phe (or three residues C-terminal to the catalytic His) (Figure 5c) [14,15].
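For readers unfamiliar with how an RMSD such as the value above is obtained, the following Python sketch computes the RMSD between two matched sets of atomic coordinates after optimal superposition by the Kabsch algorithm. It is an illustration only: the function name `kabsch_rmsd` and the synthetic coordinates are assumptions, and structure viewers typically add steps, such as choosing which atoms to pair, that are not shown here.

```python
import numpy as np


def kabsch_rmsd(coords_a, coords_b):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    a = coords_a - coords_a.mean(axis=0)  # remove translation
    b = coords_b - coords_b.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the 3x3 covariance matrix.
    u, _, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(u @ vt))    # guard against reflections
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = a @ rot - b
    return float(np.sqrt((diff ** 2).sum() / len(a)))


# Synthetic check: a rotated and translated copy should give an RMSD near 0.
rng = np.random.default_rng(1)
atoms = rng.normal(size=(20, 3))
theta = 0.7
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
moved = atoms @ rz + np.array([1.0, -2.0, 0.5])
rmsd_same = kabsch_rmsd(atoms, moved)
```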

Investigating Ancestral Proteins at Distant Nodes
Finally, we wanted to test the activity of ancestral SrtA proteins at more distant nodes in our ASR analyses. We chose three sequences with relatively low sequence identity to ancStaphSrtA and ancStrepSrtA that were also distinct from each other (Figure 2, Table 2). All protein sequences are in the Supplemental Information. We named the proteins for their node characterization in the ASR, ancNode-408, ancNode-503, and ancNode-547. We expressed and purified these proteins as described in the Materials and Methods. Notably, only fractions corresponding to the monomeric peak were retained following size exclusion chromatography, and based on their migration, these proteins are not aggregated and retain a similar radius of gyration as the wild-type proteins (Figure S3). Unfortunately, when evaluated using our FRET-based assay, all three proteins were catalytically inactive for sequences containing P1′ Ala, Gly, and Ser residues. Multiple sequence alignment of the ancestral proteins in this study suggests why these proteins may be catalytically inactive (Figure 6). Specifically, the manual refinement of the multiple sequence alignment used for ASR aimed to reduce numbers of gaps in the overall alignment, thereby optimizing alignments in areas of conserved secondary structure elements, e.g., the eight-stranded β-barrel structure conserved in the characterized sortase fold [9,14]. In doing so, we predict this introduced gaps in the structurally conserved loops near the active site, e.g., the β4-β5 and β7-β8 loops previously mentioned here (Figure 6). The β6-β7 loop appears largely conserved in length, perhaps indicative of a higher degree of length conservation in this structural feature, as well as the β3-β4 loop, which, while spatially more distant from the active site, contains residues previously implicated in ligand recognition (Figure 6) [44]. The loop lengths of the β7-β8 loops of Class A sortases can vary quite dramatically. We previously characterized the β7-β8 loops of several Class A sortases, which varied from 7 residues in Streptococcus proteins to 12 residues in Staphylococcus aureus [14]. These differences are also seen in the β4-β5 loop sequences, where saSrtA contains an additional three residues between the catalytic His residue and a conserved Phe residue as compared to spySrtA. These differences in lengths of critical loops likely hindered our ability to accurately reconstruct ancestral proteins at nodes that included multiple genera as descendants.

Discussion
In this work, we used bioinformatics tools, PCA and ASR, to investigate sequence variation in sortases. We used PCA to identify sources of variation in the sortase superfamily, recapitulating previous work that characterized the different classes based on distinct sequence properties [27,47]. In addition, we found that within each class, the largest variation typically occurs in structurally conserved loops, including those near the active site of the enzymes (Figure 1).
These loops were further implicated as sources of relatively high variation in our ASR biochemical studies. Here, while we were able to express and purify enzymatically active ancestral proteins of the Staphylococcus and Streptococcus genera, ancestral sequences of nodes that combined multiple genera were catalytically inactive. We predict that this is due to truncations in the β4-β5 and β6-β7 loops as a result of manual multiple sequence alignment refinement during ASR (Figure 6). Despite this, our ancStrepSrtA protein was the second-most active Streptococcus protein of four studied using similar activity assays [14,15]. We also find that ancStrepSrtA contains a similar degree of variation, ~30%, as other Streptococcus proteins with each other (Table 1b). For example, spySrtA shares 65% sequence identity with sagSrtA (109/168 residues), and 63% (105/166) with spSrtA. SpSrtA and sagSrtA are 57% identical (95/167). In contrast, spSrtA, spySrtA, and sagSrtA are 34% (34/121), 26% (38/148), and 35% (42/119) identical to saSrtA, respectively.
While ancestral proteins at deep nodes that included multiple genera as descendants were found to be inactive, the fact that they could be expressed and purified using the same methods as those used for extant proteins suggested that the central sortase fold remained intact. Future work to repeat the ASR with careful attention to loop lengths, as well as introduction of extant β4-β5 and β6-β7 loops into ancestral proteins, could provide a means for restoring activity to these enzymes, and may elucidate additional molecular characteristics of the contribution of these individual regions to the activity and selectivity of sortases. Such information would be very useful in future design efforts for sortase enzymes with improved catalytic efficiency or altered specificity. It would also be interesting to perform structural studies on these ancestral proteins, providing insights into potential differences compared to extant proteins with respect to the stereochemistry of target recognition.
There are a number of potential tools that can be used to examine sequence variation in bacterial sortases. Here, we utilized network and evolutionary approaches to investigate natural sequence variation. We argue that with the existence of thousands of sortase enzymes in multiple classes, there is still much to be discovered in extant sortase sequences [27,48]. In addition, directed evolution has proved to be an exciting technique to engineer sortase variation in vitro [17,21,49]. Both approaches, investigating natural sequences, as well as introducing new variation, will allow for a deeper understanding of the sequence determinants of activity and target selectivity, and can profoundly impact the study of the sortase enzyme family, both in protein engineering and for therapeutic uses.

Materials and Methods
Principal component analysis (PCA). Initial sequences were obtained from UniProt and an alignment was generated by MAFFT [32,33]. Initially, each sequence was given a score for the number of gaps present for each residue and the filtered alignment was realigned by MAFFT version 7. Subsequent analysis included all sequences without taking gaps into consideration (Figure 2b vs. Figures 2a and S2c). The sortase multiple sequence alignment (MSA) was converted to a tensor of sequences, characterized by MSA position and chemical property of each amino acid. Each amino acid was associated with 4 biochemical traits and a binary occupancy trait, as described. Each trait was normalized to the range from zero to one. Gapped positions were given an occupancy score of zero; for the other chemical properties, gapped positions received the average value of the matrix column so that they would not contribute to the variance of the column. After translating the MSA, the resulting tensor was flattened into a matrix by stacking the chemical properties and was re-centered so that the matrix had a column-wise mean of zero. Principal component analysis was performed on the matrix by the singular value decomposition algorithm provided in the scikit-learn Python package [50]. Clustering was performed by a Gaussian mixture model provided in the scikit-learn 1.1 Python package [50]. Optimal cluster numbers were scored by Bayesian information criterion. Visualization was performed using a script written in Python with matplotlib. Programs were run using default parameters, unless otherwise noted.
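The encoding and decomposition steps described above can be sketched in a few lines of Python. This is a simplified illustration rather than the analysis script used here: the trait table holds only two properties (Kyte-Doolittle hydropathy and formal side-chain charge) for seven residues instead of the four biochemical traits, the names `TRAITS`, `encode_msa`, and `pca` and the toy alignment are hypothetical, and the decomposition calls NumPy's SVD directly rather than scikit-learn.

```python
import numpy as np

# Illustrative trait table (assumption): Kyte-Doolittle hydropathy and
# formal side-chain charge for a handful of residues. The study used four
# biochemical traits; two are shown here to keep the sketch short.
TRAITS = {
    "A": (1.8, 0.0), "L": (3.8, 0.0), "G": (-0.4, 0.0), "P": (-1.6, 0.0),
    "T": (-0.7, 0.0), "K": (-3.9, 1.0), "D": (-3.5, -1.0),
}


def encode_msa(msa):
    """Encode aligned sequences as a (n_seq, n_traits + 1, n_pos) tensor."""
    values = np.array(list(TRAITS.values()))
    lo, hi = values.min(axis=0), values.max(axis=0)
    n_seq, n_pos, n_traits = len(msa), len(msa[0]), values.shape[1]
    tensor = np.full((n_seq, n_traits + 1, n_pos), np.nan)
    for i, seq in enumerate(msa):
        for j, aa in enumerate(seq):
            if aa == "-":
                tensor[i, -1, j] = 0.0  # occupancy: gap
            else:
                # Scale each trait to the range from zero to one.
                tensor[i, :-1, j] = (np.array(TRAITS[aa]) - lo) / (hi - lo)
                tensor[i, -1, j] = 1.0  # occupancy: residue
    # Gaps take the column mean of each chemical trait, so they add no variance.
    col_mean = np.nanmean(tensor, axis=0)
    return np.where(np.isnan(tensor), col_mean, tensor)


def pca(matrix, n_components=2):
    """PCA by singular value decomposition of the column-centered matrix."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return u[:, :n_components] * s[:n_components], explained[:n_components]


msa = ["LPATG", "LPKTG", "LA-TG", "LPDTG"]  # toy alignment, not real data
tensor = encode_msa(msa)
matrix = tensor.reshape(len(msa), -1)  # stack the traits side by side
scores, explained = pca(matrix)
```

Here `scores` holds the coordinates of each sequence on the leading principal components and `explained` the fraction of variance each component captures, the quantity reported as 42.7% for the first three components of the full data set.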
Ancestral Sequence Reconstruction. Nonredundant sortase sequences were sourced from the NCBI protein database [51]. The Cluster Database at High Identity with Tolerance program (CD-HIT) was used to filter out highly similar (>95% identical) sequences sourced from NCBI [52,53]. An all-vs-all basic local alignment search tool (BLAST) search was used on the remaining sortase sequences, producing a sortase network which informed the assignment of sortase class groups (A-F) by using labeled sortase sequences to assign a class to each grouping [54]. Proteins surrounding the Class A group were selected and an additional round of filtering was performed, where all highly similar proteins (>90%) were filtered out via CD-HIT. The remaining pool of sortase sequences was then subjected to alignment by MUltiple Sequence Comparison by Log-Expectation (MUSCLE), and then manually curated to remove any outlying sequences [55]. Seven Class B sortase sequences (from Streptococcus suis, Streptococcus oralis, Streptococcus pneumoniae, Staphylococcus aureus, Bacillus anthracis, Listeria monocytogenes, and Enterococcus faecalis) were added to anchor the resulting phylogenetic tree. The final alignment contained a total of 400 sequences. SrtA structures sourced from the PDB database were structurally aligned and sequence similarity between structural sequences (via PDB) and sortase sequences from the multisequence alignment (MSA) (via ASR) then informed the true alignment of the MSA. A phylogenetic tree was constructed from the MSA via PhyML 3.0 and ancestral sequences were then generated at each node via multi-channel access XML (maxml) [56]. These latter steps were run using a Python script. The aLRT values for proteins characterized were ancStaphSrtA = 15.6525, ancStrepSrtA = 13.0091, ancNode-408 = 17.7893, ancNode-503 = 28.8809, and ancNode-547 = 17.5286. Programs were run using default parameters, unless otherwise noted.
Following immobilized metal affinity chromatography, the protein was concentrated using an Amicon Ultra-15 Centrifugal Filter Unit (10,000 NMWL) followed by size exclusion chromatography (SEC) using a HiLoad 16/600 Superdex 75 column (Cytiva), with SEC running buffer [0.05 M Tris pH 7.5, 0.15 M NaCl, 0.001 M TCEP]. Purified protein fractions corresponding to the monomeric peak were pooled and concentrated. Purity was assessed using SDS-PAGE. Protein concentrations were determined using theoretical extinction coefficients calculated using ExPASy ProtParam [57]. Protein not immediately used was flash-frozen in SEC running buffer and stored at −80 °C.
Fluorescence Assay for Sortase Activity. Model peptide substrates with the general structure Abz-LPATXG-K(Dnp) (Abz = 2-aminobenzoyl, Dnp = 2,4-dinitrophenyl) were synthesized and purified as previously described [14]. Reactions were analyzed using a Biotek Synergy H1 plate reader as previously described [14,15]. Briefly, reactions were performed in a 100 μL reaction volume consisting of 5 μM sortase, 50 μM peptide substrate, 5 mM hydroxylamine nucleophile, and 10% (v/v) 10x sortase reaction buffer (500 mM Tris pH 7.5, 1500 mM NaCl, and 100 mM CaCl2). The reactions were performed in triplicate and the fluorescence intensity of each well was measured at 2-min time intervals over a 2-hour period at room temperature (excitation = 320 nm, emission = 420 nm, and detector gain = 75). For each substrate sequence, the background fluorescence of the intact peptide in the absence of enzyme was subtracted from the observed experimental data. Background-corrected fluorescence data was then normalized to the fluorescence intensity of a benchmark reaction between wild-type saSrtA and Abz-LPATGG-K(Dnp), as previously described [14,15].
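The background subtraction and benchmark normalization described above amount to the following arithmetic, shown here as a hedged Python sketch. The function name `relative_activity`, the use of the final time point as the summary value, and all numerical readings are assumptions made for illustration and are not data from this study.

```python
import numpy as np


def relative_activity(sample, no_enzyme, benchmark, benchmark_no_enzyme):
    """Background-correct a fluorescence trace and normalize to a benchmark.

    Each argument is a sequence of fluorescence readings over time. The
    result is the corrected endpoint signal as a fraction of the corrected
    benchmark endpoint (assumed summary; other choices are possible).
    """
    corrected = np.asarray(sample, float) - np.asarray(no_enzyme, float)
    reference = (np.asarray(benchmark, float)
                 - np.asarray(benchmark_no_enzyme, float))
    return corrected[-1] / reference[-1]


# Hypothetical readings (arbitrary units), three time points per trace.
replicates = [[100, 400, 600], [100, 380, 580], [100, 420, 620]]
blank = [100, 100, 100]                      # peptide without enzyme
bench, bench_blank = [100, 700, 1100], [100, 100, 100]
values = [relative_activity(r, blank, bench, bench_blank) for r in replicates]
mean, sd = float(np.mean(values)), float(np.std(values, ddof=1))
```

The mean and sample standard deviation across replicates correspond to the bar heights and error bars of the kind plotted in Figure 4.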

Figure 1. Principal component analysis (PCA) of sortase superfamily reveals sequence variability in structurally conserved loops. (a) Hierarchical clustering of the sortase superfamily by Gaussian mixture model unsupervised classification on the PCA matrix distinguishes the known classes of sortases [36]. (b) Visualization of a subset of 9,427 of the high confidence sortase sequences plotted using principal components 1-3, to act as quality control of the data. Each sequence is colored by the class of sortase it is annotated as by UniProt, when available. An equivalent plot of all 39,188 sortase sequences is in Figure S1d. (c,d) The five characteristics assigned numerical values in the PCA are visualized by width (from 0 to 1) and color (where dark blue is a value closer to 0 and green indicates a value closer to 1) using PyMOL. The Streptococcus pyogenes SrtA structure (PDB ID 3FN5) is used as the model to show variance in a "typical Class A sortase." The catalytic residues (H142, C208, and R216) are shown as side chain sticks and colored by heteroatom. The three structurally conserved loops that are discussed in this work are labeled. We focused on variance near the active site here, but notably, there is also a relatively large degree of variance on the other side of the protein (also Figure S1d).

Figure 2. Constructed phylogenetic tree of SrtA sequences used for ancestral sequence reconstruction. The locations of ancStaphSrtA (also, inset) and ancStrepSrtA are indicated by red arrows. More ancestral proteins are indicated by asterisks; moving from left to right, these are ancNode-408, ancNode-503, and ancNode-547.

Figure 4. Relative enzyme activities for ancestral proteins. Substrate selectivity data for ancStaphSrtA (a) and ancStrepSrtA (b) proteins are shown as bar graphs. Substrate cleavage was monitored via an increase in fluorescence at 420 nm from reactions of fluorophore-quencher probes with the generic structure Abz-LPATXG-K(Dnp) (LPATX) in the presence of hydroxylamine. Bar graphs represent mean normalized fluorescence (± standard deviation) from at least three independent experiments, and average activity values over 10% relative activity are labeled. Assays are normalized against a previous set of reactions of saSrtA with LPATG [14], to ensure consistency across data sets. Wild-type saSrtA only recognizes LPATG, as previously shown [25].

Figure 5. Structural comparison of ancestral protein models with wild-type SrtA proteins. The amino acid differences between wild-type saSrtA/ancStaphSrtA (a) and wild-type spSrtA/ancStrepSrtA (b) are represented as side chain spheres and colored as labeled. All four proteins are shown in gray cartoon representation with identical side chains as sticks. Ligands are shown in black sticks and colored by heteroatom (C = black, O = red, N = blue) and are the peptidomimetic LPAT* (a) and LPATS (b). Relevant PDB ID codes are in parentheses. (c) Interactions mediated by β7-β8 residues described in the text are highlighted by color, and the X and checkmarks are used to indicate whether they are predicted to be present. The ancStrepSrtA homology model is in gray cartoon, with the LPATS peptide (from PDB ID 7S40) rendered as in (b). Residues implicated in each interaction are shown as sticks and colored by heteroatom: (1) β7-β8 loop with β6 position interaction (C = yellow), (2) β7-β8 intra-loop hydrogen bond (C = cyan), and (3) β7-β8 loop with β4-β5 position interaction (C = green).

Figure 6. Multiple sequence alignment of ancestral proteins used in this study. Multiple sequence alignment of the ancestral proteins used in this study reveals the largest sequence variations within the β4-β5 and β6-β7 loops. The loop regions are indicated in black boxes; moving left-to-right and top-to-bottom, these are the β3-β4, β4-β5, β6-β7, and β7-β8 loops. The His at the first position of the β4-β5 loop, and the Cys and Arg at the first and last positions of the β7-β8 loop, respectively, are the catalytic residues. Notably, the β7-β8 loop of ancStaphSrtA does not align well with the other sequences, due to its increased length; this feature is also true of saSrtA with other SrtA proteins (data not shown).

Table 1. Sequence identities of ancStaphSrtA and ancStrepSrtA with extant sequences. The number of identical residues out of total residues in the alignment is given in parentheses. (a) Staphylococcus SrtA sequence identities. (b) Streptococcus SrtA sequence identities.
Figure 3. Multiple sequence alignments of ancestral proteins with representative (a) Staphylococcus and (b) Streptococcus SrtA proteins. Multiple sequence alignments of example Staphylococcus and Streptococcus SrtA proteins reveal a number of regions with high similarity.

Table 2. Comparative pairwise sequence identities of ancestral proteins in this study.